Monday, October 7, 2019

Sparse Cholesky decomposition in Julia

Julia has nice built-in sparse matrices and algorithms. I spent a whole afternoon trying to extract the sparse Cholesky factorization of a sparse matrix, but the result did not look like what I wanted: I was misunderstanding the decomposition. The documentation is not exceedingly clear, but apparently it has to be done this way:

julia> A=[  10.0   2.0  0.0  -2.0   0.0  -4.0   0.0   0.0;
         2.0   2.0  0.0  -2.0   0.0   0.0   0.0   0.0;
         0.0   0.0  2.0   0.0   0.0   0.0   0.0   0.0;
        -2.0  -2.0  0.0   6.0  -2.0  -2.0   0.0   0.0;
         0.0   0.0  0.0  -2.0  20.0   8.0  -8.0  -8.0;
        -4.0   0.0  0.0  -2.0   8.0   8.0  -4.0  -4.0;
         0.0   0.0  0.0   0.0  -8.0  -4.0   8.0   0.0;
         0.0   0.0  0.0   0.0  -8.0  -4.0   0.0   8.0];

julia> Asp=sparse(A);
julia> C=cholesky(Asp)
julia> Matrix(sparse(C.L)[C.p,:])
8×8 Array{Float64,2}:
  0.0       0.0       0.0           0.0   0.0       0.0      0.0      1.41421
  0.0       0.0       0.0          -2.0   1.41421  -1.41421  1.41421  0.0    
  0.0       0.0       0.0           0.0   1.41421   0.0      0.0      0.0    
  0.0       0.0      -1.0          -1.0  -1.41421   1.41421  0.0      0.0    
  0.0       2.82843   0.0           0.0   0.0       0.0      0.0      0.0    
 -1.41421  -1.41421   8.88178e-16   2.0   0.0       0.0      0.0      0.0    
  2.82843   0.0       0.0           0.0   0.0       0.0      0.0      0.0    
 -2.82843  -2.82843   2.0           0.0   0.0       0.0      0.0      0.0   

which is not lower triangular. Then I carefully read ?cholesky, which says something like:

cholesky(A; shift = 0.0, check = true, perm = nothing) -> CHOLMOD.Factor

  Compute the Cholesky factorization of a sparse positive definite matrix A.
  A must be a SparseMatrixCSC or a Symmetric/Hermitian view of a
  SparseMatrixCSC. ...

Reading on, it says that what is actually computed is the Cholesky factorization LL' of a permuted matrix, PAP'. The permutation minimizes the fill-in. An example showing that this is the case:

julia> L=sparse(C.L)
8×8 SparseMatrixCSC{Float64,Int64} with 19 stored entries:
  [1, 1]  =  2.82843
  [3, 1]  =  -2.82843
  [4, 1]  =  -1.41421
  [2, 2]  =  2.82843
...
julia> Matrix(L) # is lower triangular
8×8 Array{Float64,2}:
  2.82843   0.0       0.0           0.0   0.0       0.0      0.0      0.0    
  0.0       2.82843   0.0           0.0   0.0       0.0      0.0      0.0    
 -2.82843  -2.82843   2.0           0.0   0.0       0.0      0.0      0.0    
 -1.41421  -1.41421   8.88178e-16   2.0   0.0       0.0      0.0      0.0    
  0.0       0.0       0.0           0.0   1.41421   0.0      0.0      0.0    
  0.0       0.0      -1.0          -1.0  -1.41421   1.41421  0.0      0.0    
  0.0       0.0       0.0          -2.0   1.41421  -1.41421  1.41421  0.0    
  0.0       0.0       0.0           0.0   0.0       0.0      0.0      1.41421

julia> # is this the cholesky decomposition of permuted A? Yes, it is
       L*L' ≈ A[C.p,C.p]
true


We can explicitly recover P and construct the (no longer lower triangular) Cholesky factor of A, P'L, which multiplied by its transpose should give A back:

julia> P=sparse(1:8,C.p,ones(8))
julia> Matrix(P)
8×8 Array{Float64,2}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0

julia> # P'L
       Matrix(P'*L)
8×8 Array{Float64,2}:
  0.0       0.0       0.0          -2.0   1.41421  -1.41421  1.41421  0.0    
  0.0       0.0       0.0           0.0   1.41421   0.0      0.0      0.0    
  0.0       0.0       0.0           0.0   0.0       0.0      0.0      1.41421
  0.0       0.0      -1.0          -1.0  -1.41421   1.41421  0.0      0.0    
 -2.82843  -2.82843   2.0           0.0   0.0       0.0      0.0      0.0    
 -1.41421  -1.41421   8.88178e-16   2.0   0.0       0.0      0.0      0.0    
  0.0       2.82843   0.0           0.0   0.0       0.0      0.0      0.0    
  2.82843   0.0       0.0           0.0   0.0       0.0      0.0      0.0    

julia> # and we recover our initial A
       P'*L*L'*P ≈ Asp
true

The final point is that, contrary to my first attempt above, I recover P'L using not the permutation kept in the component p of the internal structure of C (which represents P), but its inverse permutation (invperm), which represents P':

julia> sparse(C.L)[invperm(C.p),:]
8×8 SparseMatrixCSC{Float64,Int64} with 19 stored entries:
  [5, 1]  =  -2.82843
  [6, 1]  =  -1.41421
...

julia> Matrix(sparse(C.L)[invperm(C.p),:]) ≈ Matrix(P'*L)
true
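A final practical note: if the goal is just to solve a linear system with A, there is no need to extract L or P at all, since the factorization object works directly with the backslash operator, permutation included. A minimal sketch:

julia> b = randn(8);

julia> x = C \ b;   # CHOLMOD solves A*x = b, handling the permutation internally

julia> Asp * x ≈ b
true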

Tuesday, September 24, 2019

inferences within Gibbs Sampler in gibbsf90test

Gibbsf90test is the newest Gibbs sampler of the blupf90 distribution. It is usually used to estimate variance components and sometimes to infer breeding values. However, thanks to the MCMC framework, it can also be used, for instance, to get estimates of the genetic trend with Bayesian uncertainty, or estimates of the contrast between two levels of the same fixed effect. This can be done by coupling the inner loop of the Gibbs sampler, which produces samples of the EBVs, variance components, etc., with an external program that computes samples of the quantity of interest from them.

We have to use the new option


OPTION external_halfway_script ./pepe 100

This says that every 100 iterations gibbsf90test stops and writes the current samples of the variances, unknowns, and thresholds to the file last_solutions.txt, which looks like:

$ head last_solutions.txt 
               10000           348226670           76922156
trait effect level  solution
  1  1       1   191.3774026      
  1  1       2   176.2276688      
  1  1       3   147.9162130      
  1  1       4   113.5460502      
  1  1       5   144.9539913      
  1  1       6   148.2475683      
  1  1       7   125.5390674      
  1  1       8   172.5408439      
$ tail last_solutions.txt 
  1  6   34584   3.789916994      
  1  6   34585   59.69961280      
  1  6   34586   37.09941679      
  1  6   34587  -34.31674133      
  1  6   34588   48.62475154      
  1  6   34589  -20.61125146      
  1  6   34590  -7.197029014      
round variances thresholds
     10000                   3

   530.7338772         591.4818928         1372.611864     

Then it waits for the external process named in the option (in this case pepe) to run, and then resumes sampling.

The program pepe is a shell script:


#!/bin/bash
# keep the sampled solutions for effect 5 (in this model, the animal genetic effect)
awk '$2==5' last_solutions.txt > ebvs
# select sires and put generation number in front of the ebvs
./put_gen_ebvs.awk males_per_gen ebvs > ebvs_with_gen
# average the ebvs by generation and append one line of means per Gibbs sample
./meansby ebvs_with_gen 4 5 >> trend


which, in this particular case, condenses the EBVs into average EBV per generation, like:


-10.1759 -23.0505 -20.9627 -17.4328 -5.93856 -2.26723 19.2128 25.6035 35.5568 38.4155 32.5992
-10.3355 -22.6277 -20.2801 -20.1763 -6.22339 -1.44617 14.995 23.444 33.3165 34.7167 49.2152
-15.2105 -24.0234 -17.4057 -17.352 -5.32282 -0.854277 19.2186 28.2536 36.6748 44.6192 42.8246
-10.912 -21.8507 -19.1843 -18.6449 -4.84831 -1.31681 16.9725 28.5895 37.5614 41.3514 37.848
-13.2666 -24.3615 -22.9572 -18.5257 -6.43787 -3.17327 15.8594 22.785 36.065 34.9368 34.9344
-18.4349 -23.5337 -22.4248 -17.3172 -8.06582 -4.8434 14.9129 22.3128 34.6 34.9195 37.5187
-13.485 -25.4537 -21.1887 -19.5909 -7.28584 -3.8915 17.3923 25.9329 35.2859 38.0994 64.6005


These are samples of the means per generation (each column is one generation, each row is one MCMC sample), so we can compute mean ± s.e. per generation. More interestingly, we can get the posterior distribution of the mean of generation 10 minus the mean of generation 5:


$ awk '{print $10-$5}' trend | ~/julia ~/save/scripts/myHist.jl 

                                                         
   [32.0, 34.0) ┤▇ 1                                       
   [34.0, 36.0) ┤▇▇▇▇ 3                                    
   [36.0, 38.0) ┤▇▇▇▇▇▇▇▇▇ 7                               
   [38.0, 40.0) ┤▇▇▇▇▇▇▇▇▇ 7                               
   [40.0, 42.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 18               
   [42.0, 44.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19              
   [44.0, 46.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 27   
   [46.0, 48.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12                       
   [48.0, 50.0) ┤▇▇▇▇▇▇▇▇ 6                                
                                                         
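The same samples can also be summarized numerically. A minimal Julia sketch, assuming the whitespace-separated file trend produced above:

using Statistics, DelimitedFiles
trend = readdlm("trend")               # rows = Gibbs samples, columns = generations
contrast = trend[:,10] .- trend[:,5]   # mean of generation 10 minus mean of generation 5
println("posterior mean = ", mean(contrast))
println("95% credible interval = ", quantile(contrast, [0.025, 0.975]))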

Thursday, September 19, 2019

plot histogram on command line

I find it boring (and slow) to open R to draw a histogram of my data. I had been looking for command-line histograms for some time, and I finally decided to use a small Julia script that uses UnicodePlots.

The script myHist.jl is as simple as this:

# read numbers from stdin, one per line
@time a=parse.(Float64,chomp.(readlines()))
using UnicodePlots
@time print(histogram(a),"\n")
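(If UnicodePlots is not installed yet, a one-time using Pkg; Pkg.add("UnicodePlots") in the Julia REPL takes care of it.)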


and it runs beautifully:

awk '{print $1}' ~/save/mtr_2015/renf90.dat | julia myHist.jl



  1.442402 seconds (7.46 M allocations: 402.056 MiB, 19.39% gc time)
                    ┌                                        ┐ 
   [   0.0,  500.0) ┤▇▇ 29256                                  
   [ 500.0, 1000.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 193081                     
   [1000.0, 1500.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 445996    
   [1500.0, 2000.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 454985   
   [2000.0, 2500.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 311294             
   [2500.0, 3000.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇ 166513                       
   [3000.0, 3500.0) ┤▇▇▇▇▇ 69611                               
   [3500.0, 4000.0) ┤▇▇ 23819                                  
   [4000.0, 4500.0) ┤ 6778                                     
   [4500.0, 5000.0) ┤ 1721                                     
   [5000.0, 5500.0) ┤ 367                                      
   [5500.0, 6000.0) ┤ 76                                       
   [6000.0, 6500.0) ┤ 13                                       
   [6500.0, 7000.0) ┤ 4                                        
   [7000.0, 7500.0) ┤ 0                                        
   [7500.0, 8000.0) ┤ 1                                        
                    └      

Wednesday, June 12, 2019

simulation of reduction of variance and Bulmer effect

As it is so difficult to understand the reduction of variance due to selection and the Bulmer effect, I just simulated them in Julia (code below). This is a rather crude simulation, but it gives a plot that agrees with expectations (the plot using Plotly is actually quite poor, so the code uses PyPlot).



# simulation of Bulmer effect

using LinearAlgebra
using Distributions
using Random
#using Plotly
using PyPlot

Random.seed!(1234);

nqtl=300
nanim=10000
nselect=Int(nanim/20)
h2=0.8
#additive genetic variance is ~1
varp=1/h2
vare=varp-1
ngen=100

print("Base Population")
# a~ N(0,1)
a=randn(nqtl);

# p~ unif
p=rand(nqtl);
sum2pq=sum(2 .* p .* (1 .-p))
a=a/sqrt(sum2pq);

# storage for QTL genotypes
geno=Array{Int64,3}(undef,nanim,nqtl,2);
Z=Array{Int64,2}(undef,nanim,nqtl);

# simulate base animals
for i in 1:nanim
    for j in 1:nqtl
        for k in 1:2
            geno[i,j,k]=rand(Binomial(1,p[j]),1)[1]
        end
    end
end

vargen=Array{Float64,1}(undef,ngen)

for igen in 1:ngen

    println("-----------------------")
    println("generation number=", igen)
    println("-----------------------")
    for i in 1:nanim
        Z[i,:]=geno[i,:,1]+geno[i,:,2]
    end
    tbv=Z*a;
    println("var tbv= ",var(tbv))
    vargen[igen]=var(tbv)

    # generate phenotype with h2
    y=tbv+randn(nanim)*sqrt(vare);
    println("realized var y= ",var(y))
    println("realized h2= ",var(tbv)/var(y))

    # returns position in ranking based on y
    ranking=sortperm(y,rev=true);
    # pick best animals
    best=ranking[1:nselect];

    tbvselect=tbv[best];
    println("var tbvselect= ",var(tbvselect))

    # now create next generation sampling 2 alleles at random from 
    # selected individuals (possibly with selfing)
    genonew=Array{Int64,3}(undef,nanim,nqtl,2);
    for i in 1:nanim;
        for j in 1:nqtl
            for k in 1:2
                ancestor=rand(1:nselect);
                orig=rand(1:2);
                genonew[i,j,k]=geno[ancestor,j,orig];
            end
        end
    end
    geno[:,:,:]=genonew #not the same as geno=genonew !!!
end
plot(vargen)
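As a rough cross-check of the printed "var tbvselect" in the first generation, this is my recollection of the classical truncation-selection prediction, worth verifying against Bulmer (1971) or Falconer & Mackay before trusting it:

# variance of breeding values among selected parents: varA*(1 - k*h2),
# where k = i*(i - x), i the selection intensity and x the truncation point
using Distributions
q = 1/20                        # proportion selected (nselect/nanim)
x = quantile(Normal(), 1 - q)   # truncation point on the standardized scale
i = pdf(Normal(), x) / q        # selection intensity
k = i * (i - x)
println("expected var tbvselect ≈ ", 1.0*(1 - k*0.8))   # varA ≈ 1, h2 = 0.8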



Tuesday, April 9, 2019

handling dates in python

I have this data set with elementary test-day records from Manech Tete Noire, also known as burubeltza:

64422506;01;45000130047;052;03;2014;1;26FEB2014;0;02APR2014
64422506;01;45000130047;052;03;2014;1;26FEB2014;0;05MAY2014
64422506;01;45000130047;052;03;2014;1;26FEB2014;0;05JUN2014
64422506;01;45000130047;052;03;2015;2;07DEC2014;0;30DEC2014

and I need to compute the difference between the date of lambing and the date of milk recording. I thought it would be easy using Python (3).

import sys
import datetime

First, read and parse the data:

fname=sys.argv[1]
fhand=open(fname)
for i,line in enumerate(fhand):
    Dacont = line.split(sep=";")[9]  # date of milk recording
    Datago = line.split(sep=";")[7]  # date of lambing

Then I need to convert these strings like 26FEB2014 into something meaningful using the datetime module, and compute the difference (still inside the loop):

    daysfrommisebas= datetime.datetime.strptime(Dacont.lower(),'%d%b%Y') - datetime.datetime.strptime(Datago.lower(), '%d%b%Y')
and finally print it, not forgetting to concatenate the strings using "+" and to remove the EOL in "line":


    print(line.rstrip('\n')+";"+str(daysfrommisebas.days))
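For the first record above, for instance, the difference between 02APR2014 and 26FEB2014 is 35 days, so ;35 gets appended to the line.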

renumf90 and preGSf90 with pedigree and genotypes only

Right, so you have genotypes and pedigree and you would like to use preGSf90 to build G and A22, or for quality control, or for whatever reason. Then you are told to use renumf90 first, but how can you use it with no phenotypes?

Well, you can just create your own fake ones. So, if your genotyped animals are in the file marker.geno.clean, which looks like this:

  64000670990546 12012020210211
     45214790003 12111120211101
     45214790004 22111020110102

you can simply create a random record for each animal, building a file with an overall mean, the animal ID, and the fake phenotype, using awk:


awk '{print 1,$1,rand()}' marker.geno.clean > phenotypes.txt
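The resulting phenotypes.txt has one fake record per genotyped animal (the rand() values will of course differ from run to run), looking something like:

1 64000670990546 0.840188
1 45214790003 0.394383
1 45214790004 0.783099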

and then use the basic renumf90 parameter file in https://artadia.blogspot.com/2019/03/basic-renumf90-parameter-file.html.

After that, you edit the renf90.par file to add the required OPTIONs. For instance, these are the options to get lots of information and to check the heritability of gene content:

OPTION msg 100
OPTION h2_gene_content

Thursday, March 7, 2019

basic renumf90 parameter file

DATAFILE
phenotypes.txt
TRAITS
3
WEIGHT(S)

RESIDUAL_VARIANCE  
 0.5 
EFFECT
1 cross alpha
EFFECT
2 cross alpha  #animal
RANDOM
animal
OPTIONAL

FILE
pedigree
FILE_POS
1 2 3 0 0
SNP_FILE
marker.geno.clean
PED_DEPTH
0
(CO)VARIANCES
    0.5