Friday, March 1, 2024

extract UPG allocation from renumf90 log

 So you want to know, from the output of renumf90 (say ren.log), how many animals were allocated to each UPG?

The output looks like this:

 Unknown parent group allocation

 Equation   Group       #Animals

 59815785       1   21349

...

 Max group = 424; Max UPG ID = 59816208

...

Use this in the command line:

$ sed -n '/allocation/,/Max\ group/p' ren.log | awk '$1 ~ /^[0-9]+$/' 

e.g.

 59815785       1   21349

 59815786       2    4615

explanations:

find between two patterns:

https://askubuntu.com/a/849016

check if the 1st column is a (positive) integer: 

https://stackoverflow.com/questions/28878995/check-if-a-field-is-an-integer-in-awk
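
The same filter can be sketched in Python, in case the counts are needed inside a script. This is only a minimal sketch: it assumes the log layout shown above (three integer columns between the "allocation" line and the "Max group" line), and the function name `upg_allocation` and the sample text are made up for illustration.

```python
import re

def upg_allocation(lines):
    """Return (equation, group, n_animals) tuples from a renumf90 log.

    Mimics the sed/awk pipeline: start at the first line containing
    'allocation', stop at the line containing 'Max group', and keep only
    the lines whose first field is a positive integer.
    """
    rows = []
    inside = False
    for line in lines:
        if not inside:
            # look for the start pattern; the start line itself is skipped
            inside = "allocation" in line
            continue
        if "Max group" in line:
            break
        fields = line.split()
        if fields and re.fullmatch(r"[0-9]+", fields[0]):
            # assumption: the table has three integer columns
            rows.append(tuple(int(f) for f in fields[:3]))
    return rows

# made-up excerpt with the same layout as the log above
sample = """\
 Unknown parent group allocation
 Equation   Group       #Animals
 59815785       1   21349
 59815786       2    4615
 Max group = 424; Max UPG ID = 59816208
"""
for equation, group, n_animals in upg_allocation(sample.splitlines()):
    print(equation, group, n_animals)
```

To read a real log, pass the open file instead of the sample, e.g. `with open("ren.log") as fh: rows = upg_allocation(fh)`.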



Tuesday, January 23, 2024

Dictionary of arrays in Julia and breed fractions

 So in Julia I want to create a dictionary (hash table or associative array) of arrays to store breed composition in dairy cattle. It took me a while but I found out how to declare it:

julia> a=Dict{String,Array{Float64,1}}()

Imagine the following pedigree

A 0 0 Holstein
B 0 0 Jersey
C A B
D A C

then this Julia script computes breed fractions

#=
A 0 0 Holstein
B 0 0 Jersey
C A B
D A C
=#

breedcomp=Dict{String,Array{Float64,1}}()

#purebred founders
breedcomp["A"]=[1.0,0.0]
breedcomp["B"]=[0.0,1.0]
#rest of pedigree
breedcomp["C"]=0.5*(breedcomp["A"]+breedcomp["B"])
breedcomp["D"]=0.5*(breedcomp["A"]+breedcomp["C"])

display(breedcomp)


Dict{String, Vector{Float64}} with 4 entries:

  "B" => [0.0, 1.0]

  "A" => [1.0, 0.0]

  "C" => [0.5, 0.5]

  "D" => [0.75, 0.25]

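For a longer pedigree the same recursion can be run in a loop. Here is a minimal sketch, written in Python rather than Julia, assuming that the pedigree is sorted with parents before offspring, that founders are purebred, and that every other animal has both parents known. The names `breed_fractions`, `pedigree` and `breeds` are made up for illustration.

```python
def breed_fractions(pedigree, breeds):
    """pedigree: list of (animal, sire, dam, breed); "0" is an unknown parent.
    breeds: ordered list of breed names. Parents must precede offspring."""
    comp = {}
    for animal, sire, dam, breed in pedigree:
        if sire == "0" and dam == "0":
            # purebred founder: 1.0 for its own breed, 0.0 elsewhere
            comp[animal] = [1.0 if b == breed else 0.0 for b in breeds]
        else:
            # average of the two parental compositions
            comp[animal] = [0.5 * (s + d) for s, d in zip(comp[sire], comp[dam])]
    return comp

pedigree = [
    ("A", "0", "0", "Holstein"),
    ("B", "0", "0", "Jersey"),
    ("C", "A", "B", None),
    ("D", "A", "C", None),
]
comp = breed_fractions(pedigree, ["Holstein", "Jersey"])
for animal, fractions in comp.items():
    print(animal, fractions)
```
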
Monday, October 23, 2023

SNP effects from Single Step GBLUP with APY

 APY is a technique that allows representing the inverse of a genomic relationship matrix in a sparse format by choosing a "core" of animals (here and here). The authoritative guide to APY is Bermann et al. 2022: this paper

One of the key aspects of genomic models is the ability to estimate SNP effects and then apply them to newly genotyped animals, which is known as Indirect Predictions. Matias and I found out that it's easier than we thought as shown here.

Among many other things Bermann et al. show that one can write indirect predictions of "non-core" animals as (I use the original equation numbering)

eq. (10) : $latex \mathbf{{u}_n}= \mathbf{Z}_n \mathbf{Z}'_c (\mathbf{Z}_c \mathbf{Z}'_c)^{-1} \mathbf{Z}_c  \mathbf{{a}} + \boldsymbol{\xi} $latex

where $latex \mathbf{{a}} $latex are SNP effects and $latex \boldsymbol{\xi} $latex is an error term that does not depend on $latex \mathbf{{u}_c} $latex. 

we obtain SNP effects from eq. 21 in Bermann et al:

$latex \mathbf{\hat{a}}=k \mathbf{Z}'_c \mathbf{G}_{cc}^{-1} \mathbf{\hat{u}}_{c}  $latex

we plug that into (10) and expand:

$latex \mathbf{\hat{u}_n}= \mathbf{Z}_n \mathbf{Z}'_c (\mathbf{Z}_c \mathbf{Z}'_c)^{-1} \mathbf{Z}_c \mathbf{\hat{a}} =k \mathbf{Z}_n \mathbf{Z}'_c (k \mathbf{Z}_c \mathbf{Z}'_c)^{-1} \mathbf{Z}_c \mathbf{\hat{a}} $latex

we substitute for $latex \mathbf{\hat{a}} $latex:

$latex \mathbf{\hat{u}_n} =k \mathbf{Z}_n \mathbf{Z}'_c (k \mathbf{Z}_c \mathbf{Z}'_c)^{-1} \mathbf{Z}_c k \mathbf{Z}'_c \mathbf{G}_{cc}^{-1} \mathbf{\hat{u}}_{c}=k \mathbf{Z}_n \mathbf{Z}'_c \mathbf{G}_{cc}^{-1} \mathbf{\hat{u}}_{c} =  \mathbf{Z}_n \mathbf{\hat{a}} $latex

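which is the original eq. 21. This is because $latex \mathbf{\hat{a}} $latex lies in the column space of $latex \mathbf{Z}'_c $latex, so the projection matrix $latex \mathcal{P} = \mathbf{Z}'_c (\mathbf{Z}_c \mathbf{Z}'_c)^{-1} \mathbf{Z}_c $latex leaves it unchanged: $latex \mathcal{P}\mathbf{\hat{a}} = \mathbf{\hat{a}} $latex.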

Finally, obtaining Indirect Predictions from APY is dead simple and intuitive:

  • $latex \mathbf{\hat{a}}=k \mathbf{Z}'_c \mathbf{G}_{cc}^{-1} \mathbf{\hat{u}}_{c}  $latex
  • $latex \mathbf{\hat{u}_n} =  \mathbf{Z}_n \mathbf{\hat{a}} $latex
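
The projection argument can be checked numerically on a toy example. The sketch below assumes $latex \mathbf{G}_{cc} = k \mathbf{Z}_c \mathbf{Z}'_c $latex, as the derivation above does; the genotypes, the value of k and the core breeding values are all made up, and the small matrix helpers are only there to keep the example free of dependencies.

```python
from fractions import Fraction as Fr

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def scale(k, A):
    return [[k * a for a in row] for row in A]

def inv2(M):
    # inverse of a 2 x 2 matrix (enough for two core animals)
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# made-up centred genotypes: 2 core animals, 1 non-core animal, 3 SNPs
Zc = [[Fr(1), Fr(0), Fr(-1)], [Fr(0), Fr(1), Fr(1)]]
Zn = [[Fr(1), Fr(1), Fr(0)]]
k = Fr(1, 2)               # arbitrary scaling constant
uc = [[Fr(3)], [Fr(-1)]]   # made-up breeding values of the core animals

ZcZct = matmul(Zc, transpose(Zc))
Gcc = scale(k, ZcZct)
# SNP effects: a_hat = k Zc' Gcc^{-1} u_c
a_hat = scale(k, matmul(transpose(Zc), matmul(inv2(Gcc), uc)))
# projection matrix P = Zc' (Zc Zc')^{-1} Zc
P = matmul(transpose(Zc), matmul(inv2(ZcZct), Zc))

direct = matmul(Zn, a_hat)                  # Zn a_hat
projected = matmul(Zn, matmul(P, a_hat))    # Zn P a_hat
print(direct == projected, matmul(P, a_hat) == a_hat)
```

Exact fractions are used so that the two routes can be compared with `==` without worrying about rounding.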



Thursday, May 25, 2023

compiling macs in MacBook with Ventura

 I wanted to compile the coalescent simulator macs on a Mac with an M2 chip running Ventura. This was of course the beginning of a great adventure. I ended up doing the following:


- install library boost using homebrew

- dig out the path for the boost library e.g. as here, which turned out to be /opt/homebrew/Cellar/boost/1.81.0_1/include

- finally, modify the makefile as follows

# compile options
CFLAGS = -Wall -g
#CFLAGS = -Wall -O3
# Add location of any library locations below with -L
LINKFLAGS =
#LINKFLAGS = -static

# compiler
CC = g++

# libraries. For a local Boost installation
# Example:
#LIB = -I /Users/garychen/software/boost_1_36_0
# Default:
#LIB = -I .
LIB = -I /opt/homebrew/Cellar/boost/1.81.0_1/include

# simulator name
SIM=macs

OBJS = simulator.o algorithm.o datastructures.o

$(SIM) : $(OBJS)
	$(CC) -o $(SIM) $(OBJS) $(LINKFLAGS)

simulator.o: simulator.cpp simulator.h
	$(CC) $(CFLAGS) $(LIB) -c $<

algorithm.o: algorithm.cpp simulator.h
	$(CC) $(CFLAGS) $(LIB) -c $<

datastructures.o: datastructures.cpp simulator.h
	$(CC) $(CFLAGS) $(LIB) -c $<

Thursday, February 2, 2023

draw many elements julia

using Distributions, Statistics

# a vector of frequencies

p=rand(Beta(2,2),10)

# a vector of "distributions"

a=Binomial.(2,p)

# a vector of vectors

 b=rand.(a,1)

# collapsed (row vector)

reduce(hcat,b)


# draw 5 animals at once


b=rand.(a,5)

# collapse

bb=reduce(hcat,b)


#compute frequencies

pest = mean(bb,dims=1)/2

Tuesday, October 4, 2022

Colleau's (2002) algorithm for dummies

 Colleau (2002) realized that computations associated with the matrix product Ax or similar products (in different notations) TDT'x, T'x, Tx etc., where T is a triangular matrix that in the i-th row contains the fractions of individual i that come from all preceding individuals, could easily be done by solving a system of equations instead of computing the whole matrix A. This is done for instance in Aguilar et al. 2011; it is in fact a key algorithm in the implementation of ssGBLUP.

Whereas matrix T is a bit complicated to compute, it was realized (in the 70's ?) that $latex T^{-1} $latex had a very simple structure: $latex T^{-1}  = I -P $latex where P contains 0.5 in the (individual, sire) and (individual, dam) locations and 0 otherwise - see Quaas (1988) for explanations.

A word on notation: Colleau calls T the matrix that links animals to parents; Quaas calls it P. In my opinion Quaas's notation is more popular and I stick to it.

The matrix of relationship can be shown to be

$latex A = (I - P)^{- 1}D{(I - P)^{- 1}}^{'} $latex

With inverse

$latex A^{- 1} = (I - P)^{'}D^{- 1}(I - P) $latex

Where $latex P $latex contains, in the i-th row, values of 0.5 in the columns of sire(i) and dam(i). This represents the expression $latex u_{i} = 0.5 u_{s} + 0.5 u_{d} + \phi $latex. Matrix $latex D $latex is diagonal and contains the variance of the Mendelian sampling, i.e. $latex D_{ii} = 0.5 -0.25 \left( F_{s} + F_{d} \right) $latex. For an unknown parent, $latex s = 0 $latex (or $latex d = 0 $latex), Meuwissen and Luo (1992) set $latex F_{0} = - 1 $latex in their program, so that the same formula applies.

To fix ideas, a small pedigree due to Kempthorne is as follows:

A 0 0
B 0 0
D A B
E A D
F B E
Z A B

Renumbered as

1 0 0
2 0 0
3 1 2
4 1 2
5 1 3
6 2 5

The corresponding matrices are

$latex P = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0.5 & 0.5 & 0 & 0 & 0 & 0 \\ 0.5 & 0.5 & 0 & 0 & 0 & 0 \\ 0.5 & 0 & 0.5 & 0 & 0 & 0 \\ 0 & 0.5 & 0 & 0 & 0.5 & 0 \\ \end{pmatrix} $latex

$latex D = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0.5 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.5 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0.5 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0.4375 \\ \end{pmatrix} $latex

With $latex F = \begin{pmatrix} 0 & 0 & 0 & 0 & 0.25 & 0.125 \\ \end{pmatrix} $latex
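
As a quick check, the diagonal of D can be recomputed from the renumbered pedigree and the inbreeding coefficients above. This is a minimal sketch in Python; the arrays carry a dummy slot 0 for the unknown parent (my own layout choice), with F[0] = -1 following the Meuwissen and Luo convention.

```python
# renumbered pedigree of the example; index 0 is a dummy slot for "unknown"
sire = [0, 0, 0, 1, 1, 1, 2]
dam = [0, 0, 0, 2, 2, 3, 5]
# inbreeding coefficients, with F[0] = -1 for the unknown parent
F = [-1.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.125]

# D_ii = 0.5 - 0.25 * (F_sire + F_dam)
D = [0.5 - 0.25 * (F[sire[i]] + F[dam[i]]) for i in range(1, 7)]
print(D)
```

With F[0] = -1 the founders get a value of 1 and an animal with one unknown parent would get 0.75 - 0.25 times the inbreeding of the known parent, without any special case in the code.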

And finally, A =

      [,1]  [,2]  [,3] [,4] [,5]  [,6]
[1,] 1.000 0.000 0.500  0.5 0.75 0.375
[2,] 0.000 1.000 0.500  0.5 0.25 0.625
[3,] 0.500 0.500 1.000  0.5 0.75 0.625
[4,] 0.500 0.500 0.500  1.0 0.50 0.500
[5,] 0.750 0.250 0.750  0.5 1.25 0.750
[6,] 0.375 0.625 0.625  0.5 0.75 1.125
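
For a pedigree this small, A can also be built directly with the classical tabular method, which is a different technique from Colleau's and only serves here to double-check the numbers. A minimal sketch in Python, assuming parents are numbered before offspring and using an extra row and column 0 of zeros for the unknown parent (my own layout choice):

```python
sire = [0, 0, 0, 1, 1, 1, 2]
dam = [0, 0, 0, 2, 2, 3, 5]
n = 6

# row/column 0 stays at zero and stands for the unknown parent
A = [[0.0] * (n + 1) for _ in range(n + 1)]
for j in range(1, n + 1):
    s, d = sire[j], dam[j]
    for i in range(1, j):
        # relationship with an older animal: average over the parents of j
        A[i][j] = A[j][i] = 0.5 * (A[i][s] + A[i][d])
    # diagonal: 1 + inbreeding, i.e. half the relationship between parents
    A[j][j] = 1.0 + 0.5 * A[s][d]

for row in A[1:]:
    print(" ".join(f"{a:5.3f}" for a in row[1:]))

# inbreeding coefficients are the diagonal minus 1
F = [A[i][i] - 1.0 for i in range(1, n + 1)]
```
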

The idea of Colleau is to use this decomposition to quickly compute products of the form $latex x = Av $latex by solving the system of equations $latex A^{- 1}x = v $latex, written as $latex (I - P)^{'}D^{- 1}(I - P)x = v $latex, from left to right in three steps:

1.  Solve $latex (I - P)^{'}a = v $latex

2.  Solve $latex D^{- 1}b = a $latex

3.  Solve $latex (I - P)x = b $latex

To solve (1) we use the special structure of

$latex (I - P)^{'} = \begin{pmatrix} 1 & 0 & - 0.5 & - 0.5 & - 0.5 & 0 \\ 0 & 1 & - 0.5 & - 0.5 & 0 & - 0.5 \\ 0 & 0 & 1 & 0 & - 0.5 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & - 0.5 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ \end{pmatrix} $latex

To solve the system

$latex \begin{pmatrix} 1 & 0 & - 0.5 & - 0.5 & - 0.5 & 0 \\ 0 & 1 & - 0.5 & - 0.5 & 0 & - 0.5 \\ 0 & 0 & 1 & 0 & - 0.5 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & - 0.5 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ \end{pmatrix} \begin{pmatrix} a_{1} \\ a_{2} \\ \ldots \\ \\ \\ a_{6} \\ \end{pmatrix} = \begin{pmatrix} v_{1} \\ v_{2} \\ \ldots \\  \\ \\ v_{6} \\ \end{pmatrix} $latex

The matrix is upper triangular and, for each column i, the locations of the $latex - 0.5 $latex are simply the sire and dam of animal i; equivalently, in each row the $latex - 0.5 $latex sit in the columns of the offspring of that animal. For instance, the solution of $latex a_{2} $latex is $latex a_{2} = v_{2} + 0.5(a_{3} + a_{4} + a_{6}) $latex, in other words, the value of $latex v_{2} $latex plus 0.5 times the values of its offspring. Then we can solve from the bottom to the top, adding "contributions" of +0.5 times the value of a(i) to the parents of animal i; once we reach row i there are no more modifications to make to animal i **if ancestors are coded before offspring**. As follows:

a=0
do i=nanim,1,-1
    a(i)=a(i)+v(i)
    if (s(i)>0) a(s(i))=a(s(i))+0.5*a(i)
    if (d(i)>0) a(d(i))=a(d(i))+0.5*a(i)
enddo
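
The same backward pass can be sketched in Python and checked against the system it is meant to solve. The right-hand side v is made up, and the arrays carry a dummy slot 0 for the unknown parent (my own layout choice).

```python
sire = [0, 0, 0, 1, 1, 1, 2]
dam = [0, 0, 0, 2, 2, 3, 5]
n = 6
v = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # index 0 unused; values made up

# backward pass: a[i] is final once all the offspring of i have been visited
a = [0.0] * (n + 1)
for i in range(n, 0, -1):
    a[i] += v[i]
    if sire[i] > 0:
        a[sire[i]] += 0.5 * a[i]
    if dam[i] > 0:
        a[dam[i]] += 0.5 * a[i]

# check: row i of (I - P)' a is a[i] minus half the a's of the offspring of i
check = a[:]
for j in range(1, n + 1):
    for parent in (sire[j], dam[j]):
        if parent > 0:
            check[parent] -= 0.5 * a[j]

print(a[1:])
print(check[1:] == v[1:])
```

Here a[2] comes out as v[2] + 0.5 * (a[3] + a[4] + a[6]), which is the hand-written solution for animal 2 given above.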

To solve (2) there are two options, one is to precompute the diagonal of D and then

b=0
b=D*a

Another option is to compute D on the run :

! previously set F(0) to -1 (unknown parent)
do i=1,nanim
    dii=0.5-0.25*(F(s(i))+F(d(i)))
    b(i)=a(i)*dii
enddo

To solve (3) we use the special structure of

$latex I - P = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\  - 0.5 & - 0.5 & 1 & 0 & 0 & 0 \\  - 0.5 & - 0.5 & 0 & 1 & 0 & 0 \\  - 0.5 & 0 & - 0.5 & 0 & 1 & 0 \\ 0 & - 0.5 & 0 & 0 & - 0.5 & 1 \\ \end{pmatrix} $latex

To solve the system

$latex \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\  - 0.5 & - 0.5 & 1 & 0 & 0 & 0 \\  - 0.5 & - 0.5 & 0 & 1 & 0 & 0 \\  - 0.5 & 0 & - 0.5 & 0 & 1 & 0 \\ 0 & - 0.5 & 0 & 0 & - 0.5 & 1 \\ \end{pmatrix}\begin{pmatrix} x_{1} \\ x_{2} \\ \ldots \\  \\  \\ x_{6} \\ \end{pmatrix} = \begin{pmatrix} b_{1} \\ b_{2} \\ \ldots \\  \\  \\ b_{6} \\ \end{pmatrix} $latex

Such that we can solve from the "top" to the "bottom".

x=0
do i=1,nanim
    x(i)=b(i)
    ! add half of the sire and dam solutions, which are already computed
    if(s(i)/=0) then
        x(i)=x(i)+0.5*x(s(i))
    endif
    if(d(i)/=0) then
        x(i)=x(i)+0.5*x(d(i))
    endif
enddo

with these three steps we are done.
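
Putting the three steps together, here is a minimal sketch in Python, checked against the Kempthorne example: multiplying by a vector with a single 1 in position 6 should return column 6 of A. The function name `colleau_product` and the 1-based arrays with a dummy slot 0 (where F[0] = -1) are my own choices for illustration.

```python
def colleau_product(sire, dam, F, v):
    """Return x = A v without building A.

    Arrays are 1-based; index 0 is the unknown parent, with F[0] = -1.
    Parents must be numbered before their offspring.
    """
    n = len(sire) - 1
    # step 1: solve (I - P)' a = v, from the last animal to the first
    a = [0.0] * (n + 1)
    for i in range(n, 0, -1):
        a[i] += v[i]
        if sire[i] > 0:
            a[sire[i]] += 0.5 * a[i]
        if dam[i] > 0:
            a[dam[i]] += 0.5 * a[i]
    # step 2: b = D a, with D computed on the run
    b = [0.0] * (n + 1)
    for i in range(1, n + 1):
        b[i] = (0.5 - 0.25 * (F[sire[i]] + F[dam[i]])) * a[i]
    # step 3: solve (I - P) x = b, from the first animal to the last;
    # x[0] stays at zero, so unknown parents contribute nothing
    x = [0.0] * (n + 1)
    for i in range(1, n + 1):
        x[i] = b[i] + 0.5 * (x[sire[i]] + x[dam[i]])
    return x

sire = [0, 0, 0, 1, 1, 1, 2]
dam = [0, 0, 0, 2, 2, 3, 5]
F = [-1.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.125]
v = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]  # picks out column 6 of A
x = colleau_product(sire, dam, F, v)
print(x[1:])
```

Each step is a single pass over the pedigree, so the cost grows linearly with the number of animals instead of with the size of A.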


Thursday, August 25, 2022

Compilation options

 Another reminder for myself. Compilation options for Intel Fortran and gfortran:

Intel Fortran:

Optimization: -O3

Debug: -g -traceback -check all -fpe0 -check noarg_temp_created

(see here (in English) and here (in French) for meaning)

the option -o tells the name of the produced executable

i.e. the command is something like ifort -O3 myprog.f90 -o myprog

I often use also -heap-arrays  to avoid using unlimit to manipulate the stack size: see  here 

gfortran:

Optimization: -O3

Debug: -g -Wall -fbounds-check -fbacktrace -ffpe-trap=invalid,zero,overflow

Detailed explanations here and here (page 24-27)

I often use the option -ffree-line-length-none to allow for lines of any length