Tuesday, October 4, 2022

Colleau's (2002) algorithm for dummies

 Colleau (1992) realized that computations associated with matrix the product Ax or similar products (in different notations) TDT'x , T'x , Tx etc, where T is a triangular matrix that in the i-th row contains the fractions of individual i that come from all preceding individuals, could be easily done by solving instead of computing the whole matrix A. As done for instance in Aguilar et al. 2011 ; this is in fact a key algorithm in implementation of ssGBLUP.

Whereas matrix T is a bit complicated to compute, it was realized (in the 70's ?) that $latex T^{-1} $latex had a very simple structure: $latex T^{-1}  = I -P $latex where P contains 0.5 in the (individual, sire) and (individual, dam) locations and 0 otherwise - see Quaas (1988) for explanations.

A word on notation: Colleau calls T the matrix that links animals to parents; Quaas calls it P. In my opinion Quaas notation is more popular and I stick to it.

The matrix of relationship can be shown to be

$latex A = (I - P)^{- 1}D{(I - P)^{- 1}}^{'} $latex

With inverse

$latex A^{- 1} = (I - P)^{'}D^{- 1}(I - P) $latex

Where $latex P $latex contains, in the i-th row, values of 0.5 for sire(i) and dam(i). This represents the expression $latex u_{i} = u_{s} + u_{d} + \phi $latex. Matrix $latex D $latex is diagonal and contains the variance of the mendelian sampling, i.e. $latex D_{ii} = 0.5 -0.25 \left( F_{s} + F_{d} \right) $latex with Meuwissen and Luo (1992) presenting, in the case of unknown ancestor, $latex s = 0 $latex  (or $latex d = 0 $latex ), their programming set $latex F_{0} = - 1 $latex.

To help ideas, a small pedigree due to Kempthorne is as follows :

A 0 0
B 0 0
D A B
E A D
F B E
Z A B

Renumbered as

1 0 0
2 0 0
3 1 2
4 1 2
5 1 3
6 2 5

The corresponding matrices are

$latex P = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0.5 & 0.5 & 0 & 0 & 0 & 0 \\ 0.5 & 0.5 & 0 & 0 & 0 & 0 \\ 0.5 & 0 & 0.5 & 0 & 0 & 0 \\ 0 & 0.5 & 0 & 0 & 0.5 & 0 \\ \end{pmatrix} $latex

$latex D = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0.5 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.5 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0.5 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0.4375 \\ \end{pmatrix} $latex

With $latex F = \begin{pmatrix} 0 & 0 & 0 & 0 & 0.25 & 0.125 \\ \end{pmatrix} $latex

And finally , A=

\[,1\] \[,2\] \[,3\] \[,4\] \[,5\] \[,6\]
\[1,\] 1.000 0.000 0.500 0.5 0.75 0.375
\[2,\] 0.000 1.000 0.500 0.5 0.25 0.625
\[3,\] 0.500 0.500 1.000 0.5 0.75 0.625
\[4,\] 0.500 0.500 0.500 1.0 0.50 0.500
\[5,\] 0.750 0.250 0.750 0.5 1.25 0.750
\[6,\] 0.375 0.625 0.625 0.5 0.75 1.125

The idea of Colleau is to use this decomposition to compute quickly products of the form $latex x = Av $latex solving the system of equations $latex A^{- 1}x = v $latex as $latex (I - P)^{'}D^{- 1}(I - P)x = v $latex from left to right in three steps:

1.  Solve $latex (I - P)^{'}a = v $latex

2.  Solve $latex D^{- 1}b = a $latex

3.  Solve $latex (I - P)x = b $latex

To solve (1) we use the special structure of

$latex (I - P)^{'} = \begin{pmatrix} 1 & 0 & - 0.5 & - 0.5 & - 0.5 & 0 \\ 0 & 1 & - 0.5 & - 0.5 & 0 & - 0.5 \\ 0 & 0 & 1 & 0 & - 0.5 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & - 0.5 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ \end{pmatrix} $latex

To solve the system

$latex \begin{pmatrix} 1 & 0 & - 0.5 & - 0.5 & - 0.5 & 0 \\ 0 & 1 & - 0.5 & - 0.5 & 0 & - 0.5 \\ 0 & 0 & 1 & 0 & - 0.5 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & - 0.5 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ \end{pmatrix} \begin{pmatrix} a_{1} \\ a_{2} \\ \ldots \\ \\ \\ a_{6} \\ \end{pmatrix} = \begin{pmatrix} v_{1} \\ v_{2} \\ \ldots \\  \\ \\ v_{6} \\ \end{pmatrix} $latex

Also, for each row, the locations of the $latex - 0.5 $latex  are simply its sire and dam. For instance, the solution of $latex a_{2} $latex is $latex a_{2} = v_{2} + 0.5(a_{3} + a_{4} + a_{6}) $latex, in other words, the value of $latex v_{2} $latex  plus 0.5 times the values of its offspring. Then, we can solve from the bottom to the top and adding "contributions" of +0.5 the value of "a" to the ancestors of animal i , and once we are in row i there is no more modifications to make to animal i **if ancestors are coded before offspring**. As follows:

a=0
do i=nanim,1,-1
    a(i)=a(i)+v(i)
    if (s(i)>0) a(s(i))=a(s(i))+0.5*a(i)
    if (d(i)>0) a(d(i))=a(d(i))+0.5*a(i)
enddo

To solve (2) there are two options, one is to precompute the diagonal of D and then

b=0
b=D*a

Another option is to compute D on the run :

! previously set F(0) to 0
do i=1,nanim
    dii=0.5-0.25*(F(s(i)+F(d(i))
    b(i)=a(i)*dii
enddo

To solve (3) we use the special structure of

$latex I - P = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\  - 0.5 & - 0.5 & 1 & 0 & 0 & 0 \\  - 0.5 & - 0.5 & 0 & 1 & 0 & 0 \\  - 0.5 & 0 & - 0.5 & 0 & 1 & 0 \\ 0 & - 0.5 & 0 & 0 & - 0.5 & 1 \\ \end{pmatrix} $latex

To solve the system

$latex \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\  - 0.5 & - 0.5 & 1 & 0 & 0 & 0 \\  - 0.5 & - 0.5 & 0 & 1 & 0 & 0 \\  - 0.5 & 0 & - 0.5 & 0 & 1 & 0 \\ 0 & - 0.5 & 0 & 0 & - 0.5 & 1 \\ \end{pmatrix}\begin{pmatrix} x_{1} \\ x_{2} \\ \ldots \\  \\  \\ x_{6} \\ \end{pmatrix} = \begin{pmatrix} b_{1} \\ b_{2} \\ \ldots \\  \\  \\ b_{6} \\ \end{pmatrix} $latex

Such that we can solve from the "top" to the "bottom".

x=0
do i=1,nanim
    x(i)=b(i)
    if(s(i)/=0) then
        x(i)=b(i)+0.5*x(s(i))    
    endif
    if(d(i)/=0) then
        x(i)=b(i)+0.5*x(d(i))    
    endif
enddo

with these three steps we are done.


Thursday, August 25, 2022

Compilation options

 Another reminder for myself. Compilation options for Intel Fortran and gfortran:

Intel Fortran:

Optimization-O3 

Debug-g -traceback -check all -fpe0 -check noarg_temp_created

(see here (in English) and here (in French) for meaning)

the option -o tells the name of the produced executable

i.e. the command is something like ifort -O3 myprog.f90 -o myprog

I often use also -heap-arrays  to avoid using unlimit to manipulate the stack size: see  here 

gfortran:

Optimization-O3 

Debug -g -Wall -fbounds-check -fbacktrace -ffpe-trap=invalid,zero,overflow

Detailed explanations here and here (page 24-27)

I often use the option -ffree-line-length-none to allow for lines of any length


Sunday, May 22, 2022

GWAS and QQplots and Genomic Control

 I rarely do GWAS, but from time to time we observe strange things. Some QQplots align nicely:


but others do not (this below is sheep):



We think that in reality, the distribution of the test statistic under the null (SNP effect estimate / s.e. ) may not in fact follow a normal distribution, but something else, due to violations of assumptions. As a aconsequence, p-values do not follow a uniform under H0. For instance, markers are in LD and they capture information from their neighbours, animals are related (even if this is considered), traits are selected, and, essentially, models are not perfect. But we don’t really know - this is just speculation.

A method for correction that is quite popular in Human Genetics, is called Genomic Control (Devlin and Roeder 1999, Devlin et al. 2004). The method was designed to deal with hidden relationships but it is useful for us too. This is a crude method (but very popular) that simply corrects the p-values to make the qqplot fit the expected lines.

So let’s say that x are your -log10(p-values). then you estimate the slope of x, for the points that seem to be under the null (for instance, x[x<=3] in R parlance ). This gives you a slope, say 0.8. Then you multiply all x by the inverse of this slope, e.g. x <- x/0.8  and x now contains “corrected” -log10p-values. By construction, the qqplot will align well.

I think that this is not very scientific but it is current practice.

Below an R program with simulated p-values (some of them frankly extreme) and in one of the qqplot functions I introduced Genomic control (lines “AL”).  At the end you obtain a Figure without:




 or with genomic control:





# example of genomic control
# AL, 17 Feb 2022
set.seed(1234)
myQQplot = function(pvalues){
  # agnostic qq plot using only observed data 
  # https://genome.sph.umich.edu/wiki/Code_Sample:_Generating_QQ_Plots_in_R#R_Base_Graphics
  my.pvalues=pvalues
  exp.pvalues<-(rank(my.pvalues, ties.method="first")+.5)/(length(my.pvalues)+1)
  par(pty="s")
  plot(-log10(exp.pvalues), -log10(my.pvalues), xlim=c(0,10),ylim=c(0,10))
  abline(0,1)
  par(pty="m")
}


# qqplot assuming that p-values follow a uniform distribution (what we want)
# https://genome.sph.umich.edu/wiki/Code_Sample:_Generating_QQ_Plots_in_R#R_Base_Graphics

library(lattice)
qqunif.plot<-function(pvalues,
                      GC=FALSE,
                      should.thin=T, thin.obs.places=2, thin.exp.places=2, 
                      xlab=expression(paste("Expected (",-log[10], " p-value)")),
                      ylab=expression(paste("Observed (",-log[10], " p-value)")), 
                      draw.conf=TRUE, conf.points=1000, conf.col="lightgray", conf.alpha=.05,
                      already.transformed=FALSE, pch=20, aspect="iso", prepanel=prepanel.qqunif,
                      par.settings=list(superpose.symbol=list(pch=pch)), ...) {
  
  
  #error checking
  if (length(pvalues)==0) stop("pvalue vector is empty, can't draw plot")
  if(!(class(pvalues)=="numeric" || 
       (class(pvalues)=="list" && all(sapply(pvalues, class)=="numeric"))))
    stop("pvalue vector is not numeric, can't draw plot")
  if (any(is.na(unlist(pvalues)))) stop("pvalue vector contains NA values, can't draw plot")
  if (already.transformed==FALSE) {
    if (any(unlist(pvalues)==0)) stop("pvalue vector contains zeros, can't draw plot")
  } else {
    if (any(unlist(pvalues)<0)) stop("-log10 pvalue vector contains negative values, can't draw plot")
  }
  
  
  grp<-NULL
  n<-1
  exp.x<-c()
  if(is.list(pvalues)) {
    nn<-sapply(pvalues, length)
    rs<-cumsum(nn)
    re<-rs-nn+1
    n<-min(nn)
    if (!is.null(names(pvalues))) {
      grp=factor(rep(names(pvalues), nn), levels=names(pvalues))
      names(pvalues)<-NULL
    } else {
      grp=factor(rep(1:length(pvalues), nn))
    }
    pvo<-pvalues
    pvalues<-numeric(sum(nn))
    exp.x<-numeric(sum(nn))
    for(i in 1:length(pvo)) {
      if (!already.transformed) {
        pvalues[rs[i]:re[i]] <- -log10(pvo[[i]])
        exp.x[rs[i]:re[i]] <- -log10((rank(pvo[[i]], ties.method="first")-.5)/nn[i])
      } else {
        pvalues[rs[i]:re[i]] <- pvo[[i]]
        exp.x[rs[i]:re[i]] <- -log10((nn[i]+1-rank(pvo[[i]], ties.method="first")-.5)/(nn[i]+1))
      }
    }
  } else {
    n <- length(pvalues)+1
    if (!already.transformed) {
      exp.x <- -log10((rank(pvalues, ties.method="first")-.5)/n)
      pvalues <- -log10(pvalues)
    } else {
      exp.x <- -log10((n-rank(pvalues, ties.method="first")-.5)/n)
    }
  }
  
  # AL GENOMIC CONTROL modification
  # we have pvalues wich are in fact -log10pvalues
  # and exp.x (expected -log10pvalues)
  # we estimate the regression of one on the other
  linReg=lm(pvalues ~ 0 + exp.x)
  print(" linear regression observed -log10pval on expected")
  print(summary(linReg))
  if(GC){
    # estimate coefficient under the null
    linReg=lm(pvalues[pvalues<=3] ~ 0 + exp.x[pvalues<=3])
    print(" linear regression observed -log10pval on expected ONLY for -log10pval <3")
    print(summary(linReg))
    b=linReg$coefficients[1]
    cat("b",b,"\n")
    #correct
    pvalues=pvalues/b
  }
  ## AL end of the modification
  
  #this is a helper function to draw the confidence interval
  panel.qqconf<-function(n, conf.points=1000, conf.col="gray", conf.alpha=.05, ...) {
    require(grid)
    conf.points = min(conf.points, n-1);
    mpts<-matrix(nrow=conf.points*2, ncol=2)
    for(i in seq(from=1, to=conf.points)) {
      mpts[i,1]<- -log10((i-.5)/n)
      mpts[i,2]<- -log10(qbeta(1-conf.alpha/2, i, n-i))
      mpts[conf.points*2+1-i,1]<- -log10((i-.5)/n)
      mpts[conf.points*2+1-i,2]<- -log10(qbeta(conf.alpha/2, i, n-i))
    }
    grid.polygon(x=mpts[,1],y=mpts[,2], gp=gpar(fill=conf.col, lty=0), default.units="native")
  }
  
  #reduce number of points to plot
  if (should.thin==T) {
    if (!is.null(grp)) {
      thin <- unique(data.frame(pvalues = round(pvalues, thin.obs.places),
                                exp.x = round(exp.x, thin.exp.places),
                                grp=grp))
      grp = thin$grp
    } else {
      thin <- unique(data.frame(pvalues = round(pvalues, thin.obs.places),
                                exp.x = round(exp.x, thin.exp.places)))
    }
    pvalues <- thin$pvalues
    exp.x <- thin$exp.x
  }
  gc()
  
  prepanel.qqunif= function(x,y,...) {
    A = list()
    A$xlim = range(x, y)*1.02
    A$xlim[1]=0
    A$ylim = A$xlim
    return(A)
  }
  
  #draw the plot
  xyplot(pvalues~exp.x, groups=grp, xlab=xlab, ylab=ylab, aspect=aspect,
         prepanel=prepanel, scales=list(axs="i"), pch=pch,
         panel = function(x, y, ...) {
           if (draw.conf) {
             panel.qqconf(n, conf.points=conf.points, 
                          conf.col=conf.col, conf.alpha=conf.alpha)
           };
           panel.xyplot(x,y, ...);
           panel.abline(0,1);
         }, par.settings=par.settings, ...
  )
}

# simulate p-values under the null
set.seed(1234)
pvalOK=runif(40000)
hist(pvalOK)
myQQplot(pvalOK)
qqunif.plot(pvalOK)

#simulate *wrong* p-values 
pvalNotOK=rbeta(40000,.5,2)
hist(pvalNotOK)
myQQplot(pvalNotOK)
qqunif.plot(pvalNotOK,GC=FALSE)
# genomic control on these p-values looks good !!
qqunif.plot(pvalNotOK,GC=TRUE)

Tuesday, April 12, 2022

Read binary Fortran files in Julia

For instance, the matrices G and A22 from blupf90 . Yes, we can, using FortranFiles:

using FortranFiles
Gfile=FortranFile("G")
# read header 
aa,i,val,j=read(Gfile,FString{10},Int32,Float64,Int32)
G=read(Gfile,(Float64,29138,29138))
mean(G)
close(Gfile)



Friday, February 18, 2022

join with R & tidyverse

 well, this is just a reminder and a place to look for the exact syntax


require(tidyverse)

require(ggplot2)

peds=read_table2("renadd05.ped",      

        col_names=c("nid","ns","nd","varMS","yob",

               "numberKnownParents","numberOfRecords",

                "progenyAsSire","progenyAsDam","origId"),

        col_types="iiidiiiiic"

)

# this file has BLUP_noUPG and BLUP_MF already joined

rels=read_table2("acc_bf90_both",

        col_names=TRUE,

        col_types="iiiddiiidd"

)

rels$nid=rels$levelNoUPG

relpeds= rels %>% inner_join(peds,by="nid")

write_delim(relpeds,"acc_bf90_both_ped")

Monday, January 17, 2022

check format of marker file for blupf90 programs

 The file with genotypes for blupf90 needs to be fixed width separated by spaces. For instance:

    toto 011220001121

  pepepe 012112011012

and for sure, the number of markers need to be the same. This means that all lines need to be the same length and the number of markers too, and the markers need to be right-aligned.

It is not obvious how to check this in a long file. I have found this:

awk 'BEGIN{FS=";"};{l0=length($0);  split($0,a," "); l2=length(a[2]); l1= l0-l2 ; print l2,l0}' markerfile | sort | uniq -c

where markerfile is a file with markers. The script first reads the whole line and computes its length. Then it splits (assuming there are only 2 columns) and checks the length of the second column. It gives something like this:

  29753 38523 38540

where 29753 is the number of lines that show the pattern next to it (38523 markers in a line of 38540 width). This (one format only) is fine.

If things are not properly done we can get this:

  27914 38523 38543

   1839 38523 38546

where there are always 38523 markers, but 27914 lines have a total length of 38543 and 1839 have a longer length of 38546. This (more than one format) is NOT fine and will not be read properly.