Tuesday, November 27, 2018

How to split a large SNP file into individual chunks


To convert this (one animal per line, UGA format as explained here and here)


         1 012202001001000020202000
         2 012202001001000021211110
         3 012202001001000020202000
         4 012211001100010020202000
         5 111211101100010110111001
         6 002202000000000021211110
         7 012211001100010021211110
         8 022220002200020020202000
         9 012202001001000020202000
        10 111211101100010110111001

into as many individual SNP files as there are loci, e.g.


zcat singlemarker/x000001.gz | head
        1   1
        2   1
        3   1
        4   1
        5   2
        6   1
        7   1
        8   1
        9   1
       10   2

zcat singlemarker/x000002.gz | head
         1   2
         2   2
         3   2
         4   2
         5   2
         6   1
         7   2
         8   3
         9   2
        10   2

How to do it efficiently?

1-transpose the file (in my case via a Fortran program) to:

0000100001
1111101211
2222122221
2222222222
0001101201


so that the first line is the first marker, the second line the second marker, and so on
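For small files, the transpose can also be done with a throwaway awk sketch (a hypothetical stand-in for the Fortran program, assuming one animal per line with the genotype string in the second column; it holds everything in memory, so for 700K markers the Fortran route is better):

```shell
# Hypothetical awk transpose: one animal per line in, one marker per line out
printf '%s\n' '1 012' '2 120' |
awk '{
  n = length($2)
  for (j = 1; j <= n; j++)
    row[j] = row[j] substr($2, j, 1)   # append this animal code to marker j
  if (n > m) m = n
}
END { for (j = 1; j <= m; j++) print row[j] }'
# prints:
# 01
# 12
# 20
```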

2-use the extraordinary GNU split to split into files, one line (=one marker) at a time:


split -l 1 -d BB.700K.gen_transposed -a 6 --numeric-suffixes=1 --filter='gzip >$FILE.gz'

This command splits one line at a time (-l 1), creates a series of files with numeric suffixes of width 6 (-a 6) starting at 1 (--numeric-suffixes=1), like x000002.gz, and finally "pipes" each chunk through gzip to obtain compressed files (--filter='gzip >$FILE.gz').
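The same invocation can be checked on a toy file (file name and contents made up for the demo):

```shell
# Toy demo of the split command above (GNU split + gzip), in a scratch dir
cd "$(mktemp -d)"
printf '%s\n' 00101 11021 > toy_transposed
split -l 1 -d toy_transposed -a 6 --numeric-suffixes=1 --filter='gzip > $FILE.gz'
ls                # toy_transposed  x000001.gz  x000002.gz
zcat x000001.gz   # 00101
```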

Still, the files look like this:

0000100001

1111101211

2222122221

3-use GNU fold to insert a newline after each character:

zcat singlemarker/x000001.gz | fold -w 1
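fold alone gives only the genotype codes, one per line; to reproduce the two-column target format, the row number can stand in for the animal ID (an assumption for this sketch - real IDs would come from column 1 of the original file):

```shell
# Sketch: one genotype character per line, numbered 1..n as a stand-in for IDs
printf '0000100001' |            # contents of one split file (one marker)
fold -w 1 |
awk '{ printf "%10d %3s\n", NR, $0 }'
# prints 10 lines, e.g. line 5 is:
#          5   1
```

(Note that the example per-marker files at the top are also recoded from 0/1/2 to 1/2/3; that recoding step is not shown here.)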

Bonus: formatting numbers in a bash script with printf

from https://linuxconfig.org/bash-printf-syntax-basics-with-examples


for i in $( seq 1 10 ); do printf "%03d\t" "$i"; done
001     002     003     004     005     006     007     008     009     010

Wednesday, August 8, 2018

submitting parallel jobs in genotoul / genologin with slurm

In another post I showed how to submit parallel jobs in Genotoul. Genotoul now has a new cluster (genologin) and a new job scheduler, slurm, instead of SGE on the old "genotoul" cluster.

Here is how to submit a parallel job:

alegarra@genologin2 ~/work/mice_tpb $ cat run.sh
#!/bin/bash
#SBATCH --mem=1G
#SBATCH --cpus-per-task=10 
#SBATCH --time=10:20:00 
module load mpi/openmpi-2.1.2-intel2018.0.128
module load compiler/intel-2018.0.128
export OMP_NUM_THREADS=10
preGSf90 mice_G.par

Note the "module load" lines, which were not needed before. This is because libraries are now mostly dynamic.

Edit: look at https://hpcrcf.atlassian.net/wiki/spaces/TCP/pages/7287288/How-to+Submit+an+OpenMP+Job

Saturday, July 7, 2018

Accuracies using blupf90 suite including accf90GS


The blupf90 suite of programs has a few options to compute individual reliabilities.

The first and exact one is using blupf90 itself, by inversion of the Mixed Model Equations, using

OPTION sol se

This is however only doable for small-medium data sets (up to maybe 3 million equations).
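For reference, the reliability that corresponds to OPTION sol se comes from the prediction error variance; a standard expression (ignoring inbreeding - with inbreeding the denominator becomes $(1+F_i)\sigma^2_u$) is:

$$rel_i = 1 - \frac{PEV_i}{\sigma^2_u}, \qquad PEV_i = SE_i^2$$

where $SE_i$ is the standard error reported for animal $i$.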

The second one, which is more appropriate for threshold traits, is using a Gibbs sampler (for instance, gibbsf90test) and computing the posterior variance of the EBVs using

OPTION fixed_var all

For an example, see here. Still, this requires large memory and lots of patience.

The third one, which is needed for large data sets, is by approximation. The program that implements this, now including genomic information, is accf90GS, and it is only available under special agreement. (If you can use blupf90 and invert the MME, do it - it is much better than using accf90GS.) Several approximations exist; the ones used in this program combine essentially three papers:


  • Misztal, I., & Wiggans, G. R. (1988). Approximation of prediction error variance in large-scale animal models. Journal of Dairy Science, 71, 27-32.
  • Sanchez, J. P., Misztal, I., & Bertrand, J. K. (2008). Evaluation of methods for computing approximate accuracies of predicted breeding values in maternal random regression models for growth traits in beef cattle. Journal of Animal Science, 86(5), 1057-1066.
  • Misztal, I., Tsuruta, S., Aguilar, I., Legarra, A., VanRaden, P. M., & Lawlor, T. J. (2013). Methods to approximate reliabilities in single-step genomic evaluation. Journal of Dairy Science, 96(1), 647-654.

Running accf90GS is rather tricky. In my case, it gets even worse because I do not use renumf90. So this is how I do it:

First, I run preGSf90 to get the diagonal of the matrix $G^{-1}-A_{22}^{-1}$ using, in addition to the usual options

OPTION SNP_file ../typ.blupf90

OPTION CreateGimA22i

the specific option


OPTION saveDiagGimA22iRen

This option saves the elements of the diagonal of this matrix labelled with the same numbers as the first column of the pedigree file, and not in the order 1, 2... in which it is stored internally.

Second, I run my SSGBLUP genetic evaluation as usual.

Third, I run accf90GS with the options


OPTION DiagGinv_file ./DiagGimA22i_ren.txt
OPTION adj_zeta 1.0

OPTION conv_crit 1d-06

where DiagGinv_file tells the program where to find $diag(G^{-1}-A_{22}^{-1})$, adj_zeta is a measure of the amount of information added by genomics (usually 1), and conv_crit 1d-06 tells the program to iterate, but not too much.


Friday, December 22, 2017

Genetic Variances in an observed population

To describe the genetic variance of a population we resort more and more to the genetic variance of an observed population, as opposed to that of an ideal population. For instance, the latter considers neither linkage disequilibrium nor deviations from Hardy-Weinberg equilibrium. Thus the former may be more useful or accurate.

The (empirical) genetic variance of one population composed of $n$ individuals is simply the (empirical) variance of its genetic values:

$$var(u)=\frac{1}{n}\sum_i u_i^2-\left(\frac{1}{n}\sum_i u_i\right)^2$$

This definition does not consider any genetic structure among the observed individuals. However, because genetic values are unknown, we need to estimate them, typically in a Mixed Model context.

To my knowledge, there are two methods to do this. The first is to estimate a variance component (say $\sigma^2_u$) that actually refers to an infinite, unrelated population. Then, based on this variance component, we can estimate the expected value of the empirical $var(u)$ over the multivariate distribution $Var({\bf u})={\bf K}\sigma^2_u$, where ${\bf K}$ is a relationship matrix. This results in "my" estimator:

$$E(var(u))=D_K \sigma^2_u = \big(mean(diag({\bf K}))-mean({\bf K})\big)\sigma^2_u$$

that is becoming increasingly used (and not only by ourselves 😅), among other things because it is extremely simple.
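As a quick sanity check of this estimator, take unrelated, non-inbred individuals, ${\bf K}={\bf I}$; then $mean(diag({\bf K}))=1$ and $mean({\bf K})=1/n$, so

$$E(var(u))=\left(1-\frac{1}{n}\right)\sigma^2_u$$

which is just the classical expectation of an empirical variance (with $1/n$ divisor) of $n$ i.i.d. values.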

A more refined approach was developed by Sorensen et al. (this is in my opinion Daniel's best paper) and recently revived in the context of genomic prediction by Lehermeier et al. The estimator is Bayesian in nature, and it constructs the posterior distribution of $var(u)$ from samples of the genetic values ${\bf u}$; i.e., at each iteration of the Gibbs sampler the sample of $var(u)$ is

$$\widetilde{var(u)}=\frac{1}{n}\sum_i \tilde u_i^2-\left(\frac{1}{n}\sum_i \tilde u_i\right)^2$$

This is well described in BGLR. In fact, "my" method is a simplification of the Sorensen et al. method.

As a frequent blupf90 suite user, I want to implement this other approach with gibbs2f90. It is actually quite simple, without modifying the code:


  1. write samples of the genetic values ${\bf u}$ to the output using OPTION solution all in the parameter file
  2. post-process these samples to obtain samples of $\widetilde{var(u)}$ using $\widetilde{var(u)}=\frac{1}{n}\sum_i \tilde u_i^2-\left(\frac{1}{n}\sum_i \tilde u_i\right)^2$
  3. get an estimate (i.e. the mean) from the samples of $\widetilde{var(u)}$
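Step 2 can be sketched with a throwaway awk one-liner (an assumption for the demo: each input line holds one complete Gibbs sample of $u_1 \ldots u_n$, which is not necessarily the layout of the actual samples file):

```shell
# Each line = one Gibbs sample of u_1 ... u_n; print the posterior mean of var(u)
printf '%s\n' '1 -1' '2 -2' |
awk '{
  s = 0; s2 = 0
  for (j = 1; j <= NF; j++) { s += $j; s2 += $j * $j }
  v = s2/NF - (s/NF)^2        # empirical variance of this sample of u
  sum += v; niter++
}
END { print sum/niter }'
# sample 1: var = 1; sample 2: var = 4; posterior mean = 2.5
```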

I have re-analysed, with the Sorensen et al. method, the mice data set used for "my" estimator, to obtain estimates of genetic variance using four kinds of relationships: identity by state, VanRaden's G, pedigree relationships, and kernel similarities. The estimates (posterior means) of genetic variance $var(u)$ are:
  • IBS: 0.037
  • G: 0.035
  • Kernel: 0.050
  • pedigree: 0.035
which compare well with my previous estimates of 0.036, 0.033, 0.047 and 0.038. The larger value of the kernel estimator is because it includes both additive and dominance relationships. When do we expect real differences? For small, very informative data sets like the Lehermeier et al. data; in that case the Sorensen et al. method is better.


Monday, October 9, 2017

Anatomy of a solved bug in blupf90

I had a request from Alban Bouquet to look into a strange single-step GBLUP problem in multiple-trait pig breeding data. Using the freely available blupf90 and the default iterative solver (PCG), a sequence of NaNs (not a number) was produced, with no solution. The intriguing aspect is that deleting a few records or genotypes at random produced a correct sequence of iterations leading to sensible solutions of the SSGBLUP equations. Also, the same SSGBLUP ran perfectly (and more quickly) using blup90iod2.

Let's say that the problem is roughly equivalent in size to 1,000,000 animals in pedigree and 40,000 genotyped individuals.

Using debug options in compilation

optf90 = -g -traceback -check all -fpe0 
optf77 = -g -traceback -check all -fpe0 

showed that in the file pcg.f90 there was a division by 0 in the subroutine

subroutine solve_sparse_hashm_pcg 

a subroutine that had worked perfectly for many, many years, in this line:

alpha=tau/dot_product(p,w)

it turned out that w (a vector) had all values of 0. However, this vector is computed as

w=mult(a,p)

where a is a sparse hash matrix as described in the course notes. This mult is carried out in file fspak90.f90 (which can be downloaded here). If a has nonzero elements and p has nonzero elements, how come their product is 0? The reason is that a, which is basically a three-column array, contains lots of lines: actually, 2147483648 elements. This is beyond the largest default (4-byte) integer in Fortran, which is 2147483647. In order to store very large matrices, the hash structure uses "bigger" 8-byte integers (integer(i8)), up to a MUCH larger value. So, when a hash matrix arrives with its integer(i8) at the naive mult subroutine with its "normal" integers, there is an overflow in the conversion - and the result is 0.

Fixing the problem is as simple as changing
integer::j
into
integer(i8)::j

and the program now works perfectly...


Wednesday, June 28, 2017

The Breeding Value of Harry Potter

Here is a description of the "wizard" trait in the Harry Potter world:



So, the trait "wizard" is conferred by the recessive allele W at a locus with alleles W and M. What is the breeding value of Harry Potter?

Let "wizard" be scored as 1 and "Muggle" as 0.
The different genotypes have a value, in a classical "Falconerian" setting, of

WW     WM      MM
a       d      -a
0.5   -0.5   -0.5

so that $a=0.5$, $d=-0.5$. The substitution effect of W is $\alpha=a+(q-p)d$. The question is: what is the allelic frequency $p$ of W?

If the number of wizards is 1% of the population, then $p^2=0.01$, so $p=0.1$, $q=0.9$, and $\alpha=0.5+(0.9-0.1)(-0.5)=0.1$.

The breeding value of Harry Potter is $2 \alpha= 0.2$.


The allelic frequency $p$ can actually be worked out from this: "According to Rowling, the average Hogwarts annual intake for Muggle-borns is 25%".

The bulk of these Muggle-borns come from couples WM x WM, which have frequency $4p^2q^2$. These couples produce a wizard WW in one out of every four kids, so the frequency of WW kids born to Muggle couples is $p^2q^2$. The wizard couples WW x WW are $p^4$ of the population, and every kid of theirs is a wizard WW too, so they represent $p^4$. Then we have that

$$\frac{p^2 q^2}{p^2 q^2 + p^4} =\frac{1}{4}$$

Solving this gives $p=\sqrt{3}/(1+\sqrt{3})\approx 0.63$, which is incompatible with wizards being rare ($p=0.1$ from before). This is because, if being a wizard is rare, by far most wizards come from "Muggle" couples: if the pure wizards are 1% of the population and $p=0.1$, then wizards from Muggle couples WM x WM are about 80-fold more numerous than those from wizard WW x WW couples - far above the 25% intake. Should we conclude that most Muggle-born wizards do not go to Hogwarts?