Tuesday, November 27, 2018

How to split a large SNP file into individual chunks


To convert this (one animal per line, UGA format as explained here and here)


         1 012202001001000020202000
         2 012202001001000021211110
         3 012202001001000020202000
         4 012211001100010020202000
         5 111211101100010110111001
         6 002202000000000021211110
         7 012211001100010021211110
         8 022220002200020020202000
         9 012202001001000020202000
        10 111211101100010110111001

into as many individual SNP files as loci,e.g.


zcat singlemarker/x000001.gz | head
        1   1
        2   1
        3   1
        4   1
        5   2
        6   1
        7   1
        8   1
        9   1
       10   2

zcat singlemarker/x000002.gz | head
         1   2
         2   2
         3   2
         4   2
         5   2
         6   1
         7   2
         8   3
         9   2
        10   2

How to do it efficiently?

1-transpose the file (in my case via Fortran program) to:

0000100001
1111101211
2222122221
2222222222
0001101201


e.g. the first line is the first marker, and so on

2-use the extraordinary GNU split to split into files, one line (=one marker) at a time:


split -l 1 -d BB.700K.gen_transposed -a 6 --numeric-suffixes=1 --filter='gzip >$FILE.gz'

This command splits one line at a time (-l 1) creates a series of files with numeric suffixes starting in 1 (--numeric-suffixes=1) of width 6 (-a 6) like x000002.gz and finnaly "piped" through gzip to obtain compressed files (--filter='gzip >$FILE.gz')

Still the files look

0000100001

1111101211

2222122221

3-use  GNU fold to insert a newline after each character:

zcat  singlemarker/x000001.gz | fold -w 1 

format bash script

from https://linuxconfig.org/bash-printf-syntax-basics-with-examples


for i in $( seq 1 10 ); do printf "%03d\t" "$i"; done
001     002     003     004     005     006     007     008     009     010

Wednesday, August 8, 2018

submitting parallel jobs in genotoul / genologin with slurm

In another post I found out how to submit parallel jobs in Genotoul. Genotoul has a new cluster (genologin) and a new job scheduler: slurm, instead of "genotoul" with sge.

Here is how to submit a parallel job:

alegarra@genologin2 ~/work/mice_tpb $ cat run.sh
#!/bin/bash
#SBATCH --mem=1G
#SBATCH --cpus-per-task=10 
#SBATCH --time=10:20:00 
module load mpi/openmpi-2.1.2-intel2018.0.128
module load compiler/intel-2018.0.128
export OMP_NUM_THREADS=10
preGSf90 mice_G.par

Note the "module load" that before was not needed. This is because libraries are now mostly dynamic.

Edit look at: https://hpcrcf.atlassian.net/wiki/spaces/TCP/pages/7287288/How-to+Submit+an+OpenMP+Job

Saturday, July 7, 2018

Accuracies using blupf90 suite including accf90GS


The blupf90 suite of programs has a few options to compute individual reliabilities.

The first and exact one is using blupf90 itself, by inversion of the Mixed Model Equations, using

OPTION sol se

This is however only doable for small-medium data sets (up to maybe 3 million equations).

The second one, which is more appropriate for threshold traits, is using Gibbs sampler by, for instance, gibbsf90test computing the posterior variance of the EBVs using

OPTION fixed_var all

For an example, see here. Still, this requires large memory and lots of patience.

The third one, which is needed in large data sets, is by approximation. This program, including now genomic information, is accf90GS and is only available under special agreement. (If you can use blupf90 and invert the MME, do it - it is much better than using accf90GS).  Several approximations exist, the ones that are used in this program combine essentially three papers:


  • Misztal, I., & Wiggans, G. R. (1988). Approximation of prediction error variance in large-scale animal models. Journal of Dairy Science71, 27-32.
  • Sanchez, J. P., Misztal, I., & Bertrand, J. K. (2008). Evaluation of methods for computing approximate accuracies of predicted breeding values in maternal random regression models for growth traits in beef cattle. Journal of animal science86(5), 1057-1066.
  • Misztal, I., Tsuruta, S., Aguilar, I., Legarra, A., VanRaden, P. M., & Lawlor, T. J. (2013). Methods to approximate reliabilities in single-step genomic evaluation. Journal of dairy science96(1), 647-654.

Running accf90GS is rather tricky. In my case, it gets even worse because I do not use renumf90. So this is how I do:

First, I run preGSf90 to get the diagonal of the matrix $latex G^{-1}-A_{22}^{-1} $latex  using, in addition to the usual options

OPTION SNP_file ../typ.blupf90

OPTION CreateGimA22i

the specific option


OPTION saveDiagGimA22iRen

This option saves the elements of the diagonal of this matrix with the same numbers as the first column of the pedigree file, and not in the order 1, 2... as it is ordered internally.

Second, I run my SSGBLUP genetic evaluation as usual.

Third,  I run Accf90GS with options


OPTION DiagGinv_file ./DiagGimA22i_ren.txt
OPTION adj_zeta 1.0

OPTION conv_crit 1d-06

where  tells the program where to find $diag(G^{-1}-A_{22}^{-1})$ , adj_zeta 1.0 is a measure of the amount of information added by genomics (usually 1) and conv_crit 1d-06 tells the program to iterate but not too much.