artadia: 2018

Tuesday, November 27, 2018

How to split a large SNP file into individual chunks

To convert this (one animal per line, UGA format as explained here and here)

1 012202001001000020202000

2 012202001001000021211110

3 012202001001000020202000

4 012211001100010020202000

5 111211101100010110111001

6 002202000000000021211110

7 012211001100010021211110

8 022220002200020020202000

9 012202001001000020202000

10 111211101100010110111001

into as many individual SNP files as loci,e.g.

zcat singlemarker/x000001.gz | head

1 1

2 1

3 1

4 1

5 2

6 1

7 1

8 1

9 1

10 2

zcat singlemarker/x000002.gz | head

1 2

2 2

3 2

4 2

5 2

6 1

7 2

8 3

9 2

10 2

How to do it efficiently?

1-transpose the file (in my case via Fortran program) to:

0000100001

1111101211

2222122221

2222222222

0001101201

e.g. the first line is the first marker, and so on

2-use the extraordinary GNU split to split into files, one line (=one marker) at a time:

split -l 1 -d BB.700K.gen_transposed -a 6 --numeric-suffixes=1 --filter='gzip >$FILE.gz'

This command splits one line at a time (-l 1) creates a series of files with numeric suffixes starting in 1 (--numeric-suffixes=1) of width 6 (-a 6) like x000002.gz and finnaly "piped" through gzip to obtain compressed files (--filter='gzip >$FILE.gz')

Still the files look

0000100001

1111101211

2222122221

3-use GNU fold to insert a newline after each character:

zcat singlemarker/x000001.gz | fold -w 1

format bash script

from https://linuxconfig.org/bash-printf-syntax-basics-with-examples

for i in $( seq 1 10 ); do printf "%03d\t" "$i"; done
001     002     003     004     005     006     007     008     009     010

Wednesday, August 8, 2018

submitting parallel jobs in genotoul / genologin with slurm

In another post I found out how to submit parallel jobs in Genotoul. Genotoul has a new cluster (genologin) and a new job scheduler: slurm, instead of "genotoul" with sge.

Here is how to submit a parallel job:

alegarra@genologin2 ~/work/mice_tpb $ cat run.sh

#!/bin/bash

#SBATCH --mem=1G

#SBATCH --cpus-per-task=10 

#SBATCH --time=10:20:00 

module load mpi/openmpi-2.1.2-intel2018.0.128

module load compiler/intel-2018.0.128

export OMP_NUM_THREADS=10

preGSf90 mice_G.par

Note the "module load" that before was not needed. This is because libraries are now mostly dynamic.

Edit look at: https://hpcrcf.atlassian.net/wiki/spaces/TCP/pages/7287288/How-to+Submit+an+OpenMP+Job

Saturday, July 7, 2018

Accuracies using blupf90 suite including accf90GS

The blupf90 suite of programs has a few options to compute individual reliabilities.

The first and exact one is using blupf90 itself, by inversion of the Mixed Model Equations, using

OPTION sol se

This is however only doable for small-medium data sets (up to maybe 3 million equations).

The second one, which is more appropriate for threshold traits, is using Gibbs sampler by, for instance, gibbsf90test computing the posterior variance of the EBVs using

OPTION fixed_var all

For an example, see here. Still, this requires large memory and lots of patience.

The third one, which is needed in large data sets, is by approximation. This program, including now genomic information, is accf90GS and is only available under special agreement. (If you can use blupf90 and invert the MME, do it - it is much better than using accf90GS). Several approximations exist, the ones that are used in this program combine essentially three papers:

Misztal, I., & Wiggans, G. R. (1988). Approximation of prediction error variance in large-scale animal models. Journal of Dairy Science, 71, 27-32.
Sanchez, J. P., Misztal, I., & Bertrand, J. K. (2008). Evaluation of methods for computing approximate accuracies of predicted breeding values in maternal random regression models for growth traits in beef cattle. Journal of animal science, 86(5), 1057-1066.
Misztal, I., Tsuruta, S., Aguilar, I., Legarra, A., VanRaden, P. M., & Lawlor, T. J. (2013). Methods to approximate reliabilities in single-step genomic evaluation. Journal of dairy science, 96(1), 647-654.

Running accf90GS is rather tricky. In my case, it gets even worse because I do not use renumf90. So this is how I do:

First, I run preGSf90 to get the diagonal of the matrix $latex G^{-1}-A_{22}^{-1} $latex using, in addition to the usual options

OPTION SNP_file ../typ.blupf90

OPTION CreateGimA22i

the specific option

OPTION saveDiagGimA22iRen

This option saves the elements of the diagonal of this matrix with the same numbers as the first column of the pedigree file, and not in the order 1, 2... as it is ordered internally.

Second, I run my SSGBLUP genetic evaluation as usual.

Third, I run Accf90GS with options

OPTION DiagGinv_file ./DiagGimA22i_ren.txt

OPTION adj_zeta 1.0

OPTION conv_crit 1d-06

where tells the program where to find $diag(G^{-1}-A_{22}^{-1})$ , adj_zeta 1.0 is a measure of the amount of information added by genomics (usually 1) and conv_crit 1d-06 tells the program to iterate but not too much.