artadia

Tuesday, April 9, 2019

handling dates in python

I have this data set with elementary test-day records from Manech Tete Noire also known as burubeltza

64422506;01;45000130047;052;03;2014;1;26FEB2014;0;02APR2014

64422506;01;45000130047;052;03;2014;1;26FEB2014;0;05MAY2014

64422506;01;45000130047;052;03;2014;1;26FEB2014;0;05JUN2014

64422506;01;45000130047;052;03;2015;2;07DEC2014;0;30DEC2014

and I need to compute difference between date of lambing and date of milk recording. I thought it would be easy using python(3).

import copy
import datetime

First, read and parse the data

fname=sys.argv[1]
fhand=open(fname)
for i,line in enumerate(fhand):
Dacont = line.split(sep=";")[9]
Datago = line.split(sep=";")[7]

Then, I need to convert these strings 26FEB2014 into something meaningful using module datetime, and compute the difference:

daysfrommisebas= datetime.datetime.strptime(Dacont.lower(),'%d%b%Y') - datetime.datetime.strptime(Datago.lower(), '%d%b%Y')
and finally, print it, not forgetting to paste the strings using "+" and removing the EOL in "line":

print(line.rstrip('\n')+";"+str(daysfrommisebas.days))

renumf90 anf preGSf90 with pedigree and genotypes only

Right, so you have genotypes and pedigree and you would like to use preGSf90 to build G and A22, or for quality control, or for whatever reason. Then you are told to use renumf90 first, but how can you use it with no phenotypes?

Well you can just create your own, fake ones. So, if your genotyped animals are in file marker.geno.clean with this aspect

64000670990546 12012020210211

45214790003 12111120211101

45214790004 22111020110102

you can simply create a random number for each animal, and a file with "animal" and overall mean, using awk:

awk '{print 1,$1,rand()}' marker.geno.clean > phenotypes.txt

and then use the basic renumf90 parameter file in https://artadia.blogspot.com/2019/03/basic-renumf90-parameter-file.html.

After that, you edit the renf90.par file to put the required OPTIONS. For instance, these are the options to get lots of information, check heritability of gene content

OPTION msg 100

OPTION h2_gene_content

Thursday, March 7, 2019

basic renumf90 parameter file

DATAFILE
phenotypes.txt
TRAITS
3
WEIGHT(S)

RESIDUAL_VARIANCE
0.5
EFFECT
1 cross alpha
EFFECT
2 cross alpha #animal
RANDOM
animal
OPTIONAL

FILE
pedigree
FILE_POS
1 2 3 0 0
SNP_FILE
marker.geno.clean
PED_DEPTH
0
(CO)VARIANCES
0.5

Tuesday, November 27, 2018

How to split a large SNP file into individual chunks

To convert this (one animal per line, UGA format as explained here and here)

1 012202001001000020202000

2 012202001001000021211110

3 012202001001000020202000

4 012211001100010020202000

5 111211101100010110111001

6 002202000000000021211110

7 012211001100010021211110

8 022220002200020020202000

9 012202001001000020202000

10 111211101100010110111001

into as many individual SNP files as loci,e.g.

zcat singlemarker/x000001.gz | head

1 1

2 1

3 1

4 1

5 2

6 1

7 1

8 1

9 1

10 2

zcat singlemarker/x000002.gz | head

1 2

2 2

3 2

4 2

5 2

6 1

7 2

8 3

9 2

10 2

How to do it efficiently?

1-transpose the file (in my case via Fortran program) to:

0000100001

1111101211

2222122221

2222222222

0001101201

e.g. the first line is the first marker, and so on

2-use the extraordinary GNU split to split into files, one line (=one marker) at a time:

split -l 1 -d BB.700K.gen_transposed -a 6 --numeric-suffixes=1 --filter='gzip >$FILE.gz'

This command splits one line at a time (-l 1) creates a series of files with numeric suffixes starting in 1 (--numeric-suffixes=1) of width 6 (-a 6) like x000002.gz and finnaly "piped" through gzip to obtain compressed files (--filter='gzip >$FILE.gz')

Still the files look

0000100001

1111101211

2222122221

3-use GNU fold to insert a newline after each character:

zcat singlemarker/x000001.gz | fold -w 1

format bash script

from https://linuxconfig.org/bash-printf-syntax-basics-with-examples

for i in $( seq 1 10 ); do printf "%03d\t" "$i"; done
001     002     003     004     005     006     007     008     009     010

Wednesday, August 8, 2018

submitting parallel jobs in genotoul / genologin with slurm

In another post I found out how to submit parallel jobs in Genotoul. Genotoul has a new cluster (genologin) and a new job scheduler: slurm, instead of "genotoul" with sge.

Here is how to submit a parallel job:

alegarra@genologin2 ~/work/mice_tpb $ cat run.sh

#!/bin/bash

#SBATCH --mem=1G

#SBATCH --cpus-per-task=10 

#SBATCH --time=10:20:00 

module load mpi/openmpi-2.1.2-intel2018.0.128

module load compiler/intel-2018.0.128

export OMP_NUM_THREADS=10

preGSf90 mice_G.par

Note the "module load" that before was not needed. This is because libraries are now mostly dynamic.

Edit look at: https://hpcrcf.atlassian.net/wiki/spaces/TCP/pages/7287288/How-to+Submit+an+OpenMP+Job

Saturday, July 7, 2018

Accuracies using blupf90 suite including accf90GS

The blupf90 suite of programs has a few options to compute individual reliabilities.

The first and exact one is using blupf90 itself, by inversion of the Mixed Model Equations, using

OPTION sol se

This is however only doable for small-medium data sets (up to maybe 3 million equations).

The second one, which is more appropriate for threshold traits, is using Gibbs sampler by, for instance, gibbsf90test computing the posterior variance of the EBVs using

OPTION fixed_var all

For an example, see here. Still, this requires large memory and lots of patience.

The third one, which is needed in large data sets, is by approximation. This program, including now genomic information, is accf90GS and is only available under special agreement. (If you can use blupf90 and invert the MME, do it - it is much better than using accf90GS). Several approximations exist, the ones that are used in this program combine essentially three papers:

Misztal, I., & Wiggans, G. R. (1988). Approximation of prediction error variance in large-scale animal models. Journal of Dairy Science, 71, 27-32.
Sanchez, J. P., Misztal, I., & Bertrand, J. K. (2008). Evaluation of methods for computing approximate accuracies of predicted breeding values in maternal random regression models for growth traits in beef cattle. Journal of animal science, 86(5), 1057-1066.
Misztal, I., Tsuruta, S., Aguilar, I., Legarra, A., VanRaden, P. M., & Lawlor, T. J. (2013). Methods to approximate reliabilities in single-step genomic evaluation. Journal of dairy science, 96(1), 647-654.

Running accf90GS is rather tricky. In my case, it gets even worse because I do not use renumf90. So this is how I do:

First, I run preGSf90 to get the diagonal of the matrix $latex G^{-1}-A_{22}^{-1} $latex using, in addition to the usual options

OPTION SNP_file ../typ.blupf90

OPTION CreateGimA22i

the specific option

OPTION saveDiagGimA22iRen

This option saves the elements of the diagonal of this matrix with the same numbers as the first column of the pedigree file, and not in the order 1, 2... as it is ordered internally.

Second, I run my SSGBLUP genetic evaluation as usual.

Third, I run Accf90GS with options

OPTION DiagGinv_file ./DiagGimA22i_ren.txt

OPTION adj_zeta 1.0

OPTION conv_crit 1d-06

where tells the program where to find $diag(G^{-1}-A_{22}^{-1})$ , adj_zeta 1.0 is a measure of the amount of information added by genomics (usually 1) and conv_crit 1d-06 tells the program to iterate but not too much.