Tuesday, April 9, 2019

handling dates in python

I have this data set with elementary test-day records from Manech Tete Noire also known as burubeltza

64422506;01;45000130047;052;03;2014;1;26FEB2014;0;02APR2014
64422506;01;45000130047;052;03;2014;1;26FEB2014;0;05MAY2014
64422506;01;45000130047;052;03;2014;1;26FEB2014;0;05JUN2014

64422506;01;45000130047;052;03;2015;2;07DEC2014;0;30DEC2014

and I need to compute difference between date of lambing and date of milk recording. I thought it would be easy using python(3).

import copy
import datetime

First, read and parse the data

fname=sys.argv[1]
fhand=open(fname)
for i,line in enumerate(fhand):
    Dacont = line.split(sep=";")[9]
    Datago = line.split(sep=";")[7]

Then, I need to convert these strings 26FEB2014 into something meaningful using module datetime, and compute the difference:

    daysfrommisebas= datetime.datetime.strptime(Dacont.lower(),'%d%b%Y') - datetime.datetime.strptime(Datago.lower(), '%d%b%Y')
and finally, print it, not forgetting to paste the strings using "+" and removing the EOL in "line":


    print(line.rstrip('\n')+";"+str(daysfrommisebas.days))

renumf90 anf preGSf90 with pedigree and genotypes only

Right, so you have genotypes and pedigree and you would like to use preGSf90 to build G and A22, or for quality control, or for whatever reason. Then you are told to use renumf90 first, but how can you use it with no phenotypes?

Well you can just create your own, fake ones. So, if your genotyped animals are in file marker.geno.clean with this aspect

  64000670990546 12012020210211
     45214790003 12111120211101

     45214790004 22111020110102

you can simply create a random number for each animal, and a file with "animal" and overall mean, using awk:


awk '{print 1,$1,rand()}' marker.geno.clean phenotypes.txt

and then use the basic renumf90 parameter file in https://artadia.blogspot.com/2019/03/basic-renumf90-parameter-file.html.

After that, you edit the renf90.par file to put the required OPTIONS. For instance, these are the options to get lots of information, check heritability of gene content 

OPTION msg 100
OPTION h2_gene_content

Thursday, March 7, 2019

basic renumf90 parameter file

DATAFILE
phenotypes.txt
TRAITS
3
WEIGHT(S)

RESIDUAL_VARIANCE  
 0.5 
EFFECT
1 cross alpha
EFFECT
2 cross alpha  #animal
RANDOM
animal
OPTIONAL

FILE
pedigree
FILE_POS
1 2 3 0 0
SNP_FILE
marker.geno.clean
PED_DEPTH
0
(CO)VARIANCES
    0.5 

Tuesday, November 27, 2018

How to split a large SNP file into individual chunks


To convert this (one animal per line, UGA format as explained here and here)


         1 012202001001000020202000
         2 012202001001000021211110
         3 012202001001000020202000
         4 012211001100010020202000
         5 111211101100010110111001
         6 002202000000000021211110
         7 012211001100010021211110
         8 022220002200020020202000
         9 012202001001000020202000
        10 111211101100010110111001

into as many individual SNP files as loci,e.g.


zcat singlemarker/x000001.gz | head
        1   1
        2   1
        3   1
        4   1
        5   2
        6   1
        7   1
        8   1
        9   1
       10   2

zcat singlemarker/x000002.gz | head
         1   2
         2   2
         3   2
         4   2
         5   2
         6   1
         7   2
         8   3
         9   2
        10   2

How to do it efficiently?

1-transpose the file (in my case via Fortran program) to:

0000100001
1111101211
2222122221
2222222222
0001101201


e.g. the first line is the first marker, and so on

2-use the extraordinary GNU split to split into files, one line (=one marker) at a time:


split -l 1 -d BB.700K.gen_transposed -a 6 --numeric-suffixes=1 --filter='gzip >$FILE.gz'

This command splits one line at a time (-l 1) creates a series of files with numeric suffixes starting in 1 (--numeric-suffixes=1) of width 6 (-a 6) like x000002.gz and finnaly "piped" through gzip to obtain compressed files (--filter='gzip >$FILE.gz')

Still the files look

0000100001

1111101211

2222122221

3-use  GNU fold to insert a newline after each character:

zcat  singlemarker/x000001.gz | fold -w 1 

format bash script

from https://linuxconfig.org/bash-printf-syntax-basics-with-examples


for i in $( seq 1 10 ); do printf "%03d\t" "$i"; done
001     002     003     004     005     006     007     008     009     010

Wednesday, August 8, 2018

submitting parallel jobs in genotoul / genologin with slurm

In another post I found out how to submit parallel jobs in Genotoul. Genotoul has a new cluster (genologin) and a new job scheduler: slurm, instead of "genotoul" with sge.

Here is how to submit a parallel job:

alegarra@genologin2 ~/work/mice_tpb $ cat run.sh
#!/bin/bash
#SBATCH --mem=1G
#SBATCH --cpus-per-task=10 
#SBATCH --time=10:20:00 
module load mpi/openmpi-2.1.2-intel2018.0.128
module load compiler/intel-2018.0.128
export OMP_NUM_THREADS=10
preGSf90 mice_G.par

Note the "module load" that before was not needed. This is because libraries are now mostly dynamic.

Edit look at: https://hpcrcf.atlassian.net/wiki/spaces/TCP/pages/7287288/How-to+Submit+an+OpenMP+Job

Saturday, July 7, 2018

Accuracies using blupf90 suite including accf90GS


The blupf90 suite of programs has a few options to compute individual reliabilities.

The first and exact one is using blupf90 itself, by inversion of the Mixed Model Equations, using

OPTION sol se

This is however only doable for small-medium data sets (up to maybe 3 million equations).

The second one, which is more appropriate for threshold traits, is using Gibbs sampler by, for instance, gibbsf90test computing the posterior variance of the EBVs using

OPTION fixed_var all

For an example, see here. Still, this requires large memory and lots of patience.

The third one, which is needed in large data sets, is by approximation. This program, including now genomic information, is accf90GS and is only available under special agreement. (If you can use blupf90 and invert the MME, do it - it is much better than using accf90GS).  Several approximations exist, the ones that are used in this program combine essentially three papers:


  • Misztal, I., & Wiggans, G. R. (1988). Approximation of prediction error variance in large-scale animal models. Journal of Dairy Science71, 27-32.
  • Sanchez, J. P., Misztal, I., & Bertrand, J. K. (2008). Evaluation of methods for computing approximate accuracies of predicted breeding values in maternal random regression models for growth traits in beef cattle. Journal of animal science86(5), 1057-1066.
  • Misztal, I., Tsuruta, S., Aguilar, I., Legarra, A., VanRaden, P. M., & Lawlor, T. J. (2013). Methods to approximate reliabilities in single-step genomic evaluation. Journal of dairy science96(1), 647-654.

Running accf90GS is rather tricky. In my case, it gets even worse because I do not use renumf90. So this is how I do:

First, I run preGSf90 to get the diagonal of the matrix $latex G^{-1}-A_{22}^{-1} $latex  using, in addition to the usual options

OPTION SNP_file ../typ.blupf90

OPTION CreateGimA22i

the specific option


OPTION saveDiagGimA22iRen

This option saves the elements of the diagonal of this matrix with the same numbers as the first column of the pedigree file, and not in the order 1, 2... as it is ordered internally.

Second, I run my SSGBLUP genetic evaluation as usual.

Third,  I run Accf90GS with options


OPTION DiagGinv_file ./DiagGimA22i_ren.txt
OPTION adj_zeta 1.0

OPTION conv_crit 1d-06

where  tells the program where to find $diag(G^{-1}-A_{22}^{-1})$ , adj_zeta 1.0 is a measure of the amount of information added by genomics (usually 1) and conv_crit 1d-06 tells the program to iterate but not too much.