Friday, December 9, 2016

Sires in validation and Sires in training

I want to test metafounders for genomic selection. So I have these MTR sires that have daughters in the validation , in the training, or in both of them, and I want a file with these numbers. I have two files:

T is:
juan 10
pepe 5

V is:

luis 6
pepe 4

so luis has no daughters on Training, and I want this to appear. Then I use this pipeline from Toni Reverter from CSIRO:

awk '{print $1}' $1 $2 | sort -u | join -a1 - $1 | awk '{print $1, (NF==2?$2:0)}' | join -a1 - $2 | awk '{print $1, $2, (NF==3?$3:0)}'

which gives

putterri:yarp andres$ ./joinToni.sh V T

juan 0 10

luis 6 0

pepe 4 5

Install GNU coreutils in the Mac

The MacOSX join is poorer than GNU one. So I try toinstall it using macports. First I have to update the data base of port:

putterri:mf andres$ port -v selfupdate

But this does not work. Then I try

putterri:mf andres$ sudo port -d selfupdate

This does work. Join is actually part of coreutils:

putterri:mf andres$ port search --name coreutils

coreutils @8.25 (sysutils)

    GNU File, Shell, and Text utilities

xml-coreutils @0.8.1_1 (textproc, xml)

    Command line tools for XML processing

Found 2 ports.

So I do install coreutils:

putterri:mf andres$ sudo port install coreutils

The most important information is at the end:

The tools provided by GNU coreutils are prefixed with the character 'g' by default to distinguish them from the BSD commands.

For example, cp becomes gcp and ls becomes gls.

If you want to use the GNU tools by default, add this directory to the front of your PATH environment variable:

    /opt/local/libexec/gnubin/

Saturday, November 19, 2016

Plotting correlations

I found this nice R package to plot correlations : corrplot . Imagine that you want to present five genetic correlations to your buddies. Here's how to visualize them nicely.

#create a covariance matrix
set.seed(1234)
V=rWishart(1,10,diag(5))[,,1]
cov2cor(V)
[,1] [,2] [,3] [,4] [,5]
[1,] 1.00000000 0.1169636 0.2236465 -0.2667448 0.04351447
[2,] 0.11696358 1.0000000 -0.2260988 -0.3425368 0.64837326
[3,] 0.22364648 -0.2260988 1.0000000 -0.1883763 -0.22484960
[4,] -0.26674477 -0.3425368 -0.1883763 1.0000000 -0.50315007
[5,] 0.04351447 0.6483733 -0.2248496 -0.5031501 1.00000000
#give it names
colnames(V)=c("MY","gain","longevity","SCS","pietin")
rownames(V)=c("MY","gain","longevity","SCS","pietin")

#plot
require(corrplot)
corrplot.mixed(cor(V),col=gray.colors(10))

Thursday, October 27, 2016

French Keyboard on US Macbook Keyboard

I have a US-like keyboard with this key disposition

However, it happens to me to write in French changing keyboard preferences to French. The keys are mapped to a French keyboard. I use a rubber cover from kbcovers.com, which is excellent and looks like this:

However, the physical layout of a true French macbok keyboard is slightly different. Also, and more impostant, some important characters such as | ~ # [ ] { } are not present, or difficult to see, in the cover.

So, I prepared my own cheatsheet:

Yes, it does look horrible and yes, it does work :-)

Tuesday, September 20, 2016

formatting SNPs using R or awk

In some software for genomic prediction (blupf90, GS3 and may be other) the genotypes shoukd be given in a plain text file as follows:

snp_file.txt

          25 1121022100
         600 0111220012
        1333 0110111111
           5 1120112102
          89 0111220001

with no spaces between genotypes, id and genotypes separated by spaces - no tabs - and all genotypes starting at the same column. Sometimes it is not obvious how to get this format. Llibertat Tusell got a solution in R:

snps=sample(c(0,1,2),prob=c(.25,0.5,.25),size=50,replace=T)
X=matrix(ncol=10,nrow=5,snps)
animal=c(25,600,1333,5,89)

con <- file("snp_file.txt", "w")

for (i in 1:5){
tmp=paste( X[i,] , collapse = "" )
cat(format(animal[i],width=12),tmp,'\n',file=con)
}
close(con)

The R program collapses markers into a single string, then puts format to the animal "word" so that it has constant width, then it writes it to a file

Another solution is to use awk and start from a file, e.g.

$ cat exo_geno_spaces

1101 1 0 2 2 1 1 2

101 2 2 1 1 2 1 1

254 1 1 2 0 1 1 0

255 2 1 2 2 0 1 2

Then you can use an awk program ./remove_spaces_snps.awk:

#! /opt/local/bin/gawk -f

# this script removes space between SNP genotypes

# and formatting as UGA

BEGIN{}

{

# print animal

printf( "%20s",$1)

printf( "%1s"," ")

for (i =2; i<=NF; i++){

printf( "%1s",$i)

}

printf("\n")

}

END{}

This awk program prints on stdout the animal (with constant width) then markers without sopace separation, then a newline.

$ ./remove_spaces_snps.awk exo_geno_spaces > out

$ cat out

1101 1022112

101 2211211

254 1120110

255 2122012

Monday, September 19, 2016

Livestock Fair: Pirenaica cattle

I was in a livestock fair in Irurtzun and I could take good pictures of local breeds. These are Pirenaica cows and calves. The Pirenaica breeding association is Conaspi and is has been the object of many scientific publications.

In the same village, there used to be a weekly livestock fair every tuesday until the 70's. In this old picture circa 1950 you can see some local animals (some of them Pirenaica) that were used, among other things, for working the fields.

Thursday, September 15, 2016

Reordering matrix in R

I have this matrix with relationships across metafounders and years. However the matrix is disordered, let's say:

> g
1998 2010 1970
1998 0.90 0.67 0.83
2010 0.67 0.92 0.52
1970 0.83 0.52 0.95

I would like to sort this matrix in ascending order, i.e. the cell corrresponding to [1970,1970] should go on top left.

This can be done using a loop but it is tricky. In R this is easy using sorted indices:

> g1=g[order(rownames(g)),order(colnames(g))]
> g1
1970 1998 2010
1970 0.95 0.83 0.52
1998 0.83 0.90 0.67
2010 0.52 0.67 0.92

Tuesday, September 13, 2016

ifort options -openmp and -qopenmp

I work in three computers, two servers (let's call them grits and cassoulet) and my own Mac machine. Well, I found out that the options to use OpenMP in the Intel Fortran compiler (ifort) are different depending on the compiler version:

grits: ifort 15.0.3: -qopenmp
Mac: ifort 15.0.2: -qopenmp

but...

cassoulet: ifort 14.0.3: -openmp

According to the documentation for for ifort, the option -openmp is deprecated .

Friday, July 29, 2016

gawk in MobaXterm

MobaXterm is a nice unix/linux console emulator. We were trying to use it to run a script for genetic evaluation, yet in one of the awk scripts we used the sentence
BEGIN {FIELDWIDTHS = "12 14 8 2 3 3 8 1 3 14 14 3 8 14" }

it seems that this FIELDWIDTHS is restricted to gawk, but not to all implementations of awk, in particular not to the one in MobaXterm. However MobaXterm has the possibility of installing programs and plugins. Thus I try

[andres.ANDRESLEGAR2422] → apt-get

apt-cyg: Installs and removes Cygwin packages.

  "apt-cyg install <package names>" to install packages

So, I install it

[andres.ANDRESLEGAR2422] → apt-get install gawk

Trying to download file setup.bz2

Updated setup.ini

Found package gawk

Downloading gawk-4.1.3-1.tar.xz...

Unpacking gawk-4.1.3-1.tar.xz...

Extracting dependencies for usr/bin/gawk.exe...

Extracting dependencies for usr/bin/gawk-4.1.3.exe...

Extracting dependencies for usr/libexec/awk/pwcat.exe...

Extracting dependencies for usr/libexec/awk/grcat.exe...

Extracting dependencies for usr/lib/gawk/filefuncs.dll...

Extracting dependencies for usr/lib/gawk/fnmatch.dll...

Extracting dependencies for usr/lib/gawk/fork.dll...

Extracting dependencies for usr/lib/gawk/inplace.dll...

Extracting dependencies for usr/lib/gawk/ordchr.dll...

Extracting dependencies for usr/lib/gawk/readdir.dll...

Extracting dependencies for usr/lib/gawk/readfile.dll...

Extracting dependencies for usr/lib/gawk/revoutput.dll...

Extracting dependencies for usr/lib/gawk/revtwoway.dll...

Extracting dependencies for usr/lib/gawk/rwarray.dll...

Extracting dependencies for usr/lib/gawk/testext.dll...

Extracting dependencies for usr/lib/gawk/time.dll...

Package gawk requires the following packages, installing bash cygwin libgcc1 libgmp10 libintl8 libmpfr4 libreadline7

Package bash is already installed, skipping

Package cygwin is already installed, skipping

Found package libgcc1

Package libgcc1 is already included, skipping

Found package libgmp10

Installing libgmp10

Downloading libgmp10-6.1.0-3p1.tar.xz...

Unpacking libgmp10-6.1.0-3p1.tar.xz...

Extracting dependencies for usr/bin/cyggmp-10.dll...

Package libgmp10 requires the following packages, installing cygwin

Package cygwin is already installed, skipping

Package libgmp10 installed.

Found package libintl8

Package libintl8 is already included, skipping

Found package libmpfr4

Installing libmpfr4

Downloading libmpfr4-3.1.4-1.tar.xz...

Unpacking libmpfr4-3.1.4-1.tar.xz...

Extracting dependencies for usr/bin/cygmpfr-4.dll...

Package libmpfr4 requires the following packages, installing cygwin libgcc1 libgmp10

Package cygwin is already installed, skipping

Found package libgcc1

Package libgcc1 is already included, skipping

Package libgmp10 is already installed, skipping

Package libmpfr4 installed.

Found package libreadline7

Package libreadline7 is already included, skipping

Package gawk installed.

Rebasing new libraries

Found package rebase

Installing rebase

Downloading rebase-4.4.2-1.tar.xz...

Unpacking rebase-4.4.2-1.tar.xz...

Extracting dependencies for usr/bin/rebase.exe...

Extracting dependencies for usr/bin/peflags.exe...

Package rebase requires the following packages, installing coreutils cygwin grep gzip sed

Found package coreutils

Package coreutils is already included, skipping

Package cygwin is already installed, skipping

Package grep is already installed, skipping

Found package gzip

Package gzip is already included, skipping

Found package sed

Package sed is already included, skipping

Package rebase installed.

Then I change in the scripts to use /bin/gawk.exe

Now everything seems to work fine.

Wednesday, July 13, 2016

"new" equation editor in word

Since circa 2010 you can do nice things with the Equation Editor (even for Mac) in Word and Power Point. For instance my notes on genomic prediction. Note that this is NOT the horrible Equation Editor that you think:

But a new one which has LaTex-y kind of input, using Math Autocorrect:

which is based on this. It works nicely in PowerPoint too:

So, I completely gave up using MathType.

However, there are no good resources for this nrew Equation Editor, and the web of Math Autocorrect is not very good.

I used this documentati to learn (although it is a bit outdated):
http://school-maths.com/documents/Word_2010_stols.pdf .

You may also like these two titles:

The Word 2007/2010 Equation Editor
Using Keystrokes to Write EquationsIn Microsoft Office 2007 Equation Editor

A cheatsheet with shortcuts.

There is also MathInOffice is a very nice resource but not exactly a tutorial.

Thursday, July 7, 2016

Working in Windows machines

To my taste, Windows is inefficient for serious number crunching. But if you need to do so, and you feel like doing things the Unix way, you may:

install MobaXterm, a nice unix terminal emulator
install notepad++, en excellent plain text editor
get a linux/unix command cheatsheet (most commands are in MobaXterm):

In Spanish: Comandos Unix/Linux – Guía de Referencia
In French
or in English

I also recommend to use awk (although this is by no means absolute need). For an awk manual, the best one that I know is in Spanish, but there are many out there.
And finally, R and gfortran .

There is always the question on how to open a terminal. Here it goes for Windows 10 and for Windows 7.

Saturday, May 28, 2016

R in Jupyter

First I had to install Macports for El Capitan:

https://www.macports.org/install.php

Then I followed the instrucions in: http://irkernel.github.io/requirements/

Using the "source" method.

New web page using Markdown

I have a web page hosted by Genotoul. I changed yesterday the ugly old one to a new one, much lighter. The main reason I changed is that I discovered how easy is to create html files using MarkDown with an appropriate text editor:

# This is a title

This is a text

* item1
* item2

Translates into

This is a title

This is a text

item1
item2

Sunday, May 8, 2016

Burguete horses and Pirenaica cattle

In Navarre: Burguete mare and foal, and Pirenaica cows.

Chisqueta sheep in Sobrarbe

Still a sheep breed that I heard about but never saw, the Chisqueta or Xisqueta sheep. Here are some references in aragonese and catalan, and here is one in Spanish.

Wednesday, May 4, 2016

Variance components with high numbers

I have this large data set in which I want to estimate heritability. These are the features for the trait:

n   nmiss    avg          min        max         var       

------------------------------------------------------------------------------

1703515 0  1772.570604     100.470000  7783.789000 524507.442442

So, the variance for the trait is very high. Blupf90 runs nicely. However, both airemlf90 and gibbs2f90 variance component programs in the suite gave me some trouble if I used. The reason is that the programs contain some "tolerance" parameter in the inverses of the updated variance components, and this seems to set some of the numbers to 0.

For instance, if divide the value of the performance by 10, I get this:

 * Start Gibbs iteration04-29-2016  16h 26m 59s 482

 round            1

G

   768.75    

G

   157.18    

R

   0.0000    

 round            2

The reason for this is the scale and the large number of data. The sum of squares of the residual is

 ee=  0.16456E+10

when this is sent to the subroutine, its inverse is lower than 10^-9 and set to 0.

I finally standardized MY to a variance of 1. I put initial variances to $latex \sigma^2_g=\sigma^2_p=\sigma^2_e=1 $latex. After 1000 iterations of Gibbs, I get:

* End of iteration04-29-2016  12h 09m 52s 514

 ave G

  0.11236    

 SD G

  0.31784E-01

 ave G

  0.14050    

 SD G

  0.13210E-01

 ave R

  0.25338    

 SD R

  0.58904E-02

In this case I obtain correct estimates ($latex h^2 =0.22 $latex, repeatability of 0.5).

Standard errors of complex functions of variance components from REML Information matrix

In airemlf90 and other programs, a byproduct of the AIREML is the inverse of the Information Matrix that, asymptotically, describes the sampling covariance of the variance components. It can also be considered as an approximation to the posterior distribution of the parameters.

Getting standard errors of functions of the variance components is rather painful algebraically as one requires the delta method and Taylor series expansions. A much simpler method , implemented in as OPTION se_covar_function in airemlf90, comes through MonteCarlo sampling and was described by Meyer and Houle. Here I will show how this work.
This a model with three variance components, genetic, permanent and residual variance. Estimates from airemlf90 are in vector ahat:

#genetic, permanent, residual

ahat=c(

  0.11478,    

  0.13552,    

  0.25290,

)

with AI matrix:

# inverse of AI matrix (Sampling Variance)

AI=matrix(c(

  0.16799E-05, -0.96486E-06, -0.82566E-08,

 -0.96486E-06,  0.96167E-06, -0.37113E-07,

 -0.82566E-08, -0.37113E-07,  0.10864E-06)

 ,ncol=3)

asymptotically, the "true" value has a sampling (or posterior) distribution a~N(ahat,AI), and getting samples from this distribution is easy:

require(MASS)

b=mvrnorm(10000,ahat,AI)

> head(b)

[,1] [,2] [,3]

[1,] 0.1146738 0.1357640 0.2529399

[2,] 0.1163889 0.1342926 0.2528479

[3,] 0.1166155 0.1344342 0.2525161

[4,] 0.1142085 0.1358928 0.2534974

[5,] 0.1136835 0.1361108 0.2530133

[6,] 0.1140485 0.1365707 0.2530573

and from here, we can obtain samples of the heritability and its standard deviation:

h2=b[,1]/(b[,1]+b[,2]+b[,3])

sd(h2)

> 0.002318198

This is very useful, i.e. for complex model such as consideration of genomic and dominance variation.

Friday, April 29, 2016

Equations in blogger

http://irrep.blogspot.fr/2011/07/mathjax-in-blogger-ii.html

Seems to work: $h^2=1$

UPDATE: it does work but then it messes up the $'s corresponding to awk expressions so I desactivate it.

Monday, April 25, 2016

Estimation of allelic base frequencies

Tao Xiang is estimating base frequencies in a pig population. A simple but efficient method is Gengler et al. 2007 . This is actually very similar to GCMTBLUP, nut the latter considers one locus and an associated trait. Thus, Tao implement the model g=1m+Za+e using blupf90 software, where m=2p. The data file, with genotype coded as 0/1/2 looks like

6051 1 2

6930 1 2

5205 1 2

7478 1 0

3208 1 1

Using direct inversion by fspak, the solution is p=0.88 . However, using the default iterative solver of blupf90 we get 0.44 !! Actually, using a more strict convergence criteria we get the exact solution:

  convcrit iter      20*m         p

1       12    5  9.016301 0.4508151

2       13    5  9.016301 0.4508151

3       14    9  9.016345 0.4508172

4       15   11  9.016358 0.4508179

5       16  103 16.779241 0.8389621

6       17  110 17.503890 0.8751945

7       18  117 17.570641 0.8785320

8       19  124 17.574420 0.8787210

9       20  126 17.574496 0.8787248

where the convergence criteria is actually 10^(-convcrit). So, the iterator is stuck at 0.45. The reason for this is that heritability is 1 so that for a given mu, EBVs for genotyped animals almost do not change and viceversa.

This is how the correct estimate of EBV looks like:

There is no pathology in this graph. The first 7000 individuals have genotype, so the breeding value is simply the deviation of the genotype from the populational mean. The three bars are individuals in genotypes AA, Aa and aa. The rest of individuals are ancestors, vaguely recoded in ascending order, so that their genotype is not neatly estimated.

This is how the wrong estimate of EBV looks like:

With biased EBVs for genotyped animals and, for non genotyped animals, EBVs quite different from the previous graph.

Take home message: if you estimate base allele frequencies using Gengler method by BLUP, use either solutions by inversion or set the convergence criteria to strict values.

Monday, April 4, 2016

Installing Julia and Jupyter in Windows 7

Rohan Fernando came to INRA Toulose last two weeks and he showed us Julia and Jupyter. They look very nice. The instructions to install both are here and here. I installed them in my Mac without much trouble but in the Windows 7 machine of Jean Michel Elsen was more difficult. There were issues related to WinRPM that I did not understand.

To install in JM computer I did

Clean up everything, desinstall Julia, Anaconda, etc if already installed; delete folders .Julia, .Jupyter, etc.
Install Julia
Install Anaconda (Python >3)
Start Julia. In its command line, Pkg.add("IJUlia") [ this is not clearly documented]. This seems to install Jupyter automatically. Let Julia open.
Open a command line in Windows (in "Accesories"), then write jupyter notebook. The default navigator (Firefox) should open.

It works correctly in the Windows 7 Virtual Machine in my Mac

Tuesday, March 22, 2016

substitute number with sed

I want to automatically automatically substitute the "number of levels" of the pedigree in a blupf90 parameter file. For that, I keep a basic.par:

DATAFILE

data

NUMBER_OF_TRAITS

1

NUMBER_OF_EFFECTS

2

OBSERVATION(S)

3

WEIGHT(S)

EFFECTS: POSITIONS_IN_DATAFILE NUMBER_OF_LEVELS TYPE_OF_EFFECT[EFFECT NESTED]

1 1 cross

2 30000 cross

RANDOM_RESIDUAL VALUES

0.7

 RANDOM_GROUP

2

 RANDOM_TYPE

add_animal

 FILE

ped

(CO)VARIANCES

0.3

then I read the actual number of levels in ped:

# get number in pedigree

n=$(wc -l ped | awk '{print $1}')

in my example, 2800

and I use sed to modify the 30000 to 'n':

# change 2nd number in line 13 to n

sed '13 s/[0-9]* /'$n' /2' < basic.par > new.par

I am not good with regular expressions, but I believe this means "in line 13, any number ([0-9] repeated from 0 to infinite times -that's what * means' and only substitute the 2nd occurrence". The spaces in the regular expression are by trial and error :-(
The little that I know about sed comes from the Grymoire .

Monday, March 14, 2016

split and substr in awk to handle genotypes

Genotype handling often involves loops across genotypes. In awk, I typically use loops. To pick one locus from the long string described in the previous post, we have two possibilities, substr() or split()

QmSim2uga.awk changes formats from output of QMSim to input of blupf90. Here is one such loop using substr() , which selects a substring from a string

for (i=1; i<=nsnp; i++){

                        out=substr($2,i,1)

                        # QmSim -> UGA

                        # 0 -> AA -> 0

                        # 2 -> aa -> 2

                        # 3 -> Aa -> 1

                        # 4 -> aA -> 1

                        # (missing does not exist) -> 5

                        if(out>2) {out=1}

                        genotype=genotype out

}

For 9 individuals, in my mac

time ./QmSim2uga.awk p1_mrk_001.txt >/dev/null

nsnp, nanim     78519         9

user 1m47.467s

this takes 2 min .

Alternatively, we split() the string into an array, then we loop through the array:

split($2,geno,"")

for (i=1; i<=nsnp; i++){

                        out=geno[i]

                        #substr($2,i,1)

                        # QmSim -> UGA

                        # 0 -> AA -> 0

                        # 2 -> aa -> 2

                        # 3 -> Aa -> 1

                        # 4 -> aA -> 1

                        # (missing does not exist) -> 5

                        if(out>2) {out=1}

                        genotype=genotype out

}

This form is waaaay faster:

user 0m0.285s

And with all individuals:

nsnp, nanim     78519      6037

real 3m16.279s

user 3m13.201s

Thursday, March 3, 2016

format genotypes for blupf90, GS3

Blupf90 and GS3 require genotypes to be in this form:

       345 1111212111212112

       346 1121111211211021

       347 2022222220202022

       348 1111111211211021

      1349 2022222220202022

     12350 1111212111212112

       351 1121111211211021

       352 1121111211211021

       353 2022222220202022

this also works

     345   1111212111212112

       346 1121111211211021

      347  2022222220202022

       348 1111111211211021

      1349 2022222220202022

     12350 1111212111212112

       349 2022222220202022

       350 1111212111212112

       351 1121111211211021

       352 1121111211211021

       353 2022222220202022

or this

345   1111212111212112

346   1121111211211021

347   2022222220202022

348   1111111211211021

1349  2022222220202022

12350 1111212111212112

this will be read erroneously and it will give wrong results:

345 1111212111212112

346 1121111211211021

347 2022222220202022

348 1111111211211021

1349 2022222220202022

12350 1111212111212112

Id's and genotypes (coded as 0/1/2) need to be separated by 1 or several spaces (not tabs) and genotypes need to start at exactly the same column.

A simple fix is to use awk.
Imagine that your genotype file is gene.txt :

$ cat gene.txt

345 1111212111212112

346 1121111211211021

347 2022222220202022

348 1111111211211021

1349 2022222220202022

12350 1111212111212112

then you can do

awk 'printf("%10s%1s%" length($2) "s\n",$1," ",$2) gene.txt >gene2.txt

On gene2.txt, things are formatted:

awk '{printf("%10s%1s%" length($2) "s\n",$1," ",$2)}' gene.txt >gene2.txt

cat gene2.txt 

       345 1111212111212112

       346 1121111211211021

       347 2022222220202022

       348 1111111211211021

      1349 2022222220202022

     12350 1111212111212112