Friday, December 9, 2016

Sires in validation and Sires in training

I want to test metafounders for genomic selection. So I have these MTR sires that have daughters in the validation , in the training, or in both of them, and I want a file with these numbers. I have two files:

T is:
juan 10
pepe 5


V is:

luis 6
pepe 4


so luis has no daughters on Training, and I want this to appear. Then I use this pipeline from Toni Reverter from CSIRO:

awk '{print $1}' $1 $2 | sort -u | join -a1 - $1 | awk '{print $1, (NF==2?$2:0)}' | join -a1 - $2 | awk '{print $1, $2, (NF==3?$3:0)}'


which gives

putterri:yarp andres$ ./joinToni.sh V T
juan 0 10
luis 6 0

pepe 4 5





Install GNU coreutils in the Mac

The MacOSX join is poorer than GNU one. So I try toinstall it using macports. First I have to update the data base of port:

putterri:mf andres$ port -v selfupdate

But this does not work. Then I try


putterri:mf andres$ sudo port -d selfupdate

This does work. Join is actually part of coreutils:


putterri:mf andres$ port search --name coreutils
coreutils @8.25 (sysutils)
    GNU File, Shell, and Text utilities

xml-coreutils @0.8.1_1 (textproc, xml)
    Command line tools for XML processing

Found 2 ports.

So I do install coreutils:


putterri:mf andres$ sudo port install coreutils

The most important information is at the end:

The tools provided by GNU coreutils are prefixed with the character 'g' by default to distinguish them from the BSD commands.
For example, cp becomes gcp and ls becomes gls.

If you want to use the GNU tools by default, add this directory to the front of your PATH environment variable:
    /opt/local/libexec/gnubin/


Saturday, November 19, 2016

Plotting correlations

I found this nice R package to plot correlations : corrplot . Imagine that you want to present five genetic correlations to your buddies. Here's how to visualize them nicely.

#create a covariance matrix
set.seed(1234)
V=rWishart(1,10,diag(5))[,,1]
cov2cor(V)
            [,1]       [,2]       [,3]       [,4]        [,5]
[1,]  1.00000000  0.1169636  0.2236465 -0.2667448  0.04351447
[2,]  0.11696358  1.0000000 -0.2260988 -0.3425368  0.64837326
[3,]  0.22364648 -0.2260988  1.0000000 -0.1883763 -0.22484960
[4,] -0.26674477 -0.3425368 -0.1883763  1.0000000 -0.50315007
[5,]  0.04351447  0.6483733 -0.2248496 -0.5031501  1.00000000
#give it names
colnames(V)=c("MY","gain","longevity","SCS","pietin")
rownames(V)=c("MY","gain","longevity","SCS","pietin")

#plot
require(corrplot)
corrplot.mixed(cor(V),col=gray.colors(10))


Thursday, October 27, 2016

French Keyboard on US Macbook Keyboard

I have a US-like keyboard with this key disposition


However, it happens to me to write in French changing keyboard preferences to French. The keys are mapped to a French keyboard.  I use a rubber cover from kbcovers.com, which is excellent and looks like this:


However, the physical layout of a true French macbok keyboard is slightly different. Also, and more impostant, some important characters such as | ~ # [ ] { } are not present, or difficult to see, in the cover.

So, I prepared my own cheatsheet:


Yes, it does look horrible and yes, it does work :-)

Tuesday, September 20, 2016

formatting SNPs using R or awk

In some software for genomic prediction (blupf90, GS3 and may be other) the genotypes shoukd be given in a plain text file as follows:

snp_file.txt

          25 1121022100
         600 0111220012
        1333 0110111111
           5 1120112102
          89 0111220001 


with no spaces between genotypes, id and genotypes separated by spaces - no tabs - and all genotypes starting at the same column. Sometimes it is not obvious how to get this format. Llibertat Tusell got a solution in R:

snps=sample(c(0,1,2),prob=c(.25,0.5,.25),size=50,replace=T)
X=matrix(ncol=10,nrow=5,snps)
animal=c(25,600,1333,5,89)

con <- file("snp_file.txt", "w")

for (i in 1:5){
  tmp=paste( X[i,] , collapse = "" )
  cat(format(animal[i],width=12),tmp,'\n',file=con)
}
close(con)


The R program collapses markers into a single string, then puts format to the animal "word" so that it has constant width, then it writes it to a file

Another solution is to use awk and start from a file, e.g.

$ cat exo_geno_spaces 
  1101 1 0 2 2 1 1 2
     101 2 2 1 1 2 1 1
   254 1 1 2 0 1 1 0

   255    2 1 2 2 0 1 2

Then you can use an awk program ./remove_spaces_snps.awk:

#! /opt/local/bin/gawk -f
# this script removes space between SNP genotypes
# and formatting as UGA
BEGIN{}
{
    # print animal
    printf( "%20s",$1)
    printf( "%1s"," ")

    for (i =2; i<=NF; i++){
      printf( "%1s",$i)
    }
    printf("\n")
 }
END{}

This awk program prints on stdout the animal (with constant width) then markers without sopace separation, then a newline.

$ ./remove_spaces_snps.awk exo_geno_spaces > out
$ cat out 

                1101 1022112
                 101 2211211
                 254 1120110

                 255 2122012

Monday, September 19, 2016

Livestock Fair: Pirenaica cattle

I was in a livestock fair in Irurtzun and I could take good pictures of local breeds. These are Pirenaica cows and calves. The Pirenaica breeding association is Conaspi and is has been the object of many scientific publications.





In the same village, there used to be a weekly livestock fair every tuesday until the 70's. In this old picture circa 1950 you can see some local animals (some of them Pirenaica) that were used, among other things, for working the fields. 






Thursday, September 15, 2016

Reordering matrix in R

I have this matrix with relationships across metafounders and years. However the matrix is disordered, let's say:

> g
     1998 2010 1970
1998 0.90 0.67 0.83
2010 0.67 0.92 0.52
1970 0.83 0.52 0.95

I would like to sort this matrix in ascending order, i.e. the cell corrresponding to [1970,1970] should go on top left.

This can be done using a loop but it is tricky. In R this is easy using sorted indices:

> g1=g[order(rownames(g)),order(colnames(g))]
> g1
     1970 1998 2010
1970 0.95 0.83 0.52
1998 0.83 0.90 0.67
2010 0.52 0.67 0.92



Tuesday, September 13, 2016

ifort options -openmp and -qopenmp

I work in three computers, two servers (let's call them grits and cassoulet) and my own Mac machine. Well, I found out that the options to use OpenMP in the Intel Fortran compiler (ifort) are different depending on the compiler version:

  • grits: ifort 15.0.3: -qopenmp
  • Mac: ifort 15.0.2: -qopenmp
but...

  • cassoulet: ifort 14.0.3: -openmp
According to the documentation for for ifort, the option -openmp is deprecated .


Friday, July 29, 2016

gawk in MobaXterm

MobaXterm is a nice unix/linux console emulator. We were trying to use it to run a script for genetic evaluation, yet in one of the awk scripts we used the sentence
 BEGIN {FIELDWIDTHS = "12 14 8 2 3 3 8 1 3 14 14 3 8 14" }

it seems that this FIELDWIDTHS is restricted to gawk, but not to all implementations of awk, in particular not to the one in MobaXterm. However MobaXterm has the possibility of installing programs and plugins. Thus I try

[andres.ANDRESLEGAR2422] → apt-get
apt-cyg: Installs and removes Cygwin packages.
  "apt-cyg install <package names>" to install packages

So, I install it

[andres.ANDRESLEGAR2422] → apt-get install gawk

Trying to download file setup.bz2
Updated setup.ini
Found package gawk
Downloading gawk-4.1.3-1.tar.xz...
Unpacking gawk-4.1.3-1.tar.xz...
Extracting dependencies for usr/bin/gawk.exe...
Extracting dependencies for usr/bin/gawk-4.1.3.exe...
Extracting dependencies for usr/libexec/awk/pwcat.exe...
Extracting dependencies for usr/libexec/awk/grcat.exe...
Extracting dependencies for usr/lib/gawk/filefuncs.dll...
Extracting dependencies for usr/lib/gawk/fnmatch.dll...
Extracting dependencies for usr/lib/gawk/fork.dll...
Extracting dependencies for usr/lib/gawk/inplace.dll...
Extracting dependencies for usr/lib/gawk/ordchr.dll...
Extracting dependencies for usr/lib/gawk/readdir.dll...
Extracting dependencies for usr/lib/gawk/readfile.dll...
Extracting dependencies for usr/lib/gawk/revoutput.dll...
Extracting dependencies for usr/lib/gawk/revtwoway.dll...
Extracting dependencies for usr/lib/gawk/rwarray.dll...
Extracting dependencies for usr/lib/gawk/testext.dll...
Extracting dependencies for usr/lib/gawk/time.dll...
Package gawk requires the following packages, installing bash cygwin libgcc1 libgmp10 libintl8 libmpfr4 libreadline7
Package bash is already installed, skipping
Package cygwin is already installed, skipping
Found package libgcc1
Package libgcc1 is already included, skipping
Found package libgmp10

Installing libgmp10
Downloading libgmp10-6.1.0-3p1.tar.xz...
Unpacking libgmp10-6.1.0-3p1.tar.xz...
Extracting dependencies for usr/bin/cyggmp-10.dll...
Package libgmp10 requires the following packages, installing cygwin
Package cygwin is already installed, skipping
Package libgmp10 installed.
Found package libintl8
Package libintl8 is already included, skipping
Found package libmpfr4

Installing libmpfr4
Downloading libmpfr4-3.1.4-1.tar.xz...
Unpacking libmpfr4-3.1.4-1.tar.xz...
Extracting dependencies for usr/bin/cygmpfr-4.dll...
Package libmpfr4 requires the following packages, installing cygwin libgcc1 libgmp10
Package cygwin is already installed, skipping
Found package libgcc1
Package libgcc1 is already included, skipping
Package libgmp10 is already installed, skipping
Package libmpfr4 installed.
Found package libreadline7
Package libreadline7 is already included, skipping
Package gawk installed.

Rebasing new libraries

Found package rebase

Installing rebase
Downloading rebase-4.4.2-1.tar.xz...
Unpacking rebase-4.4.2-1.tar.xz...
Extracting dependencies for usr/bin/rebase.exe...
Extracting dependencies for usr/bin/peflags.exe...
Package rebase requires the following packages, installing coreutils cygwin grep gzip sed
Found package coreutils
Package coreutils is already included, skipping
Package cygwin is already installed, skipping
Package grep is already installed, skipping
Found package gzip
Package gzip is already included, skipping
Found package sed
Package sed is already included, skipping
Package rebase installed.

Then I change in the scripts to use /bin/gawk.exe 
Now everything seems to work fine.


Wednesday, July 13, 2016

"new" equation editor in word

Since circa 2010 you can do nice things with the Equation Editor  (even for Mac) in Word and Power Point. For instance my notes on genomic prediction. Note that this is NOT the horrible Equation Editor that you think:


But a new one which has LaTex-y kind of input, using Math Autocorrect:

which is based on this. It works nicely in PowerPoint too:



 So, I completely gave up using MathType.

However, there are no good resources for this nrew Equation Editor, and the web of Math Autocorrect is not very good.

I used this documentati to learn (although it is a bit outdated):
http://school-maths.com/documents/Word_2010_stols.pdf .

You may also like these two titles:

The Word 2007/2010 Equation Editor
Using Keystrokes to Write EquationsIn Microsoft Office 2007 Equation Editor 


A cheatsheet with shortcuts.

There is also MathInOffice is a very nice resource but not exactly a tutorial.

Thursday, July 7, 2016

Working in Windows machines

To my taste, Windows is inefficient for serious number crunching. But if you need to do so, and you feel like doing things the Unix way, you may:


  1. install MobaXterm, a nice unix terminal emulator
  2. install notepad++, en excellent plain text editor
  3. get a linux/unix command cheatsheet (most commands are in MobaXterm):
    1. In Spanish: Comandos Unix/Linux – Guía de Referencia
    2. In French
    3. or in English
  4. I also recommend to use awk (although this is by no means absolute need). For an awk manual, the best one that I know is in Spanish, but there are many out there.
  5. And finally, R and gfortran .

There is always the question on how to open a terminal. Here it goes for Windows 10 and for Windows 7.

Saturday, May 28, 2016

R in Jupyter

First I had to install Macports for El Capitan:

https://www.macports.org/install.php


Then I followed the instrucions in: http://irkernel.github.io/requirements/

Using the "source" method.

New web page using Markdown

I have a web page hosted by Genotoul. I changed yesterday the ugly old one to a new one, much lighter. The main reason I changed is that I discovered how easy is to create html files using MarkDown with an appropriate text editor:

# This is a title

This is a text

* item1
* item2

Translates into

This is a title

This is a text
  • item1
  • item2


Sunday, May 8, 2016

Burguete horses and Pirenaica cattle

In Navarre: Burguete mare and foal, and Pirenaica cows.


Chisqueta sheep in Sobrarbe

Still a sheep breed that I heard about but never saw, the Chisqueta or Xisqueta sheep. Here are some references in aragonese and catalan, and here is one in Spanish.


Wednesday, May 4, 2016

Variance components with high numbers

I have this large data set in which I want to estimate heritability. These are the features for the trait:

nmiss   avg         min       max         var      
------------------------------------------------------------------------------

1703515 0 1772.570604    100.470000 7783.789000 524507.442442

So, the variance for the trait is very high. Blupf90 runs nicely. However, both airemlf90 and gibbs2f90 variance component programs in the suite gave me some trouble if I used. The reason is that the programs contain some "tolerance" parameter in the inverses of the updated variance components, and this seems to set some of the numbers to 0.

For instance, if divide the value of the performance by 10, I get this:


 * Start Gibbs iteration04-29-2016  16h 26m 59s 482
 round            1
 G
   768.75    
 G
   157.18    
 R
   0.0000    

 round            2

The reason for this is the scale and the large number of data. The sum of squares of the residual is
 ee=  0.16456E+10

when this is sent to the subroutine, its inverse is lower than 10^-9 and set to 0.

I finally standardized MY to a variance of 1. I put initial variances to $latex \sigma^2_g=\sigma^2_p=\sigma^2_e=1 $latex. After 1000 iterations of Gibbs, I get:

* End of iteration04-29-2016  12h 09m 52s 514
 ave G
  0.11236    
 SD G
  0.31784E-01
 ave G
  0.14050    
 SD G
  0.13210E-01
 ave R
  0.25338    
 SD R

  0.58904E-02

In this case I obtain correct estimates ($latex h^2 =0.22 $latex, repeatability of 0.5).

Standard errors of complex functions of variance components from REML Information matrix

In airemlf90 and other programs, a byproduct of the AIREML is the inverse of the Information Matrix that, asymptotically, describes the sampling covariance of the variance components. It can also be considered as an approximation to the posterior distribution of the parameters.

Getting standard errors of functions of the variance components is rather painful algebraically as one requires the delta method and Taylor series expansions. A much simpler method , implemented in as OPTION se_covar_function in airemlf90, comes through MonteCarlo sampling and was described by Meyer and Houle. Here I will show how this work.
This a model with three variance components, genetic, permanent and residual variance. Estimates from airemlf90 are in vector ahat:

#genetic, permanent, residual
ahat=c(
  0.11478,    
  0.13552,    
  0.25290,
  )

with AI matrix:

# inverse of AI matrix (Sampling Variance)
AI=matrix(c(
  0.16799E-05, -0.96486E-06, -0.82566E-08,
 -0.96486E-06,  0.96167E-06, -0.37113E-07,
 -0.82566E-08, -0.37113E-07,  0.10864E-06)

 ,ncol=3)

asymptotically, the "true" value has a sampling (or posterior) distribution a~N(ahat,AI), and getting samples from this distribution is easy:

require(MASS)
b=mvrnorm(10000,ahat,AI)


> head(b)
          [,1]      [,2]      [,3]
[1,] 0.1146738 0.1357640 0.2529399
[2,] 0.1163889 0.1342926 0.2528479
[3,] 0.1166155 0.1344342 0.2525161
[4,] 0.1142085 0.1358928 0.2534974
[5,] 0.1136835 0.1361108 0.2530133
[6,] 0.1140485 0.1365707 0.2530573

and from here, we can obtain samples of the heritability and its standard deviation:

h2=b[,1]/(b[,1]+b[,2]+b[,3])
sd(h2)

> 0.002318198

This is very useful, i.e. for complex model such as consideration of genomic and dominance variation.

Friday, April 29, 2016

Equations in blogger

http://irrep.blogspot.fr/2011/07/mathjax-in-blogger-ii.html 

Seems to work: $h^2=1$

UPDATE: it does work but then it messes up the $'s corresponding to awk expressions so I desactivate it.

Monday, April 25, 2016

Estimation of allelic base frequencies

Tao Xiang is estimating base frequencies in a pig population. A simple but efficient method is Gengler et al. 2007 . This is actually very similar to GCMTBLUP, nut the latter considers one locus and an associated trait. Thus, Tao implement the model g=1m+Za+e  using blupf90 software, where m=2p. The data file, with genotype coded as 0/1/2 looks like

6051 1 2
6930 1 2
5205 1 2
7478 1 0

3208 1 1

Using direct inversion by fspak, the solution is p=0.88 . However, using the default iterative solver of blupf90 we get 0.44 !! Actually, using a more strict convergence criteria we get the exact solution:

  convcrit iter      20*m         p
1       12    5  9.016301 0.4508151
2       13    5  9.016301 0.4508151
3       14    9  9.016345 0.4508172
4       15   11  9.016358 0.4508179
5       16  103 16.779241 0.8389621
6       17  110 17.503890 0.8751945
7       18  117 17.570641 0.8785320
8       19  124 17.574420 0.8787210

9       20  126 17.574496 0.8787248

where the convergence criteria is actually 10^(-convcrit). So, the iterator is stuck at 0.45. The reason for this is that heritability is 1 so that for a given mu, EBVs for genotyped animals almost do not change and viceversa.

This is how the correct estimate of EBV looks like:
There is no pathology in this graph. The first 7000 individuals have genotype, so the breeding value is simply the deviation of the genotype from the populational mean. The three bars are individuals in genotypes AA, Aa and aa. The rest of individuals are ancestors, vaguely recoded in ascending order, so that their genotype is not neatly estimated.

This is how the wrong estimate of EBV looks like:
With biased EBVs for genotyped animals and, for non genotyped animals, EBVs quite different from the previous graph.

Take home message: if you estimate base allele frequencies using Gengler method by BLUP, use either solutions by inversion or set the convergence criteria to strict values.


Monday, April 4, 2016

Installing Julia and Jupyter in Windows 7

Rohan Fernando came to INRA Toulose last two weeks and he showed us Julia and Jupyter. They look very nice. The instructions to install both are here and here. I installed them in my Mac without much trouble but in the Windows 7 machine of Jean Michel Elsen was more difficult. There were issues related to WinRPM that I did not understand.


To install in JM computer I did
  1. Clean up everything, desinstall Julia, Anaconda, etc if already installed; delete folders .Julia, .Jupyter, etc.
  2. Install Julia
  3. Install Anaconda (Python >3)
  4. Start Julia. In its command line, Pkg.add("IJUlia") [ this is not clearly documented]. This seems to install Jupyter automatically. Let Julia open.
  5. Open a command line in Windows (in "Accesories"), then write jupyter notebook. The default navigator (Firefox) should open.
It works correctly in the Windows 7 Virtual Machine in my Mac

Tuesday, March 22, 2016

substitute number with sed

I want to automatically automatically substitute the "number of levels" of the pedigree in a blupf90 parameter file. For that, I keep a basic.par:

DATAFILE
data
NUMBER_OF_TRAITS
           1
NUMBER_OF_EFFECTS
           2
OBSERVATION(S)
3
WEIGHT(S)

EFFECTS: POSITIONS_IN_DATAFILE NUMBER_OF_LEVELS TYPE_OF_EFFECT[EFFECT NESTED]
1 1 cross
2 30000 cross
RANDOM_RESIDUAL VALUES
0.7
 RANDOM_GROUP
           2
 RANDOM_TYPE
add_animal
 FILE
ped                                                
(CO)VARIANCES
0.3

then I read the actual number of levels in ped:


# get number in pedigree

n=$(wc -l ped | awk '{print $1}')

in my example, 2800
and I use sed to modify the 30000 to 'n':


# change 2nd number in line 13 to n 

sed  '13 s/[0-9]* /'$n' /2' < basic.par > new.par

I am not good with regular expressions, but I believe this means "in line 13, any number ([0-9] repeated from 0 to infinite times -that's what * means' and only substitute the 2nd occurrence".  The spaces in the regular expression are by trial and error :-(  
The little that I know about sed comes from the Grymoire .

Monday, March 14, 2016

split and substr in awk to handle genotypes

Genotype handling often involves loops across genotypes. In awk, I typically use loops. To pick one locus from the long string described in the previous post, we have two possibilities, substr() or split()

QmSim2uga.awk changes formats from output of QMSim to input of blupf90. Here is one such loop using substr() , which selects a substring from a string

for (i=1; i<=nsnp; i++){
                        out=substr($2,i,1)
                        # QmSim -> UGA
                        # 0 -> AA -> 0
                        # 2 -> aa -> 2
                        # 3 -> Aa -> 1
                        # 4 -> aA -> 1
                        # (missing does not exist) -> 5
                        if(out>2) {out=1}
                        genotype=genotype out

                }

For 9 individuals, in my mac
time ./QmSim2uga.awk p1_mrk_001.txt >/dev/null
nsnp, nanim     78519         9
user 1m47.467s

this takes 2 min .

Alternatively, we split() the string into an array, then we loop through the array:

split($2,geno,"")
for (i=1; i<=nsnp; i++){
                        out=geno[i]
                        #substr($2,i,1)
                        # QmSim -> UGA
                        # 0 -> AA -> 0
                        # 2 -> aa -> 2
                        # 3 -> Aa -> 1
                        # 4 -> aA -> 1
                        # (missing does not exist) -> 5
                        if(out>2) {out=1}
                        genotype=genotype out

                }
This form is waaaay faster:
user 0m0.285s
And with all individuals:
nsnp, nanim     78519      6037

real 3m16.279s
user 3m13.201s

Thursday, March 3, 2016

format genotypes for blupf90, GS3

Blupf90 and GS3 require genotypes to be in this form:

       345 1111212111212112
       346 1121111211211021
       347 2022222220202022
       348 1111111211211021
      1349 2022222220202022
     12350 1111212111212112
       351 1121111211211021
       352 1121111211211021

       353 2022222220202022

this also works

     345   1111212111212112
       346 1121111211211021
      347  2022222220202022
       348 1111111211211021
      1349 2022222220202022
     12350 1111212111212112
       349 2022222220202022
       350 1111212111212112
       351 1121111211211021
       352 1121111211211021

       353 2022222220202022

or this

345   1111212111212112
346   1121111211211021
347   2022222220202022
348   1111111211211021
1349  2022222220202022
12350 1111212111212112

 this will be read erroneously and it will give wrong results:

345 1111212111212112
346 1121111211211021
347 2022222220202022
348 1111111211211021
1349 2022222220202022
12350 1111212111212112


Id's and genotypes (coded as 0/1/2) need to be separated by 1 or several spaces (not tabs) and genotypes need to start at exactly the same column.

A simple fix is to use awk.
Imagine that your genotype file is gene.txt :
$ cat gene.txt
345 1111212111212112
346 1121111211211021
347 2022222220202022
348 1111111211211021
1349 2022222220202022
12350 1111212111212112


then you can do
awk 'printf("%10s%1s%" length($2) "s\n",$1," ",$2) gene.txt >gene2.txt

On gene2.txt, things are formatted:

awk '{printf("%10s%1s%" length($2) "s\n",$1," ",$2)}' gene.txt >gene2.txt

cat gene2.txt 
       345 1111212111212112
       346 1121111211211021
       347 2022222220202022
       348 1111111211211021
      1349 2022222220202022
     12350 1111212111212112