Tuesday, March 22, 2016

substitute number with sed

I want to automatically automatically substitute the "number of levels" of the pedigree in a blupf90 parameter file. For that, I keep a basic.par:

DATAFILE
data
NUMBER_OF_TRAITS
           1
NUMBER_OF_EFFECTS
           2
OBSERVATION(S)
3
WEIGHT(S)

EFFECTS: POSITIONS_IN_DATAFILE NUMBER_OF_LEVELS TYPE_OF_EFFECT[EFFECT NESTED]
1 1 cross
2 30000 cross
RANDOM_RESIDUAL VALUES
0.7
 RANDOM_GROUP
           2
 RANDOM_TYPE
add_animal
 FILE
ped                                                
(CO)VARIANCES
0.3

then I read the actual number of levels in ped:


# get number in pedigree

n=$(wc -l ped | awk '{print $1}')

in my example, 2800
and I use sed to modify the 30000 to 'n':


# change 2nd number in line 13 to n 

sed  '13 s/[0-9]* /'$n' /2' < basic.par > new.par

I am not good with regular expressions, but I believe this means "in line 13, any number ([0-9] repeated from 0 to infinite times -that's what * means' and only substitute the 2nd occurrence".  The spaces in the regular expression are by trial and error :-(  
The little that I know about sed comes from the Grymoire .

Monday, March 14, 2016

split and substr in awk to handle genotypes

Genotype handling often involves loops across genotypes. In awk, I typically use loops. To pick one locus from the long string described in the previous post, we have two possibilities, substr() or split()

QmSim2uga.awk changes formats from output of QMSim to input of blupf90. Here is one such loop using substr() , which selects a substring from a string

for (i=1; i<=nsnp; i++){
                        out=substr($2,i,1)
                        # QmSim -> UGA
                        # 0 -> AA -> 0
                        # 2 -> aa -> 2
                        # 3 -> Aa -> 1
                        # 4 -> aA -> 1
                        # (missing does not exist) -> 5
                        if(out>2) {out=1}
                        genotype=genotype out

                }

For 9 individuals, in my mac
time ./QmSim2uga.awk p1_mrk_001.txt >/dev/null
nsnp, nanim     78519         9
user 1m47.467s

this takes 2 min .

Alternatively, we split() the string into an array, then we loop through the array:

split($2,geno,"")
for (i=1; i<=nsnp; i++){
                        out=geno[i]
                        #substr($2,i,1)
                        # QmSim -> UGA
                        # 0 -> AA -> 0
                        # 2 -> aa -> 2
                        # 3 -> Aa -> 1
                        # 4 -> aA -> 1
                        # (missing does not exist) -> 5
                        if(out>2) {out=1}
                        genotype=genotype out

                }
This form is waaaay faster:
user 0m0.285s
And with all individuals:
nsnp, nanim     78519      6037

real 3m16.279s
user 3m13.201s

Thursday, March 3, 2016

format genotypes for blupf90, GS3

Blupf90 and GS3 require genotypes to be in this form:

       345 1111212111212112
       346 1121111211211021
       347 2022222220202022
       348 1111111211211021
      1349 2022222220202022
     12350 1111212111212112
       351 1121111211211021
       352 1121111211211021

       353 2022222220202022

this also works

     345   1111212111212112
       346 1121111211211021
      347  2022222220202022
       348 1111111211211021
      1349 2022222220202022
     12350 1111212111212112
       349 2022222220202022
       350 1111212111212112
       351 1121111211211021
       352 1121111211211021

       353 2022222220202022

or this

345   1111212111212112
346   1121111211211021
347   2022222220202022
348   1111111211211021
1349  2022222220202022
12350 1111212111212112

 this will be read erroneously and it will give wrong results:

345 1111212111212112
346 1121111211211021
347 2022222220202022
348 1111111211211021
1349 2022222220202022
12350 1111212111212112


Id's and genotypes (coded as 0/1/2) need to be separated by 1 or several spaces (not tabs) and genotypes need to start at exactly the same column.

A simple fix is to use awk.
Imagine that your genotype file is gene.txt :
$ cat gene.txt
345 1111212111212112
346 1121111211211021
347 2022222220202022
348 1111111211211021
1349 2022222220202022
12350 1111212111212112


then you can do
awk 'printf("%10s%1s%" length($2) "s\n",$1," ",$2) gene.txt >gene2.txt

On gene2.txt, things are formatted:

awk '{printf("%10s%1s%" length($2) "s\n",$1," ",$2)}' gene.txt >gene2.txt

cat gene2.txt 
       345 1111212111212112
       346 1121111211211021
       347 2022222220202022
       348 1111111211211021
      1349 2022222220202022
     12350 1111212111212112



Bivariate plots in R with many many pairs

Bivariate plots with many many pairs

Think that we need to plot many pairs, e.g. comparing A vs. G. What we can do in R:

#generate two correlated variables
a=rnorm(1000000)
b=a+rnorm(a)

plot is rather horrible






















plot(a,b)

We can draw a map with level curves
 require(MASS)
 contour(kde2d(a,b))


But I find that hexagonal binning is even nicer (and very fast !!)
require(hexbin)
tt=hexbin(a,b)
plot(tt)

Finally, we have persp:
persp(kde2d(a,b))