artadia: March 2016

Tuesday, March 22, 2016

substitute number with sed

I want to automatically automatically substitute the "number of levels" of the pedigree in a blupf90 parameter file. For that, I keep a basic.par:

DATAFILE

data

NUMBER_OF_TRAITS

1

NUMBER_OF_EFFECTS

2

OBSERVATION(S)

3

WEIGHT(S)

EFFECTS: POSITIONS_IN_DATAFILE NUMBER_OF_LEVELS TYPE_OF_EFFECT[EFFECT NESTED]

1 1 cross

2 30000 cross

RANDOM_RESIDUAL VALUES

0.7

 RANDOM_GROUP

2

 RANDOM_TYPE

add_animal

 FILE

ped

(CO)VARIANCES

0.3

then I read the actual number of levels in ped:

# get number in pedigree

n=$(wc -l ped | awk '{print $1}')

in my example, 2800

and I use sed to modify the 30000 to 'n':

# change 2nd number in line 13 to n

sed '13 s/[0-9]* /'$n' /2' < basic.par > new.par

I am not good with regular expressions, but I believe this means "in line 13, any number ([0-9] repeated from 0 to infinite times -that's what * means' and only substitute the 2nd occurrence". The spaces in the regular expression are by trial and error :-(
The little that I know about sed comes from the Grymoire .

Monday, March 14, 2016

split and substr in awk to handle genotypes

Genotype handling often involves loops across genotypes. In awk, I typically use loops. To pick one locus from the long string described in the previous post, we have two possibilities, substr() or split()

QmSim2uga.awk changes formats from output of QMSim to input of blupf90. Here is one such loop using substr() , which selects a substring from a string

for (i=1; i<=nsnp; i++){

                        out=substr($2,i,1)

                        # QmSim -> UGA

                        # 0 -> AA -> 0

                        # 2 -> aa -> 2

                        # 3 -> Aa -> 1

                        # 4 -> aA -> 1

                        # (missing does not exist) -> 5

                        if(out>2) {out=1}

                        genotype=genotype out

}

For 9 individuals, in my mac

time ./QmSim2uga.awk p1_mrk_001.txt >/dev/null

nsnp, nanim     78519         9

user 1m47.467s

this takes 2 min .

Alternatively, we split() the string into an array, then we loop through the array:

split($2,geno,"")

for (i=1; i<=nsnp; i++){

                        out=geno[i]

                        #substr($2,i,1)

                        # QmSim -> UGA

                        # 0 -> AA -> 0

                        # 2 -> aa -> 2

                        # 3 -> Aa -> 1

                        # 4 -> aA -> 1

                        # (missing does not exist) -> 5

                        if(out>2) {out=1}

                        genotype=genotype out

}

This form is waaaay faster:

user 0m0.285s

And with all individuals:

nsnp, nanim     78519      6037

real 3m16.279s

user 3m13.201s

Thursday, March 3, 2016

format genotypes for blupf90, GS3

Blupf90 and GS3 require genotypes to be in this form:

       345 1111212111212112

       346 1121111211211021

       347 2022222220202022

       348 1111111211211021

      1349 2022222220202022

     12350 1111212111212112

       351 1121111211211021

       352 1121111211211021

       353 2022222220202022

this also works

     345   1111212111212112

       346 1121111211211021

      347  2022222220202022

       348 1111111211211021

      1349 2022222220202022

     12350 1111212111212112

       349 2022222220202022

       350 1111212111212112

       351 1121111211211021

       352 1121111211211021

       353 2022222220202022

or this

345   1111212111212112

346   1121111211211021

347   2022222220202022

348   1111111211211021

1349  2022222220202022

12350 1111212111212112

this will be read erroneously and it will give wrong results:

345 1111212111212112

346 1121111211211021

347 2022222220202022

348 1111111211211021

1349 2022222220202022

12350 1111212111212112

Id's and genotypes (coded as 0/1/2) need to be separated by 1 or several spaces (not tabs) and genotypes need to start at exactly the same column.

A simple fix is to use awk.
Imagine that your genotype file is gene.txt :

$ cat gene.txt

345 1111212111212112

346 1121111211211021

347 2022222220202022

348 1111111211211021

1349 2022222220202022

12350 1111212111212112

then you can do

awk 'printf("%10s%1s%" length($2) "s\n",$1," ",$2) gene.txt >gene2.txt

On gene2.txt, things are formatted:

awk '{printf("%10s%1s%" length($2) "s\n",$1," ",$2)}' gene.txt >gene2.txt

cat gene2.txt 

       345 1111212111212112

       346 1121111211211021

       347 2022222220202022

       348 1111111211211021

      1349 2022222220202022

     12350 1111212111212112

Bivariate plots in R with many many pairs

Bivariate plots with many many pairs

Think that we need to plot many pairs, e.g. comparing A vs. G. What we can do in R:

#generate two correlated variables
a=rnorm(1000000)
b=a+rnorm(a)
plot is rather horrible

plot(a,b)

We can draw a map with level curves
require(MASS)
contour(kde2d(a,b))

But I find that hexagonal binning is even nicer (and very fast !!)
require(hexbin)
tt=hexbin(a,b)
plot(tt)

Finally, we have persp:
persp(kde2d(a,b))