artadia: split and substr in awk to handle genotypes

Monday, March 14, 2016

split and substr in awk to handle genotypes

Genotype handling often involves loops across genotypes. In awk, I typically use loops. To pick one locus from the long string described in the previous post, we have two possibilities, substr() or split()

QmSim2uga.awk changes formats from output of QMSim to input of blupf90. Here is one such loop using substr() , which selects a substring from a string

for (i=1; i<=nsnp; i++){

                        out=substr($2,i,1)

                        # QmSim -> UGA

                        # 0 -> AA -> 0

                        # 2 -> aa -> 2

                        # 3 -> Aa -> 1

                        # 4 -> aA -> 1

                        # (missing does not exist) -> 5

                        if(out>2) {out=1}

                        genotype=genotype out

}

For 9 individuals, in my mac

time ./QmSim2uga.awk p1_mrk_001.txt >/dev/null

nsnp, nanim     78519         9

user 1m47.467s

this takes 2 min .

Alternatively, we split() the string into an array, then we loop through the array:

split($2,geno,"")

for (i=1; i<=nsnp; i++){

                        out=geno[i]

                        #substr($2,i,1)

                        # QmSim -> UGA

                        # 0 -> AA -> 0

                        # 2 -> aa -> 2

                        # 3 -> Aa -> 1

                        # 4 -> aA -> 1

                        # (missing does not exist) -> 5

                        if(out>2) {out=1}

                        genotype=genotype out

}

This form is waaaay faster:

user 0m0.285s

And with all individuals:

nsnp, nanim     78519      6037

real 3m16.279s

user 3m13.201s

artadia

Monday, March 14, 2016

split and substr in awk to handle genotypes

No comments:

Post a Comment