Monday, March 14, 2016

split and substr in awk to handle genotypes

Genotype handling often involves loops across genotypes. In awk, I typically use loops. To pick one locus from the long string described in the previous post, we have two possibilities, substr() or split()

QmSim2uga.awk changes formats from output of QMSim to input of blupf90. Here is one such loop using substr() , which selects a substring from a string

for (i=1; i<=nsnp; i++){
                        out=substr($2,i,1)
                        # QmSim -> UGA
                        # 0 -> AA -> 0
                        # 2 -> aa -> 2
                        # 3 -> Aa -> 1
                        # 4 -> aA -> 1
                        # (missing does not exist) -> 5
                        if(out>2) {out=1}
                        genotype=genotype out

                }

For 9 individuals, in my mac
time ./QmSim2uga.awk p1_mrk_001.txt >/dev/null
nsnp, nanim     78519         9
user 1m47.467s

this takes 2 min .

Alternatively, we split() the string into an array, then we loop through the array:

split($2,geno,"")
for (i=1; i<=nsnp; i++){
                        out=geno[i]
                        #substr($2,i,1)
                        # QmSim -> UGA
                        # 0 -> AA -> 0
                        # 2 -> aa -> 2
                        # 3 -> Aa -> 1
                        # 4 -> aA -> 1
                        # (missing does not exist) -> 5
                        if(out>2) {out=1}
                        genotype=genotype out

                }
This form is waaaay faster:
user 0m0.285s
And with all individuals:
nsnp, nanim     78519      6037

real 3m16.279s
user 3m13.201s

No comments:

Post a Comment