Wednesday, January 25, 2017

transposing marker data

Quite frequently one may find marker data ordered in this way:

alegarra@genotoul2 ~/save/progs $ cat ex_rachel
 1 a b
 1 b b
 1 c c
 2 b b
 2 c c
 2 d d
 3 a b
 3 b b
 3 c c

That is, there are three animals (1 to 3), each with three markers, and each line holds one animal, one marker and its two alleles. Rachel and I would like them formatted with one animal per line and the markers one after another, in this way:


alegarra@genotoul2 ~/save/progs $ cat out
         1 ab bb cc 
         2 bb cc dd 
         3 ab bb cc 

This is conceptually simple if the file is sorted by animal, so that all lines of one animal are contiguous:
  1. Read a line
  2. If the animal is the same as the previous one, append its markers to the current line
  3. If the animal is different, start a new line, print the animal and the markers.
  4. Add special cases for the first and last line
Here is an awk implementation:


alegarra@genotoul2 ~/save/progs $ cat SNPcol2line.awk 
#! /bin/awk -f
# this program reads genotypes with one line per locus
# (individual allele1 allele2), sorted by individual,
# then writes one line per individual as
# individual allele1allele2 allele1allele2 ...
#
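BEGIN{
  idold=""
}
{
  id=$1
  # if new animal (the first record always starts a new line)
  if(NR==1 || id!=idold){
    if(NR>1){
      # close previous line
      printf("\n")
    }
    # write new id, right-aligned in 10 columns
    printf("%10s%1s",id," ")
    idold=id
  }
  # append both alleles of this locus, followed by a space
  printf("%1s%1s%1s",$2,$3," ")
}
END{
  # close the line of the last individual (nothing to close if input was empty)
  if(NR>0) printf("\n")
}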

which works:

alegarra@genotoul2 ~/save/progs $ ./SNPcol2line.awk ex_rachel 
         1 ab bb cc 
         2 bb cc dd 
         3 ab bb cc
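
If the lines of one animal are not contiguous, the approach above starts a new output line every time the animal changes. A minimal sketch for that case, assuming the whole data set fits in memory, is to accumulate the genotypes of each animal in an awk array and print everything at the end. The function name transpose_markers and the array names geno and order are made up for this illustration; they are not part of SNPcol2line.awk.

```shell
# Hypothetical helper (not from the original script): transpose marker data
# even when the lines of one animal are not contiguous.
transpose_markers() {
  awk '
  {
    id = $1
    # remember each animal the first time it is seen, to keep input order
    if (!(id in geno)) {
      order[++n] = id
      geno[id] = ""
    }
    # append both alleles of this locus, followed by a space
    geno[id] = geno[id] $2 $3 " "
  }
  END {
    # one output line per animal, same layout as SNPcol2line.awk
    for (k = 1; k <= n; k++)
      printf("%10s %s\n", order[k], geno[order[k]])
  }' "$@"
}

# example: the lines of animal 1 are interleaved with those of animal 2
printf '1 a b\n2 b b\n1 b b\n2 c c\n' | transpose_markers
```

Animals are printed in order of first appearance, and markers in the order they are read for each animal. For large sorted files the streaming script above is preferable, since it keeps only the current line in memory.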