Thursday, March 3, 2016

format genotypes for blupf90, GS3

Blupf90 and GS3 require genotypes to be in this form:

       345 1111212111212112
       346 1121111211211021
       347 2022222220202022
       348 1111111211211021
      1349 2022222220202022
     12350 1111212111212112
       351 1121111211211021
       352 1121111211211021

       353 2022222220202022

This also works (the number of spaces can vary, as long as the genotypes start at the same column):

     345   1111212111212112
       346 1121111211211021
      347  2022222220202022
       348 1111111211211021
      1349 2022222220202022
     12350 1111212111212112
       349 2022222220202022
       350 1111212111212112
       351 1121111211211021
       352 1121111211211021

       353 2022222220202022

Or this:

345   1111212111212112
346   1121111211211021
347   2022222220202022
348   1111111211211021
1349  2022222220202022
12350 1111212111212112

But this will be read incorrectly and give wrong results, because the genotypes do not all start at the same column:

345 1111212111212112
346 1121111211211021
347 2022222220202022
348 1111111211211021
1349 2022222220202022
12350 1111212111212112


IDs and genotypes (coded as 0/1/2) must be separated by one or more spaces (not tabs), and the genotypes must start at exactly the same column on every line.
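A quick way to check a file against this rule is to let awk report where the second field starts on each line. This is only a sketch: the function name check_geno and the file demo_check.txt are made up for the example, and it assumes exactly two fields per line (ID and genotype string). It is not part of blupf90 or GS3.

```shell
# Hypothetical demo file: the genotype on the last line starts one column late.
cat > demo_check.txt <<'EOF'
       345 1111212111212112
      1349 2022222220202022
     12350  1111212111212112
EOF

# Report tabs and any line whose genotype starts at a different column
# than on the first data line. Exit status is 1 if a problem was found.
check_geno() {
    awk '
    /\t/ { printf "line %d: contains a tab\n", NR; bad = 1 }
    NF == 2 {
        match($0, /[^ \t]+[ \t]*$/)     # RSTART = column where field 2 starts
        if (!col) col = RSTART          # remember the column of the first line
        else if (RSTART != col) {
            printf "line %d: genotype starts at column %d, expected %d\n", NR, RSTART, col
            bad = 1
        }
    }
    END { exit bad + 0 }
    ' "$1"
}

check_geno demo_check.txt || echo "file needs reformatting"
# prints:
# line 3: genotype starts at column 13, expected 12
# file needs reformatting
```

If the check prints nothing, all genotypes start at the same column and no tabs were found.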

A simple fix is to use awk.
Imagine that your genotype file is gene.txt :
$ cat gene.txt
345 1111212111212112
346 1121111211211021
347 2022222220202022
348 1111111211211021
1349 2022222220202022
12350 1111212111212112


then you can do

awk '{printf("%10s%1s%" length($2) "s\n",$1," ",$2)}' gene.txt >gene2.txt

In gene2.txt, the IDs are right-aligned in a 10-character field, so every genotype starts at the same column:

$ cat gene2.txt
       345 1111212111212112
       346 1121111211211021
       347 2022222220202022
       348 1111111211211021
      1349 2022222220202022
     12350 1111212111212112
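The %10s above assumes that no ID is longer than 10 characters; a longer ID would push its genotype to the right and break the alignment. A possible variant, sketched here with made-up file names (demo_gene.txt, demo_gene2.txt), reads the file twice: the first pass finds the longest ID, the second pads every ID to that width.

```shell
# Hypothetical demo input; the last ID has 12 characters, more than the fixed width of 10.
cat > demo_gene.txt <<'EOF'
345 1111212111212112
12350 1111212111212112
123456789012 2022222220202022
EOF

# Pass 1 (NR == FNR): record the length of the longest ID in w.
# Pass 2: right-align each ID to width w, then one space, then the genotype.
awk 'NR == FNR { if (length($1) > w) w = length($1); next }
     NF >= 2   { printf("%" w "s %s\n", $1, $2) }' demo_gene.txt demo_gene.txt > demo_gene2.txt

cat demo_gene2.txt
# prints:
#          345 1111212111212112
#        12350 1111212111212112
# 123456789012 2022222220202022
```

The file name is given twice on purpose, so that awk reads it once per pass.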


