345 1111212111212112
346 1121111211211021
347 2022222220202022
348 1111111211211021
349 2022222220202022
350 1111212111212112
351 1121111211211021
352 1121111211211021
353 2022222220202022
The following file will be read erroneously and will give wrong results, because the genotypes do not all start at the same column:
345 1111212111212112
346 1121111211211021
347 2022222220202022
348 1111111211211021
1349 2022222220202022
12350 1111212111212112
IDs and genotypes (coded as 0/1/2) need to be separated by one or several spaces (not tabs), and the genotypes need to start at exactly the same column on every line.
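Before fixing a file, it can help to check whether it already satisfies this rule. The following is a minimal sketch, not part of any genotype software: `check_genotype_columns` is a hypothetical helper name, and it assumes each line holds exactly an ID and a genotype string separated by spaces.

```shell
# Hypothetical helper: report lines that contain tabs or whose genotype
# string starts at a different column than on the first line.
check_genotype_columns() {
    awk '
        /\t/ { printf "line %d: contains a tab\n", NR; bad = 1; next }
        # Match optional leading spaces, the ID, and the separating spaces;
        # the genotype string starts right after the matched text.
        match($0, /^ *[^ ]+ +/) {
            col = RLENGTH + 1
            if (first == 0) first = col
            else if (col != first) {
                printf "line %d: genotypes start at column %d, expected %d\n", NR, col, first
                bad = 1
            }
        }
        END { exit (bad + 0) }   # exit status 1 if any problem was found
    ' "$1"
}
```

Calling `check_genotype_columns gene.txt` prints one message per offending line and returns a non-zero exit status if any line is misaligned, so it can also be used in an `if` test in a script.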
A simple fix is to use awk.
Imagine that your genotype file is gene.txt:
$ cat gene.txt
345 1111212111212112
346 1121111211211021
347 2022222220202022
348 1111111211211021
1349 2022222220202022
12350 1111212111212112
Then you can run:
$ awk '{printf("%10s%1s%" length($2) "s\n",$1," ",$2)}' gene.txt >gene2.txt
In gene2.txt, the IDs are right-aligned in a 10-character field, so the genotypes all start at the same column:
$ cat gene2.txt
       345 1111212111212112
       346 1121111211211021
       347 2022222220202022
       348 1111111211211021
      1349 2022222220202022
     12350 1111212111212112
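The fixed width of 10 in the command above only aligns the genotypes if no ID is longer than 10 characters. As a hedged variant, the sketch below reads the file twice and pads the IDs to the length of the longest one. `pad_ids` is a hypothetical helper name, and the sketch assumes two space-separated fields per line.

```shell
# Hypothetical helper: right-align IDs to the width of the longest ID,
# so that the genotypes start at the same column on every line.
pad_ids() {
    # The file is passed twice: the first pass (NR == FNR) finds the
    # longest ID, the second pass prints the padded lines.
    awk '
        NR == FNR { if (length($1) > w) w = length($1); next }
        { printf("%" w "s %s\n", $1, $2) }
    ' "$1" "$1"
}
```

For example, `pad_ids gene.txt >gene2.txt` writes the aligned file. Because the input is read twice, it has to be a regular file rather than a pipe.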