Tuesday, November 27, 2018

How to split a large SNP file into individual chunks


To convert this (one animal per line, UGA format as explained here and here)


         1 012202001001000020202000
         2 012202001001000021211110
         3 012202001001000020202000
         4 012211001100010020202000
         5 111211101100010110111001
         6 002202000000000021211110
         7 012211001100010021211110
         8 022220002200020020202000
         9 012202001001000020202000
        10 111211101100010110111001

into as many individual SNP files as there are loci, e.g.


zcat singlemarker/x000001.gz | head
        1   1
        2   1
        3   1
        4   1
        5   2
        6   1
        7   1
        8   1
        9   1
       10   2

zcat singlemarker/x000002.gz | head
         1   2
         2   2
         3   2
         4   2
         5   2
         6   1
         7   2
         8   3
         9   2
        10   2

How to do it efficiently?

1-transpose the file (in my case via a Fortran program) to:

0000100001
1111101211
2222122221
2222222222
0001101201


i.e. the first line is the first marker, and so on
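
If no Fortran program is at hand, the transposition can be sketched with awk. This is not the program I used, only a minimal illustration on a toy file. The file names geno.txt and geno_transposed.txt are hypothetical, and because it keeps the whole matrix in memory it is probably only suitable for files of modest size.

```shell
# Toy input in the layout shown above: animal ID, then one genotype string.
# The working directory and file names are hypothetical.
tdir=$(mktemp -d)
printf '%s\n' '1 0122' '2 0122' '3 1112' > "$tdir/geno.txt"

# Append the j-th character of every genotype string to row j,
# then print the rows: one output line per marker.
awk '{
    n = length($2)
    for (j = 1; j <= n; j++) row[j] = row[j] substr($2, j, 1)
}
END {
    for (j = 1; j <= n; j++) print row[j]
}' "$tdir/geno.txt" > "$tdir/geno_transposed.txt"

cat "$tdir/geno_transposed.txt"
```

Each output line has one character per animal, in the same order as the input lines.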

2-use the extraordinary GNU split to split into files, one line (=one marker) at a time:


split -l 1 -d BB.700K.gen_transposed -a 6 --numeric-suffixes=1 --filter='gzip >$FILE.gz'

This command splits the input one line at a time (-l 1), creates a series of files with numeric suffixes starting at 1 (--numeric-suffixes=1) of width 6 (-a 6), and finally pipes each piece through gzip to obtain compressed files such as x000002.gz (--filter='gzip >$FILE.gz').
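
To try the command safely before running it on the real file, here is a small sketch on a three-marker toy file. It assumes GNU split from a reasonably recent coreutils (both --filter and --numeric-suffixes=FROM are GNU extensions). The temporary directory and the file name toy_transposed are made up for the example. Giving a prefix as the last argument sends the pieces to that directory.

```shell
# Toy transposed file: one marker per line (hypothetical name and location).
workdir=$(mktemp -d)
printf '%s\n' '0000100001' '1111101211' '2222122221' > "$workdir/toy_transposed"

# Same options as in the command above. The prefix "$workdir/x" makes split
# set $FILE to $workdir/x000001, $workdir/x000002, ... for the filter.
split -l 1 -a 6 --numeric-suffixes=1 --filter='gzip > "$FILE.gz"' \
    "$workdir/toy_transposed" "$workdir/x"

ls "$workdir"
zcat "$workdir/x000002.gz"
```

The last command should print the second marker line only.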

Still, each file holds its marker on a single line:

0000100001

1111101211

2222122221

3-use GNU fold to insert a newline after each character:

zcat singlemarker/x000001.gz | fold -w 1
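
The target files shown at the top also carry the animal number, and their genotype codes look like the original codes plus 1 (0 becomes 1, 1 becomes 2). That recoding is my reading of the example, not something stated, so treat the following as a sketch under that assumption. The column widths in the printf format are guesses as well.

```shell
# One compressed marker line standing in for singlemarker/x000001.gz.
marker_file=$(mktemp)
printf '0000100001\n' | gzip > "$marker_file"

# fold -w 1 puts each genotype on its own line; awk prints the line number
# (NR, used here as the animal number) and the genotype code plus 1.
# ASSUMPTION: the +1 recoding and the field widths are inferred, not given.
result=$(zcat < "$marker_file" | fold -w 1 |
    awk '{ printf "%10d %3d\n", NR, $1 + 1 }')

printf '%s\n' "$result"
```

Using NR as the animal number only works if the animals are numbered 1, 2, 3, ... in file order, as in the example.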

Formatting numbers in a bash script

from https://linuxconfig.org/bash-printf-syntax-basics-with-examples


for i in $( seq 1 10 ); do printf "%03d\t" "$i"; done
001     002     003     004     005     006     007     008     009     010
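
The same zero padding can rebuild the names that split produces with -a 6 --numeric-suffixes=1, which may be handy for looping over the per-marker files. A small sketch, assuming the files sit in the singlemarker/ directory as in the examples above:

```shell
# Print the expected file names for the first three markers.
# %06d pads the marker number with zeros to width 6, matching split -a 6.
names=$(for i in $(seq 1 3); do printf 'singlemarker/x%06d.gz\n' "$i"; done)

printf '%s\n' "$names"
```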