artadia: November 2018

Tuesday, November 27, 2018

How to split a large SNP file into individual chunks

To convert this (one animal per line, UGA format as explained here and here)

1 012202001001000020202000

2 012202001001000021211110

3 012202001001000020202000

4 012211001100010020202000

5 111211101100010110111001

6 002202000000000021211110

7 012211001100010021211110

8 022220002200020020202000

9 012202001001000020202000

10 111211101100010110111001

into as many individual SNP files as there are loci, e.g.

zcat singlemarker/x000001.gz | head

1 1

2 1

3 1

4 1

5 2

6 1

7 1

8 1

9 1

10 2

zcat singlemarker/x000002.gz | head

1 2

2 2

3 2

4 2

5 2

6 1

7 2

8 3

9 2

10 2

How to do it efficiently?

1-transpose the file (in my case via Fortran program) to:

0000100001

1111101211

2222122221

2222222222

0001101201

i.e. the first line is the first marker (one character per animal), the second line is the second marker, and so on
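I did the transposition in Fortran, but for a small file the same idea can be sketched with awk. This is only an illustration, not the program I used: the function name transpose_snp is made up, it assumes the genotype string is the second whitespace-separated field, and it keeps the whole matrix in memory, so it will not scale to a very large file.

```shell
# Hypothetical helper (not the Fortran program from the post):
# read a UGA-format file (animal id, then genotype string) from stdin
# and print one line per marker, one character per animal.
# Assumption: the genotype string is field 2. Everything is held in memory.
transpose_snp() {
  awk '{
    n = length($2)
    # append this animal genotype at locus j to the line for marker j
    for (j = 1; j <= n; j++) row[j] = row[j] substr($2, j, 1)
    if (n > nloci) nloci = n
  }
  END { for (j = 1; j <= nloci; j++) print row[j] }'
}

# Tiny demo: 3 animals, 3 loci
printf '1 012\n2 011\n3 210\n' | transpose_snp
```

With the three demo animals this prints 002, 111 and 210, one marker per line.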

2-use the extraordinary GNU split to split into files, one line (=one marker) at a time:

split -l 1 -d BB.700K.gen_transposed -a 6 --numeric-suffixes=1 --filter='gzip >$FILE.gz'

This command puts one line in each output file (-l 1), names the files with numeric suffixes starting at 1 (--numeric-suffixes=1) of width 6 (-a 6), such as x000002.gz, and finally pipes each chunk through gzip to obtain compressed files (--filter='gzip >$FILE.gz').

At this point each file still holds its marker as a single line, e.g. for the first three files:
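To try the split step on something small before running it on the real file, a sketch like the following may help. The function name split_markers and the output directory are placeholders of mine, and it needs GNU split because --filter and --numeric-suffixes are GNU options. It passes a prefix to split so that the files land in a separate directory.

```shell
# Hypothetical helper: split_markers INFILE OUTDIR
# Writes one gzip-compressed file per line of INFILE, named OUTDIR/x000001.gz etc.
# Requires GNU split (--filter, --numeric-suffixes).
split_markers() {
  mkdir -p "$2"
  # $FILE is set by split itself for each chunk; single quotes keep our shell
  # from expanding it too early.
  split -l 1 -a 6 --numeric-suffixes=1 --filter='gzip > "$FILE.gz"' "$1" "$2/x"
}

# Demo on the three transposed marker lines shown above
demo=$(mktemp -d)
printf '0000100001\n1111101211\n2222122221\n' > "$demo/transposed.txt"
split_markers "$demo/transposed.txt" "$demo/singlemarker"
ls "$demo/singlemarker"
gzip -dc "$demo/singlemarker/x000002.gz"   # gzip -dc does the same job as zcat
```

The demo should list x000001.gz, x000002.gz and x000003.gz, and the last command prints the second marker, 1111101211.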

0000100001

1111101211

2222122221

3-use GNU fold to insert a newline after each character:

zcat singlemarker/x000001.gz | fold -w 1
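The example output at the top of the post has two columns, while fold alone gives only one. Comparing it with the input, the columns look like the animal number and the genotype shifted by one (0/1/2 becomes 1/2/3). Assuming that reading is right, a sketch of the extra step could be the following; the function name marker_to_column is hypothetical.

```shell
# Hypothetical helper: turn one marker line (stdin) into "animal genotype+1" rows.
# Assumption: animals are numbered 1..n in file order, and genotypes are
# recoded from 0/1/2 to 1/2/3, as the example output at the top suggests.
marker_to_column() {
  fold -w 1 | awk '{ print NR, $1 + 1 }'
}

# Demo with the first marker from the transposed file
printf '0000100001\n' | marker_to_column | head -n 5
```

For a real file this would be used as zcat singlemarker/x000001.gz | marker_to_column.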

Formatting numbers in a bash script

from https://linuxconfig.org/bash-printf-syntax-basics-with-examples

for i in $( seq 1 10 ); do printf "%03d\t" "$i"; done
001     002     003     004     005     006     007     008     009     010
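The same zero padding is handy for rebuilding the names that split produced above, for instance when looping over markers by number. A minimal sketch, assuming the default x prefix and the width of 6 used in the split command; the function name marker_file is made up for the example.

```shell
# Hypothetical helper: name of the split file for marker number $1,
# assuming prefix "x", a 6-digit suffix and a .gz extension.
marker_file() {
  printf 'x%06d.gz\n' "$1"
}

for i in 1 2 700000; do marker_file "$i"; done
```

This prints x000001.gz, x000002.gz and x700000.gz. Pass plain numbers such as 8 rather than 08, because printf reads a leading zero as octal.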