Monday, January 17, 2022

check format of marker file for blupf90 programs

The genotype file for blupf90 needs to be in fixed-width format, with the ID and the marker string separated by spaces. For instance:

    toto 011220001121
  pepepe 012112011012

Of course, the number of markers must be the same on every line. This means that all lines must have the same total length and the marker strings must be right-aligned, so that they start in the same column.

It is not obvious how to check this in a long file. I have found this:

awk '{l0=length($0); split($0,a," "); l2=length(a[2]); print l2,l0}' markerfile | sort | uniq -c

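where markerfile is the file with markers. For each line, the script computes the total length of the line, then splits the line on whitespace (assuming there are only 2 columns) and computes the length of the second column, which is the number of markers. Then sort | uniq -c counts how many lines share each pair (number of markers, line length). It gives something like this: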

  29753 38523 38540

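where 29753 is the number of lines that show the pattern next to it (38523 markers in a line 38540 characters wide). This (one format only) is fine: since the marker string ends the line, equal marker counts and equal line lengths mean that the markers start in the same column on every line (assuming no trailing whitespace).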
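For use in a pipeline or a script, the same check can be turned into a pass/fail test. The following is only a sketch: the function name check_marker_format is made up for illustration (it is not part of blupf90), and it assumes, like the one-liner above, that each line has exactly 2 columns.

```shell
# Hypothetical helper, not part of blupf90.
# Succeeds (exit status 0) only if all lines of the file given as $1
# share a single (number of markers, line length) pair.
check_marker_format() {
    awk '
        {
            split($0, a, " ")                    # a[2] is the marker string
            key = length(a[2]) " " length($0)    # markers, total line length
            if (!(key in seen)) { seen[key] = 1; nformats++ }
        }
        END {
            print nformats + 0, "format(s) found"
            # an empty file gives 0 formats and is reported as a failure
            exit (nformats == 1 ? 0 : 1)
        }
    ' "$1"
}
```

It could be called as check_marker_format markerfile && echo OK, so that a file with more than one format stops a script before blupf90 is run.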

If the file is not properly formatted, we can get this:

  27914 38523 38543
   1839 38523 38546

where there are always 38523 markers, but 27914 lines have a total length of 38543 and 1839 have a longer length of 38546. This (more than one format) is NOT fine and will not be read properly.
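When more than one format shows up, the next question is which lines are the odd ones. One possible way, sketched below, is to read the file twice: the first pass counts the formats, the second pass prints every line that does not follow the most common one. The function name list_odd_lines is hypothetical, the 2-column assumption is the same as above, and treating the most common format as the correct one is also an assumption (in case of a tie, an arbitrary format wins).

```shell
# Hypothetical helper, not part of blupf90.
# Prints "line number, ID, number of markers, line length" for every line
# whose format differs from the most common one. The file is read twice,
# so $1 must be a regular file, not a pipe.
list_odd_lines() {
    awk '
        { split($0, a, " "); key = length(a[2]) " " length($0) }
        NR == FNR { count[key]++; next }     # first pass: count each format
        FNR == 1 {                           # second pass begins: find the most common format
            for (k in count) if (count[k] > best) { best = count[k]; major = k }
        }
        key != major { print FNR, a[1], key }
    ' "$1" "$1"
}
```

For instance, list_odd_lines markerfile | head shows the first few offending lines with their IDs, which helps to see what went wrong (typically IDs longer than the width reserved for them).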