Tuesday, September 20, 2016

formatting SNPs using R or awk

In some software for genomic prediction (blupf90, GS3 and may be other) the genotypes shoukd be given in a plain text file as follows:

snp_file.txt

          25 1121022100
         600 0111220012
        1333 0110111111
           5 1120112102
          89 0111220001 


with no spaces between genotypes, id and genotypes separated by spaces - no tabs - and all genotypes starting at the same column. Sometimes it is not obvious how to get this format. Llibertat Tusell got a solution in R:

snps=sample(c(0,1,2),prob=c(.25,0.5,.25),size=50,replace=T)
X=matrix(ncol=10,nrow=5,snps)
animal=c(25,600,1333,5,89)

con <- file("snp_file.txt", "w")

for (i in 1:5){
  tmp=paste( X[i,] , collapse = "" )
  cat(format(animal[i],width=12),tmp,'\n',file=con)
}
close(con)


The R program collapses markers into a single string, then puts format to the animal "word" so that it has constant width, then it writes it to a file

Another solution is to use awk and start from a file, e.g.

$ cat exo_geno_spaces 
  1101 1 0 2 2 1 1 2
     101 2 2 1 1 2 1 1
   254 1 1 2 0 1 1 0

   255    2 1 2 2 0 1 2

Then you can use an awk program ./remove_spaces_snps.awk:

#! /opt/local/bin/gawk -f
# this script removes space between SNP genotypes
# and formatting as UGA
BEGIN{}
{
    # print animal
    printf( "%20s",$1)
    printf( "%1s"," ")

    for (i =2; i<=NF; i++){
      printf( "%1s",$i)
    }
    printf("\n")
 }
END{}

This awk program prints on stdout the animal (with constant width) then markers without sopace separation, then a newline.

$ ./remove_spaces_snps.awk exo_geno_spaces > out
$ cat out 

                1101 1022112
                 101 2211211
                 254 1120110

                 255 2122012

1 comment:


  1. Here another solution using shell one-liners

    # remove leading whitespaces
    sed -e 's/^[[:space:]]*//' exo_geno_spaces > tmp

    # collapse genotypes
    cut -d' ' -f2- tmp | tr -d ' ' | \
    paste <(cut -d' ' -f1 tmp) - | \
    awk '{ printf "%-20s %s\n", $1, $2 }' > out
    rm tmp

    ReplyDelete