snp_file.txt
25 1121022100
600 0111220012
1333 0110111111
5 1120112102
89 0111220001
with no spaces between genotypes, id and genotypes separated by spaces - no tabs - and all genotypes starting at the same column. Sometimes it is not obvious how to get this format. Llibertat Tusell got a solution in R:
snps=sample(c(0,1,2),prob=c(.25,0.5,.25),size=50,replace=T)
X=matrix(ncol=10,nrow=5,snps)
animal=c(25,600,1333,5,89)
con <- file("snp_file.txt", "w")
for (i in 1:5){
tmp=paste( X[i,] , collapse = "" )
cat(format(animal[i],width=12),tmp,'\n',file=con)
}
close(con)
The R program collapses markers into a single string, then puts format to the animal "word" so that it has constant width, then it writes it to a file
Another solution is to use awk and start from a file, e.g.
$ cat exo_geno_spaces
1101 1 0 2 2 1 1 2
101 2 2 1 1 2 1 1
254 1 1 2 0 1 1 0
255 2 1 2 2 0 1 2
Then you can use an awk program ./remove_spaces_snps.awk:
#! /opt/local/bin/gawk -f
# this script removes space between SNP genotypes
# and formatting as UGA
BEGIN{}
{
# print animal
printf( "%20s",$1)
printf( "%1s"," ")
for (i =2; i<=NF; i++){
printf( "%1s",$i)
}
printf("\n")
}
END{}
This awk program prints on stdout the animal (with constant width) then markers without sopace separation, then a newline.
$ ./remove_spaces_snps.awk exo_geno_spaces > out
1101 1022112
101 2211211
254 1120110
255 2122012
ReplyDeleteHere another solution using shell one-liners
# remove leading whitespaces
sed -e 's/^[[:space:]]*//' exo_geno_spaces > tmp
# collapse genotypes
cut -d' ' -f2- tmp | tr -d ' ' | \
paste <(cut -d' ' -f1 tmp) - | \
awk '{ printf "%-20s %s\n", $1, $2 }' > out
rm tmp