Friday, June 5, 2020

Nucleotide to A/B notation in Illumina SNP genotypes

For the Smarter project we need to exchange genotypes, and we thought of exchanging them in Illumina's A/B allele notation, which avoids ambiguity about which DNA strand is being read. This is described in this document. However, our files, after some quality control, contain A/C/G/T nucleotides, like this (ovine):


Sample ID       Sample Name     SNP Name        Allele1 - Top   Allele2 - Top   GC Score
ES140000270478  PLACA_CIC_12_96 250506CS3900140500001_312.1     A       G       0.7341
ES140000270478  PLACA_CIC_12_96 250506CS3900065000002_1238.1    G       G       0.8932

Sometimes we have both the nucleotides and the A/B alleles (this example is bovine):

SNP Name            Sample ID  Allele1 - Forward  Allele2 - Forward  Allele1 - Top  Allele2 - Top  Allele1 - AB  Allele2 - AB  GC Score  X      Y
ARS-BFGL-BAC-10172  USA201811  G                  G                  G              G              B             B             0.9506    0.012  1.036
ARS-BFGL-BAC-1020   USA201811  G                  G                  G              G              B             B             0.9673    0.005  0.652


So we are now in the inverse position: we have to convert back from A/C/G/T to A/B.
This depends on whether the strand read was a TOP strand or a BOT (bottom) strand, which in turn depends on the particular locus. The rules depend on the two possible alleles at the locus.

For some pairs of alleles it does not matter whether the strand is TOP or BOT:

A/G -> A/B
A/C -> A/B
T/G -> A/B
T/C -> A/B

For other loci it does depend:
  • For loci in TOP strands

A/T -> A/B
G/C -> B/A


  • For loci in BOT strands


A/T -> B/A
G/C -> A/B
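
The rules above can be written as a small lookup. This is only a minimal Python sketch, not anything from Illumina's software: the names nucleotide_to_ab and genotype_to_ab are made up for illustration, and it assumes that the two possible alleles at each locus are already known from somewhere else (they are not in the genotype files shown above).

```python
# Hypothetical helpers illustrating the rules listed above.

# Pairs where the strand does not matter: A or T -> A, C or G -> B.
UNAMBIGUOUS = {"A": "A", "T": "A", "C": "B", "G": "B"}

# A/T and G/C pairs: the answer depends on the strand of the locus.
AMBIGUOUS = {
    "TOP": {"A": "A", "T": "B", "G": "B", "C": "A"},
    "BOT": {"A": "B", "T": "A", "G": "A", "C": "B"},
}

AMBIGUOUS_PAIRS = (frozenset("AT"), frozenset("GC"))


def nucleotide_to_ab(allele, snp_alleles, strand="TOP"):
    """Convert one nucleotide to 'A' or 'B'.

    allele      -- the nucleotide that was called, e.g. "G"
    snp_alleles -- the two possible nucleotides at the locus, e.g. "AG"
    strand      -- "TOP" or "BOT" (only used for A/T and G/C loci)
    """
    pair = frozenset(snp_alleles)
    if len(pair) != 2 or not pair <= set("ACGT"):
        raise ValueError(f"not a valid pair of alleles: {snp_alleles!r}")
    if allele not in pair:
        raise ValueError(f"{allele!r} is not one of {snp_alleles!r}")
    if pair in AMBIGUOUS_PAIRS:
        return AMBIGUOUS[strand][allele]
    return UNAMBIGUOUS[allele]


def genotype_to_ab(allele1, allele2, snp_alleles, strand="TOP"):
    """Convert both alleles of a genotype, e.g. ("A", "G") -> "AB"."""
    return (nucleotide_to_ab(allele1, snp_alleles, strand)
            + nucleotide_to_ab(allele2, snp_alleles, strand))


print(genotype_to_ab("A", "G", "AG"))         # AB
print(genotype_to_ab("G", "C", "GC", "TOP"))  # BA
print(genotype_to_ab("G", "C", "GC", "BOT"))  # AB
```

The strand argument is ignored for the four pairs where it does not matter, so it only needs to be right for A/T and G/C loci.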

How do we find out whether a locus is TOP or BOT? I am so inept in molecular genetics that I didn't know where to find this information (it must be in some database somewhere), but I found it in one of the files that come with the "raw" genotypes; it is headed "Locus Summary" and looks like this:

Locus Summary on ...
Row,Locus_Name,Illumicode_Name,...,Plus/Minus Strand
1,250506CS3900065000002_1238.1,49668394,...,TOP
2,250506CS3900140500001_312.1,29623404,...,TOP

So this says, for each locus on the chip, whether the locus is on a TOP or a BOT strand. All the loci in our ovine genotypes are TOP (there must be a reason for that?). So, for instance, the very first genotype in our examples, AG, becomes AB.
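
Reading the strand of each locus from such a file could look roughly like the sketch below. It is only an illustration under assumptions: the function name read_strands is made up, the sample text keeps just the columns visible in the excerpt above (the real file has more columns, which I have elided), and I assume the header row is the one starting with "Row,".

```python
import csv
import io


def read_strands(lines):
    """Return a dict mapping Locus_Name to its TOP/BOT value."""
    lines = list(lines)
    # Skip any preamble (e.g. the "Locus Summary on ..." line) by
    # looking for the header row, assumed to start with "Row,".
    start = next(i for i, line in enumerate(lines) if line.startswith("Row,"))
    reader = csv.DictReader(lines[start:])
    return {row["Locus_Name"]: row["Plus/Minus Strand"] for row in reader}


# Reduced sample with only the columns shown in the excerpt above.
sample = """Locus Summary on ...
Row,Locus_Name,Illumicode_Name,Plus/Minus Strand
1,250506CS3900065000002_1238.1,49668394,TOP
2,250506CS3900140500001_312.1,29623404,TOP
"""

strands = read_strands(io.StringIO(sample))
print(strands["250506CS3900140500001_312.1"])  # TOP
```

For a real file, the same function can be given an open file object instead of the io.StringIO sample.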