Friday, December 22, 2017

Genetic Variances in an observed population

To describe the genetic variance of a population, we increasingly resort to the genetic variance of an observed population, as opposed to the genetic variance of an ideal population. For instance, the latter considers neither linkage disequilibrium nor deviations from Hardy-Weinberg equilibrium. Thus the former may be more useful or accurate.

The (empirical) genetic variance of a population composed of $n$ individuals is simply the (empirical) variance of its genetic values:

$latex var(u)=\frac{1}{n}{\sum_i (u_i^2)}-(\frac{1}{n}{\sum_i(u_i)})^2 $latex
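This is just the population (1/n) variance of the vector of genetic values. A minimal sketch in Python, where the vector `u` is made up purely for illustration:

```python
import numpy as np

# Hypothetical genetic values of n = 5 individuals (made-up numbers)
u = np.array([0.8, -0.3, 1.1, 0.2, -0.5])

# Empirical genetic variance: mean of squares minus squared mean
var_u = np.mean(u**2) - np.mean(u)**2

# This matches NumPy's population variance (1/n denominator, ddof=0)
assert np.isclose(var_u, np.var(u))
print(var_u)
```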

This definition does not consider any genetic structure among the observed individuals. However, because genetic values are unknown, we need to estimate them, typically in a Mixed Model context.

To my knowledge, there are two methods to do this. The first is to estimate a variance component (say $latex \sigma^2_u $latex), which actually refers to an infinite, unrelated population. Then, based on this variance component, we can estimate the expected value of the empirical $latex var(u)$latex over the multivariate distribution $latex Var(\bf u)=\bf K \sigma^2_u$latex, where $latex \bf K$latex is a relationship matrix. This results in "my" estimator:

$latex E(var(u))=D_K \sigma^2_u = (mean(diag({\bf K}))-mean({\bf K}))\sigma^2_u $latex
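In code, this estimator is a one-liner. A sketch assuming a relationship matrix `K` and a variance component `sigma2_u` are already available; the toy `K` and value of `sigma2_u` below are invented for illustration:

```python
import numpy as np

def expected_var_u(K, sigma2_u):
    """E(var(u)) = (mean(diag(K)) - mean(K)) * sigma2_u."""
    return (np.mean(np.diag(K)) - np.mean(K)) * sigma2_u

# Toy relationship matrix and variance component, for illustration only
K = np.array([[1.0, 0.5, 0.25],
              [0.5, 1.0, 0.25],
              [0.25, 0.25, 1.0]])
sigma2_u = 0.04

print(expected_var_u(K, sigma2_u))
```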

which is becoming increasingly used (and not only by us 😅), among other things because it is extremely simple.

A more refined approach was developed by Sorensen et al. (in my opinion, Daniel's best paper) and recently revived in the context of genomic prediction by Lehermeier et al. The estimator is Bayesian by nature, and it constructs the posterior distribution of $latex var(u)$latex from samples of the genetic values $latex \bf u$latex; i.e., at an iteration of the Gibbs sampler, the sample of $latex var(u)$latex is

$latex \widetilde{var(u)}=\frac{1}{n}{\sum_i (\tilde u_i^2)}-(\frac{1}{n}{\sum_i(\tilde u_i)})^2 $latex

This is well described in BGLR. In fact, "my" method is a simplification of Sorensen et al.'s method.

As a frequent blupf90 suite user, I want to implement this other approach with Gibbs2f90. It is actually quite simple and requires no modification of the code:


  1. write samples of the genetic values $latex \bf u$latex to the output, using OPTION solution all in the parameter file
  2. post-process these samples to obtain samples of $latex \widetilde{var(u)}$latex using $latex \widetilde{var(u)}=\frac{1}{n}{\sum_i (\tilde u_i^2)}-(\frac{1}{n}{\sum_i(\tilde u_i)})^2$latex
  3. get an estimate (e.g., the posterior mean) from the samples of $latex \widetilde{var(u)}$latex
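The post-processing (steps 2 and 3) can be sketched as follows. The parsing of the actual Gibbs2f90 solutions file is not shown; the array `U` of samples (one row per Gibbs iteration, one column per individual) is simulated here as a stand-in:

```python
import numpy as np

def var_u_samples(U):
    """One sample of var(u) per row (Gibbs iteration) of U."""
    return np.mean(U**2, axis=1) - np.mean(U, axis=1)**2

# Stand-in for samples read from the output of step 1
rng = np.random.default_rng(1)
U = rng.normal(scale=0.2, size=(1000, 50))  # 1000 iterations, 50 individuals

samples = var_u_samples(U)          # step 2: one var(u) sample per iteration
posterior_mean = samples.mean()     # step 3: point estimate of var(u)
print(posterior_mean)
```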

I have re-analysed, with Sorensen et al.'s method, the mice data set used for "my" estimator, to obtain estimates of genetic variance using four kinds of relationships: identity by state (IBS), VanRaden's G, pedigree relationships, and kernel similarities. The estimates (posterior means) of the genetic variance $latex var(u)$latex are:
  • IBS: 0.037
  • G: 0.035
  • Kernel: 0.050
  • pedigree: 0.035
which compare well with my previous estimates of 0.036, 0.033, 0.047 and 0.038. The larger value of the kernel estimator is because it includes both additive and dominance relationships. When do we expect to see real differences? For small, very informative data sets like Lehermeier et al.'s, in which case Sorensen et al.'s method is better.