Sunday, May 22, 2022

GWAS and QQplots and Genomic Control

 I rarely do GWAS, but from time to time we observe strange things. Some QQplots align nicely:


but others do not (this below is sheep):



We think that in reality, the distribution of the test statistic under the null (SNP effect estimate / s.e. ) may not in fact follow a normal distribution, but something else, due to violations of assumptions. As a aconsequence, p-values do not follow a uniform under H0. For instance, markers are in LD and they capture information from their neighbours, animals are related (even if this is considered), traits are selected, and, essentially, models are not perfect. But we don’t really know - this is just speculation.

A method for correction that is quite popular in Human Genetics, is called Genomic Control (Devlin and Roeder 1999, Devlin et al. 2004). The method was designed to deal with hidden relationships but it is useful for us too. This is a crude method (but very popular) that simply corrects the p-values to make the qqplot fit the expected lines.

So let’s say that x are your -log10(p-values). then you estimate the slope of x, for the points that seem to be under the null (for instance, x[x<=3] in R parlance ). This gives you a slope, say 0.8. Then you multiply all x by the inverse of this slope, e.g. x <- x/0.8  and x now contains “corrected” -log10p-values. By construction, the qqplot will align well.

I think that this is not very scientific but it is current practice.

Below an R program with simulated p-values (some of them frankly extreme) and in one of the qqplot functions I introduced Genomic control (lines “AL”).  At the end you obtain a Figure without:




 or with genomic control:





# example of genomic control
# AL, 17 Feb 2022
set.seed(1234)
myQQplot = function(pvalues){
  # agnostic qq plot using only observed data 
  # https://genome.sph.umich.edu/wiki/Code_Sample:_Generating_QQ_Plots_in_R#R_Base_Graphics
  my.pvalues=pvalues
  exp.pvalues<-(rank(my.pvalues, ties.method="first")+.5)/(length(my.pvalues)+1)
  par(pty="s")
  plot(-log10(exp.pvalues), -log10(my.pvalues), xlim=c(0,10),ylim=c(0,10))
  abline(0,1)
  par(pty="m")
}


# qqplot assuming that p-values follow a uniform distribution (what we want)
# https://genome.sph.umich.edu/wiki/Code_Sample:_Generating_QQ_Plots_in_R#R_Base_Graphics

library(lattice)
qqunif.plot<-function(pvalues,
                      GC=FALSE,
                      should.thin=T, thin.obs.places=2, thin.exp.places=2, 
                      xlab=expression(paste("Expected (",-log[10], " p-value)")),
                      ylab=expression(paste("Observed (",-log[10], " p-value)")), 
                      draw.conf=TRUE, conf.points=1000, conf.col="lightgray", conf.alpha=.05,
                      already.transformed=FALSE, pch=20, aspect="iso", prepanel=prepanel.qqunif,
                      par.settings=list(superpose.symbol=list(pch=pch)), ...) {
  
  
  #error checking
  if (length(pvalues)==0) stop("pvalue vector is empty, can't draw plot")
  if(!(class(pvalues)=="numeric" || 
       (class(pvalues)=="list" && all(sapply(pvalues, class)=="numeric"))))
    stop("pvalue vector is not numeric, can't draw plot")
  if (any(is.na(unlist(pvalues)))) stop("pvalue vector contains NA values, can't draw plot")
  if (already.transformed==FALSE) {
    if (any(unlist(pvalues)==0)) stop("pvalue vector contains zeros, can't draw plot")
  } else {
    if (any(unlist(pvalues)<0)) stop("-log10 pvalue vector contains negative values, can't draw plot")
  }
  
  
  grp<-NULL
  n<-1
  exp.x<-c()
  if(is.list(pvalues)) {
    nn<-sapply(pvalues, length)
    rs<-cumsum(nn)
    re<-rs-nn+1
    n<-min(nn)
    if (!is.null(names(pvalues))) {
      grp=factor(rep(names(pvalues), nn), levels=names(pvalues))
      names(pvalues)<-NULL
    } else {
      grp=factor(rep(1:length(pvalues), nn))
    }
    pvo<-pvalues
    pvalues<-numeric(sum(nn))
    exp.x<-numeric(sum(nn))
    for(i in 1:length(pvo)) {
      if (!already.transformed) {
        pvalues[rs[i]:re[i]] <- -log10(pvo[[i]])
        exp.x[rs[i]:re[i]] <- -log10((rank(pvo[[i]], ties.method="first")-.5)/nn[i])
      } else {
        pvalues[rs[i]:re[i]] <- pvo[[i]]
        exp.x[rs[i]:re[i]] <- -log10((nn[i]+1-rank(pvo[[i]], ties.method="first")-.5)/(nn[i]+1))
      }
    }
  } else {
    n <- length(pvalues)+1
    if (!already.transformed) {
      exp.x <- -log10((rank(pvalues, ties.method="first")-.5)/n)
      pvalues <- -log10(pvalues)
    } else {
      exp.x <- -log10((n-rank(pvalues, ties.method="first")-.5)/n)
    }
  }
  
  # AL GENOMIC CONTROL modification
  # we have pvalues wich are in fact -log10pvalues
  # and exp.x (expected -log10pvalues)
  # we estimate the regression of one on the other
  linReg=lm(pvalues ~ 0 + exp.x)
  print(" linear regression observed -log10pval on expected")
  print(summary(linReg))
  if(GC){
    # estimate coefficient under the null
    linReg=lm(pvalues[pvalues<=3] ~ 0 + exp.x[pvalues<=3])
    print(" linear regression observed -log10pval on expected ONLY for -log10pval <3")
    print(summary(linReg))
    b=linReg$coefficients[1]
    cat("b",b,"\n")
    #correct
    pvalues=pvalues/b
  }
  ## AL end of the modification
  
  #this is a helper function to draw the confidence interval
  panel.qqconf<-function(n, conf.points=1000, conf.col="gray", conf.alpha=.05, ...) {
    require(grid)
    conf.points = min(conf.points, n-1);
    mpts<-matrix(nrow=conf.points*2, ncol=2)
    for(i in seq(from=1, to=conf.points)) {
      mpts[i,1]<- -log10((i-.5)/n)
      mpts[i,2]<- -log10(qbeta(1-conf.alpha/2, i, n-i))
      mpts[conf.points*2+1-i,1]<- -log10((i-.5)/n)
      mpts[conf.points*2+1-i,2]<- -log10(qbeta(conf.alpha/2, i, n-i))
    }
    grid.polygon(x=mpts[,1],y=mpts[,2], gp=gpar(fill=conf.col, lty=0), default.units="native")
  }
  
  #reduce number of points to plot
  if (should.thin==T) {
    if (!is.null(grp)) {
      thin <- unique(data.frame(pvalues = round(pvalues, thin.obs.places),
                                exp.x = round(exp.x, thin.exp.places),
                                grp=grp))
      grp = thin$grp
    } else {
      thin <- unique(data.frame(pvalues = round(pvalues, thin.obs.places),
                                exp.x = round(exp.x, thin.exp.places)))
    }
    pvalues <- thin$pvalues
    exp.x <- thin$exp.x
  }
  gc()
  
  prepanel.qqunif= function(x,y,...) {
    A = list()
    A$xlim = range(x, y)*1.02
    A$xlim[1]=0
    A$ylim = A$xlim
    return(A)
  }
  
  #draw the plot
  xyplot(pvalues~exp.x, groups=grp, xlab=xlab, ylab=ylab, aspect=aspect,
         prepanel=prepanel, scales=list(axs="i"), pch=pch,
         panel = function(x, y, ...) {
           if (draw.conf) {
             panel.qqconf(n, conf.points=conf.points, 
                          conf.col=conf.col, conf.alpha=conf.alpha)
           };
           panel.xyplot(x,y, ...);
           panel.abline(0,1);
         }, par.settings=par.settings, ...
  )
}

# simulate p-values under the null
set.seed(1234)
pvalOK=runif(40000)
hist(pvalOK)
myQQplot(pvalOK)
qqunif.plot(pvalOK)

#simulate *wrong* p-values 
pvalNotOK=rbeta(40000,.5,2)
hist(pvalNotOK)
myQQplot(pvalNotOK)
qqunif.plot(pvalNotOK,GC=FALSE)
# genomic control on these p-values looks good !!
qqunif.plot(pvalNotOK,GC=TRUE)