I rarely do GWAS, but from time to time we observe strange things. Some QQplots align nicely:
but others do not (this below is sheep):
We think that in reality, the distribution of the test statistic under the null (SNP effect estimate / s.e. ) may not in fact follow a normal distribution, but something else, due to violations of assumptions. As a aconsequence, p-values do not follow a uniform under H0. For instance, markers are in LD and they capture information from their neighbours, animals are related (even if this is considered), traits are selected, and, essentially, models are not perfect. But we don’t really know - this is just speculation.
A method for correction that is quite popular in Human Genetics, is called Genomic Control (Devlin and Roeder 1999, Devlin et al. 2004). The method was designed to deal with hidden relationships but it is useful for us too. This is a crude method (but very popular) that simply corrects the p-values to make the qqplot fit the expected lines.
So let’s say that x are your -log10(p-values). then you estimate the slope of x, for the points that seem to be under the null (for instance, x[x<=3] in R parlance ). This gives you a slope, say 0.8. Then you multiply all x by the inverse of this slope, e.g. x <- x/0.8 and x now contains “corrected” -log10p-values. By construction, the qqplot will align well.
I think that this is not very scientific but it is current practice.
Below an R program with simulated p-values (some of them frankly extreme) and in one of the qqplot functions I introduced Genomic control (lines “AL”). At the end you obtain a Figure without:
or with genomic control:
# example of genomic control # AL, 17 Feb 2022 set.seed(1234) myQQplot = function(pvalues){ # agnostic qq plot using only observed data # https://genome.sph.umich.edu/wiki/Code_Sample:_Generating_QQ_Plots_in_R#R_Base_Graphics my.pvalues=pvalues exp.pvalues<-(rank(my.pvalues, ties.method="first")+.5)/(length(my.pvalues)+1) par(pty="s") plot(-log10(exp.pvalues), -log10(my.pvalues), xlim=c(0,10),ylim=c(0,10)) abline(0,1) par(pty="m") } # qqplot assuming that p-values follow a uniform distribution (what we want) # https://genome.sph.umich.edu/wiki/Code_Sample:_Generating_QQ_Plots_in_R#R_Base_Graphics library(lattice) qqunif.plot<-function(pvalues, GC=FALSE, should.thin=T, thin.obs.places=2, thin.exp.places=2, xlab=expression(paste("Expected (",-log[10], " p-value)")), ylab=expression(paste("Observed (",-log[10], " p-value)")), draw.conf=TRUE, conf.points=1000, conf.col="lightgray", conf.alpha=.05, already.transformed=FALSE, pch=20, aspect="iso", prepanel=prepanel.qqunif, par.settings=list(superpose.symbol=list(pch=pch)), ...) { #error checking if (length(pvalues)==0) stop("pvalue vector is empty, can't draw plot") if(!(class(pvalues)=="numeric" || (class(pvalues)=="list" && all(sapply(pvalues, class)=="numeric")))) stop("pvalue vector is not numeric, can't draw plot") if (any(is.na(unlist(pvalues)))) stop("pvalue vector contains NA values, can't draw plot") if (already.transformed==FALSE) { if (any(unlist(pvalues)==0)) stop("pvalue vector contains zeros, can't draw plot") } else { if (any(unlist(pvalues)<0)) stop("-log10 pvalue vector contains negative values, can't draw plot") } grp<-NULL n<-1 exp.x<-c() if(is.list(pvalues)) { nn<-sapply(pvalues, length) rs<-cumsum(nn) re<-rs-nn+1 n<-min(nn) if (!is.null(names(pvalues))) { grp=factor(rep(names(pvalues), nn), levels=names(pvalues)) names(pvalues)<-NULL } else { grp=factor(rep(1:length(pvalues), nn)) } pvo<-pvalues pvalues<-numeric(sum(nn)) exp.x<-numeric(sum(nn)) for(i in 1:length(pvo)) { if (!already.transformed) { pvalues[rs[i]:re[i]] <- -log10(pvo[[i]]) exp.x[rs[i]:re[i]] <- -log10((rank(pvo[[i]], ties.method="first")-.5)/nn[i]) } else { pvalues[rs[i]:re[i]] <- pvo[[i]] exp.x[rs[i]:re[i]] <- -log10((nn[i]+1-rank(pvo[[i]], ties.method="first")-.5)/(nn[i]+1)) } } } else { n <- length(pvalues)+1 if (!already.transformed) { exp.x <- -log10((rank(pvalues, ties.method="first")-.5)/n) pvalues <- -log10(pvalues) } else { exp.x <- -log10((n-rank(pvalues, ties.method="first")-.5)/n) } } # AL GENOMIC CONTROL modification # we have pvalues wich are in fact -log10pvalues # and exp.x (expected -log10pvalues) # we estimate the regression of one on the other linReg=lm(pvalues ~ 0 + exp.x) print(" linear regression observed -log10pval on expected") print(summary(linReg)) if(GC){ # estimate coefficient under the null linReg=lm(pvalues[pvalues<=3] ~ 0 + exp.x[pvalues<=3]) print(" linear regression observed -log10pval on expected ONLY for -log10pval <3") print(summary(linReg)) b=linReg$coefficients[1] cat("b",b,"\n") #correct pvalues=pvalues/b } ## AL end of the modification #this is a helper function to draw the confidence interval panel.qqconf<-function(n, conf.points=1000, conf.col="gray", conf.alpha=.05, ...) { require(grid) conf.points = min(conf.points, n-1); mpts<-matrix(nrow=conf.points*2, ncol=2) for(i in seq(from=1, to=conf.points)) { mpts[i,1]<- -log10((i-.5)/n) mpts[i,2]<- -log10(qbeta(1-conf.alpha/2, i, n-i)) mpts[conf.points*2+1-i,1]<- -log10((i-.5)/n) mpts[conf.points*2+1-i,2]<- -log10(qbeta(conf.alpha/2, i, n-i)) } grid.polygon(x=mpts[,1],y=mpts[,2], gp=gpar(fill=conf.col, lty=0), default.units="native") } #reduce number of points to plot if (should.thin==T) { if (!is.null(grp)) { thin <- unique(data.frame(pvalues = round(pvalues, thin.obs.places), exp.x = round(exp.x, thin.exp.places), grp=grp)) grp = thin$grp } else { thin <- unique(data.frame(pvalues = round(pvalues, thin.obs.places), exp.x = round(exp.x, thin.exp.places))) } pvalues <- thin$pvalues exp.x <- thin$exp.x } gc() prepanel.qqunif= function(x,y,...) { A = list() A$xlim = range(x, y)*1.02 A$xlim[1]=0 A$ylim = A$xlim return(A) } #draw the plot xyplot(pvalues~exp.x, groups=grp, xlab=xlab, ylab=ylab, aspect=aspect, prepanel=prepanel, scales=list(axs="i"), pch=pch, panel = function(x, y, ...) { if (draw.conf) { panel.qqconf(n, conf.points=conf.points, conf.col=conf.col, conf.alpha=conf.alpha) }; panel.xyplot(x,y, ...); panel.abline(0,1); }, par.settings=par.settings, ... ) } # simulate p-values under the null set.seed(1234) pvalOK=runif(40000) hist(pvalOK) myQQplot(pvalOK) qqunif.plot(pvalOK) #simulate *wrong* p-values pvalNotOK=rbeta(40000,.5,2) hist(pvalNotOK) myQQplot(pvalNotOK) qqunif.plot(pvalNotOK,GC=FALSE) # genomic control on these p-values looks good !! qqunif.plot(pvalNotOK,GC=TRUE)
No comments:
Post a Comment