artadia

Friday, December 12, 2025

two simple python programs

This program codes numerically unknown parent groups from a pedigree file where missing parents have codes "GR", and there is date of birth of the animal in the 4th column

count_upg={}
count_new_upg={}

def define_group(yob):
    if yob< 1995:
        yob=1995
    # The last year is 2022, check and change in new files
    out= -1 + (yob - 2022 -1) // 5
    return out

fin=open("pedigree.txt", "r")
fout=open("pedigreeUPG.txt","w")
for line in fin:
    row=line.split()
    for col in (1,2):
        if "GR" in row[col]:
            #It is a group 
            my_upg=row[col]
            if my_upg in count_upg:
                count_upg[my_upg]+=1
            else:
                count_upg[my_upg]=1
            #assign new group
            yob=int(row[3])
            new_group=define_group(yob)
            if new_group in count_new_upg:
                count_new_upg[new_group]+=1
            else:
                count_new_upg[new_group]=1
            row[col]=new_group
    print(*row,file=fout)
  
            
for k,v in count_upg.items():
    print(k,v)

for k,v in count_new_upg.items():
    print(k,v)

and this other program reads a pedigree file, stores the animals, and then reads a genotype file one line at a time. The genotype line is written to file if the animal is present in the pedigree file.

list_animals=set()

fin=open("pedigreeUPG.txt","r")
for line in fin:
    row=line.split()
    anim=row[0]
    list_animals.add(anim)
print("list animals",len(list_animals))

n=0
fin=open("genotypes.txt","r")
fout=open("genotypes.txt.clean","w")
for line  in fin:
    row=line.split()
    anim=row[0]
    if anim in list_animals:
        n+=1
        print(line.strip(),file=fout)
print(n)

Friday, September 5, 2025

How to find out Intel MKL version

First, find out your installation:

$ echo $MKLROOT

/opt/intel/oneapi/mkl/2023.0.0

Go there and find the file mkl_version.h :

$ find ./ -name *version*

look there:

#define INTEL_MKL_VERSION 20230000

or, you can write a program to find out:

https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-fortran/2025-1/mkl-get-version-string.html

$ vim mklversion.f90

!$ use omp_lib

!https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-fortran/2025-1/mkl-get-version-string.html

character(198):: buf

call mkl_get_version_string(buf)

write(*,'(a)') buf

end

$ ifort -qmkl mklversion.f90

$ ./a.out

Intel(R) oneAPI Math Kernel Library Version 2023.0-Product Build 20221128 for Intel(R) 64 architecture applications

Tuesday, July 29, 2025

line end Windows versus others

Windows users that introduce their files in Unix (Linux, Mac) systems sometimes have lots of unexpected problems due to end of line codings. You need to think that a file is just a long stream in which it happens that some hidden characters represent end-of-line. The end of lines is represented with different conventions in the Windows (DOS) and Unix (Linux, Mac) worlds as can be seen here:

https://stackoverflow.com/questions/1552749/difference-between-cr-lf-lf-and-cr-line-break-types

Windows uses CR LF and Unix uses LF. OLd Mac uses CR but modern versions use LF.

How can I see these differences?

- the Windows editor Notepad++ shows them

- the Mac editor Visual Studio Code shows its existence here:

- vim shows the type when you open the file:

- the file command shows it too

$ file data.txt

data.txt: ASCII text, with CRLF line terminators

$ file extractDescendants.py

extractDescendants.py: Python script, ASCII text executable

because the last one doesn't say CRLF, then it's LF so it's Unix format.

How can I convert among formats?

Most often you want to convert from Windows (CRLF) to Linux (LF) .

From memory, Notepad++ allows you to save in Linux format.

In Visual Studio Code, click on the bottom sign "CRLF" and see appear a menu that allows to change it:

In the command line:

many servers have the programs flip and dos2unix that allow to do that:

$ flip -h

Usage: flip [-t|-u|-d|-m] filename[s]

Converts ASCII files between Unix, MS-DOS/Windows, or Macintosh newline formats

Options:

-u = convert file(s) to Unix newline format (newline)

-d = convert file(s) to MS-DOS/Windows newline format (linefeed + newline)

-m = convert file(s) to Macintosh newline format (linefeed)

-t = display current file type, no file modifications

$ dos2unix -h

dos2unix 6.0.3 (2013-01-25)

Usage: dos2unix [options] [file ...] [-n infile outfile ...]

if you don't have any of this, you can install flip from here:

https://ccrma.stanford.edu/~craig/utility/flip/

This is how we compiled it for MacOs:

g++ -ansi -O3 flip.cpp -o flip

You only need to use flip -u myfile and it will modify in-place my file to Unix format. (You DON'T need to use flip -m because that Mac format is obsolete now.)

For instance:

$ file mydata.txt

mydata.txt: ASCII text, with CRLF line terminators

$ flip -u mydata.txt

$ file mydata.txt

mydata.txt: ASCII text

Tuesday, March 25, 2025

handling python files in parallel

ar=[str(x) for x in range(0,5)]

print(*ar)

handles = [open('file'+x, 'w') for x in ar] # not using generator

#https://stackoverflow.com/questions/1747817/create-a-dictionary-with-comprehension

myHandles=dict(zip(ar,handles))

for i in range(1,100):

myf=str(i%5)

print("a",i,file=myHandles[myf])

Wednesday, November 27, 2024

quick filter in awk

make a list of animals in col2 of 1st file, then print 2nd file if animals ar NOT in the list

awk 'FNR==NR{inn[$2]; next} !($2 in inn)' test5 relGenApprox

Friday, October 18, 2024

python like awk

so I want to put Python in a pipe where it reads from std in , does something, writes to std out:

$ cat test.py

#!/usr/bin/env python3

import sys

# https://stackoverflow.com/questions/1450393/how-do-i-read-from-stdin

for n,line in enumerate(sys.stdin):

a=line.split()

if n%5==0:

sys.stdout.write(line)

Friday, March 1, 2024

extract UPG allocation from renumf90 log

so you want to know from the output of renumf90 (say ren.log) how many animals were allocated to each UPG?

The output looks like this:

Unknown parent group allocation

Equation Group #Animals

59815785 1 21349

...

Max group = 424; Max UPG ID = 59816208

...

Use this in the command line:

$ sed -n '/allocation/,/Max\ group/p' ren.log | awk '$1 ~ /^[0-9]+$/'

e.g.

59815785 1 21349

59815786 2 4615

explanations:

find between two patterns:

https://askubuntu.com/a/849016

check if the 1st column is a (positive) integer:

https://stackoverflow.com/questions/28878995/check-if-a-field-is-an-integer-in-awk

Tuesday, January 23, 2024

Dictionary of arrays in Julia and breed fractions

So in Julia I want to create a dictionary (hash table or associative array) of arrays to store breed composition in dairy cattle. It took me a while but I found out how to declare it:

julia> a=Dict{String,Array{Float64,1}}()

Imagine the following pedigree

A 0 0 Holstein
B 0 0 Jersey
C A B
D A C

then this Julia script computes breed fractions

#=
A 0 0 Holstein
B 0 0 Jersey
C A B
D A C
=#

breedcomp=Dict{String,Array{Float64,1}}()

#purebred founders
breedcomp["A"]=[1.0,0.0]
breedcomp["B"]=[0.0,1.0]
#rest of pedigree
breedcomp["C"]=0.5*(breedcomp["A"]+breedcomp["B"])
breedcomp["D"]=0.5*(breedcomp["A"]+breedcomp["C"])

display(breedcomp)

Dict{String, Vector{Float64}} with 4 entries:
  "B" => [0.0, 1.0]
  "A" => [1.0, 0.0]
  "C" => [0.5, 0.5]
  "D" => [0.75, 0.25]

Monday, October 23, 2023

SNP effects from Single Step GBLUP with APY

APY is a technique that allows representing the inverse of a genomic relationship matrix in a sparse format by choosing a "core" of animals (here and here). The authoritative guide to APY is Bermann et al. 2022: this paper .

One of the key aspects of genomic models is the ability of estimating SNP effects and then apply them to newly genotyped animals, what is know as Indirect Predictions. Matias and I found out that it's easier than we thought as shown here.

Among many other things Bermann et al. show that one can write indirect predictions of "non-core" animals as (I use the original equation numbering)

eq. (10) : $latex \mathbf{{u}_n}= \mathbf{Z}_n \mathbf{Z}'_c (\mathbf{Z}_c \mathbf{Z}'_c)^{-1} \mathbf{Z}_c \mathbf{{a}} + \boldsymbol{\xi} $latex

where $latex \mathbf{{a}} $latex are SNP effects and $latex \boldsymbol{\xi} $latex is an error term that does not depend on $latex \mathbf{{u}_c} $latex.

we obtain SNP effects from eq. 21 in Bermann et al:

$latex \mathbf{\hat{a}}=k \mathbf{Z}'_c \mathbf{G}_{cc}^{-1} \mathbf{\hat{u}}_{c} $latex

we plug that into (10) and expand:

$latex \mathbf{\hat{u}_n}= \mathbf{Z}_n \mathbf{Z}'_c (\mathbf{Z}_c \mathbf{Z}'_c)^{-1} \mathbf{Z}_c \mathbf{\hat{a}} =k \mathbf{Z}_n \mathbf{Z}'_c (k \mathbf{Z}_c \mathbf{Z}'_c)^{-1} \mathbf{Z}_c \mathbf{\hat{a}} $latex

we substitute for $latex \mathbf{\hat{a}} $latex:

$latex \mathbf{\hat{u}_n} =k \mathbf{Z}_n \mathbf{Z}'_c (k \mathbf{Z}_c \mathbf{Z}'_c)^{-1} \mathbf{Z}_c k \mathbf{Z}'_c \mathbf{G}_{cc}^{-1} \mathbf{\hat{u}}_{c}=k \mathbf{Z}_n \mathbf{Z}'_c \mathbf{G}_{cc}^{-1} \mathbf{\hat{u}}_{c} = \mathbf{Z}_n \mathbf{\hat{a}} $latex

which is the original eq. 21. This is because $latex \hat{a} $latex is part of the column space of $latex \mathbf{Z}'_c $latex, then $latex \mathcal{P}\mathbf{\hat{a}} = \mathbf{\hat{a}} $latex.

Finally, obtaining Indirect Predictions from APY is deadly simple and intuitive:

$latex \mathbf{\hat{a}}=k \mathbf{Z}'_c \mathbf{G}_{cc}^{-1} \mathbf{\hat{u}}_{c} $latex
$latex \mathbf{\hat{u}_n} = \mathbf{Z}_n \mathbf{\hat{a}} $latex

Thursday, May 25, 2023

compiling macs in MacBook with Ventura

I wanted to compile the coalescent simulator macs in the Mac running Ventura with M2 chips. This was of course the beginning of a great adventure. I end up doing the following:

- install library boost using homebrew

- dig out the path for the boost library e.g. as here, which turned out to be /opt/homebrew/Cellar/boost/1.81.0_1/include

- finally, modify the makefile as follows

# compile options
CFLAGS = -Wall -g
#CFLAGS = -Wall -O3
# Add location of any library locations below with -L 
LINKFLAGS = 
#LINKFLAGS = -static 

# compiler
CC = g++

# libraries.  For a local Boost installation
# Example:
#LIB = -I /Users/garychen/software/boost_1_36_0
# Default:
#LIB = -I .
LIB = -I /opt/homebrew/Cellar/boost/1.81.0_1/include

# simulator name
SIM=macs

OBJS = simulator.o algorithm.o datastructures.o

$(SIM)  : $(OBJS) 
    $(CC) -o $(SIM) $(OBJS) $(LINKFLAGS)

simulator.o: simulator.cpp simulator.h
    $(CC) $(CFLAGS) $(LIB) -c $<

algorithm.o: algorithm.cpp simulator.h
    $(CC) $(CFLAGS) $(LIB) -c $<

datastructures.o: datastructures.cpp simulator.h
    $(CC) $(CFLAGS) $(LIB) -c $<

Thursday, February 2, 2023

draw many elements julia

# a vector of frequencies

p=rand(Beta(2,2),10)

# a vector of "distributions"

a=Binomial.(2,p)

# a vector of vectors

b=rand.(a,1)

# collapsed (row vector)

reduce(hcat,b)

# draw 5 animals at once

b=rand.(a,5)

# collapse

bb=reduce(hcat,b)

#compute frequencies

pest = mean(bb,dims=1)/2

Tuesday, October 4, 2022

Colleau's (2002) algorithm for dummies

Colleau (1992) realized that computations associated with matrix the product Ax or similar products (in different notations) TDT'x , T'x , Tx etc, where T is a triangular matrix that in the i-th row contains the fractions of individual i that come from all preceding individuals, could be easily done by solving instead of computing the whole matrix A. As done for instance in Aguilar et al. 2011 ; this is in fact a key algorithm in implementation of ssGBLUP.

Whereas matrix T is a bit complicated to compute, it was realized (in the 70's ?) that $latex T^{-1} $latex had a very simple structure: $latex T^{-1} = I -P $latex where P contains 0.5 in the (individual, sire) and (individual, dam) locations and 0 otherwise - see Quaas (1988) for explanations.

A word on notation: Colleau calls T the matrix that links animals to parents; Quaas calls it P. In my opinion Quaas notation is more popular and I stick to it.

The matrix of relationship can be shown to be

$latex A = (I - P)^{- 1}D{(I - P)^{- 1}}^{'} $latex

With inverse

$latex A^{- 1} = (I - P)^{'}D^{- 1}(I - P) $latex

Where $latex P $latex contains, in the i-th row, values of 0.5 for sire(i) and dam(i). This represents the expression $latex u_{i} = u_{s} + u_{d} + \phi $latex. Matrix $latex D $latex is diagonal and contains the variance of the mendelian sampling, i.e. $latex D_{ii} = 0.5 -0.25 \left( F_{s} + F_{d} \right) $latex with Meuwissen and Luo (1992) presenting, in the case of unknown ancestor, $latex s = 0 $latex (or $latex d = 0 $latex ), their programming set $latex F_{0} = - 1 $latex.

To help ideas, a small pedigree due to Kempthorne is as follows :

A 0 0

B 0 0

D A B

E A D

F B E

Z A B

Renumbered as

1 0 0

2 0 0

3 1 2

4 1 2

5 1 3

6 2 5

The corresponding matrices are

$latex P = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0.5 & 0.5 & 0 & 0 & 0 & 0 \\ 0.5 & 0.5 & 0 & 0 & 0 & 0 \\ 0.5 & 0 & 0.5 & 0 & 0 & 0 \\ 0 & 0.5 & 0 & 0 & 0.5 & 0 \\ \end{pmatrix} $latex

$latex D = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0.5 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.5 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0.5 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0.4375 \\ \end{pmatrix} $latex

With $latex F = \begin{pmatrix} 0 & 0 & 0 & 0 & 0.25 & 0.125 \\ \end{pmatrix} $latex

And finally , A=

\[,1\] \[,2\] \[,3\] \[,4\] \[,5\] \[,6\]

\[1,\] 1.000 0.000 0.500 0.5 0.75 0.375

\[2,\] 0.000 1.000 0.500 0.5 0.25 0.625

\[3,\] 0.500 0.500 1.000 0.5 0.75 0.625

\[4,\] 0.500 0.500 0.500 1.0 0.50 0.500

\[5,\] 0.750 0.250 0.750 0.5 1.25 0.750

\[6,\] 0.375 0.625 0.625 0.5 0.75 1.125

The idea of Colleau is to use this decomposition to compute quickly products of the form $latex x = Av $latex solving the system of equations $latex A^{- 1}x = v $latex as $latex (I - P)^{'}D^{- 1}(I - P)x = v $latex from left to right in three steps:

1. Solve $latex (I - P)^{'}a = v $latex

2. Solve $latex D^{- 1}b = a $latex

3. Solve $latex (I - P)x = b $latex

To solve (1) we use the special structure of

$latex (I - P)^{'} = \begin{pmatrix} 1 & 0 & - 0.5 & - 0.5 & - 0.5 & 0 \\ 0 & 1 & - 0.5 & - 0.5 & 0 & - 0.5 \\ 0 & 0 & 1 & 0 & - 0.5 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & - 0.5 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ \end{pmatrix} $latex

To solve the system

$latex \begin{pmatrix} 1 & 0 & - 0.5 & - 0.5 & - 0.5 & 0 \\ 0 & 1 & - 0.5 & - 0.5 & 0 & - 0.5 \\ 0 & 0 & 1 & 0 & - 0.5 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & - 0.5 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ \end{pmatrix} \begin{pmatrix} a_{1} \\ a_{2} \\ \ldots \\ \\ \\ a_{6} \\ \end{pmatrix} = \begin{pmatrix} v_{1} \\ v_{2} \\ \ldots \\ \\ \\ v_{6} \\ \end{pmatrix} $latex

Also, for each row, the locations of the $latex - 0.5 $latex are simply its sire and dam. For instance, the solution of $latex a_{2} $latex is $latex a_{2} = v_{2} + 0.5(a_{3} + a_{4} + a_{6}) $latex, in other words, the value of $latex v_{2} $latex plus 0.5 times the values of its offspring. Then, we can solve from the bottom to the top and adding "contributions" of +0.5 the value of "a" to the ancestors of animal i , and once we are in row i there is no more modifications to make to animal i **if ancestors are coded before offspring**. As follows:

a=0
do i=nanim,1,-1
   a(i)=a(i)+v(i)
   if (s(i)>0) a(s(i))=a(s(i))+0.5*a(i)
   if (d(i)>0) a(d(i))=a(d(i))+0.5*a(i)
enddo

To solve (2) there are two options, one is to precompute the diagonal of D and then

b=0
b=D*a

Another option is to compute D on the run :

! previously set F(0) to 0
do i=1,nanim
dii=0.5-0.25*(F(s(i)+F(d(i))
b(i)=a(i)*dii
enddo

To solve (3) we use the special structure of

$latex I - P = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ - 0.5 & - 0.5 & 1 & 0 & 0 & 0 \\ - 0.5 & - 0.5 & 0 & 1 & 0 & 0 \\ - 0.5 & 0 & - 0.5 & 0 & 1 & 0 \\ 0 & - 0.5 & 0 & 0 & - 0.5 & 1 \\ \end{pmatrix} $latex

To solve the system

$latex \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ - 0.5 & - 0.5 & 1 & 0 & 0 & 0 \\ - 0.5 & - 0.5 & 0 & 1 & 0 & 0 \\ - 0.5 & 0 & - 0.5 & 0 & 1 & 0 \\ 0 & - 0.5 & 0 & 0 & - 0.5 & 1 \\ \end{pmatrix}\begin{pmatrix} x_{1} \\ x_{2} \\ \ldots \\ \\ \\ x_{6} \\ \end{pmatrix} = \begin{pmatrix} b_{1} \\ b_{2} \\ \ldots \\ \\ \\ b_{6} \\ \end{pmatrix} $latex

Such that we can solve from the "top" to the "bottom".

x=0
do i=1,nanim
   x(i)=b(i)
   if(s(i)/=0) then
     x(i)=b(i)+0.5*x(s(i))
   endif
  if(d(i)/=0) then
     x(i)=b(i)+0.5*x(d(i))
   endif
enddo

with these three steps we are done.

Thursday, August 25, 2022

Compilation options

Another reminder for myself. Compilation options for Intel Fortran and gfortran:

Intel Fortran:

Optimization: -O3

Debug: -g -traceback -check all -fpe0 -check noarg_temp_created

(see here (in English) and here (in French) for meaning)

the option -o tells the name of the produced executable

i.e. the command is something like ifort -O3 myprog.f90 -o myprog

I often use also -heap-arrays to avoid using unlimit to manipulate the stack size: see here

gfortran:

Optimization: -O3

Debug: -g -Wall -fbounds-check -fbacktrace -ffpe-trap=invalid,zero,overflow

Detailed explanations here and here (page 24-27)

I often use the option -ffree-line-length-none to allow for lines of any length

Sunday, May 22, 2022

GWAS and QQplots and Genomic Control

I rarely do GWAS, but from time to time we observe strange things. Some QQplots align nicely:

but others do not (this below is sheep):

We think that in reality, the distribution of the test statistic under the null (SNP effect estimate / s.e. ) may not in fact follow a normal distribution, but something else, due to violations of assumptions. As a aconsequence, p-values do not follow a uniform under H0. For instance, markers are in LD and they capture information from their neighbours, animals are related (even if this is considered), traits are selected, and, essentially, models are not perfect. But we don’t really know - this is just speculation.

A method for correction that is quite popular in Human Genetics, is called Genomic Control (Devlin and Roeder 1999, Devlin et al. 2004). The method was designed to deal with hidden relationships but it is useful for us too. This is a crude method (but very popular) that simply corrects the p-values to make the qqplot fit the expected lines.

So let’s say that x are your -log10(p-values). then you estimate the slope of x, for the points that seem to be under the null (for instance, x[x<=3] in R parlance ). This gives you a slope, say 0.8. Then you multiply all x by the inverse of this slope, e.g. x <- x/0.8 and x now contains “corrected” -log10p-values. By construction, the qqplot will align well.

I think that this is not very scientific but it is current practice.

Below an R program with simulated p-values (some of them frankly extreme) and in one of the qqplot functions I introduced Genomic control (lines “AL”). At the end you obtain a Figure without:

or with genomic control:

# example of genomic control
# AL, 17 Feb 2022
set.seed(1234)
myQQplot = function(pvalues){
  # agnostic qq plot using only observed data 
  # https://genome.sph.umich.edu/wiki/Code_Sample:_Generating_QQ_Plots_in_R#R_Base_Graphics
  my.pvalues=pvalues
  exp.pvalues<-(rank(my.pvalues, ties.method="first")+.5)/(length(my.pvalues)+1)
  par(pty="s")
  plot(-log10(exp.pvalues), -log10(my.pvalues), xlim=c(0,10),ylim=c(0,10))
  abline(0,1)
  par(pty="m")
}


# qqplot assuming that p-values follow a uniform distribution (what we want)
# https://genome.sph.umich.edu/wiki/Code_Sample:_Generating_QQ_Plots_in_R#R_Base_Graphics

library(lattice)
qqunif.plot<-function(pvalues,
                      GC=FALSE,
                      should.thin=T, thin.obs.places=2, thin.exp.places=2, 
                      xlab=expression(paste("Expected (",-log[10], " p-value)")),
                      ylab=expression(paste("Observed (",-log[10], " p-value)")), 
                      draw.conf=TRUE, conf.points=1000, conf.col="lightgray", conf.alpha=.05,
                      already.transformed=FALSE, pch=20, aspect="iso", prepanel=prepanel.qqunif,
                      par.settings=list(superpose.symbol=list(pch=pch)), ...) {
  
  
  #error checking
  if (length(pvalues)==0) stop("pvalue vector is empty, can't draw plot")
  if(!(class(pvalues)=="numeric" || 
       (class(pvalues)=="list" && all(sapply(pvalues, class)=="numeric"))))
    stop("pvalue vector is not numeric, can't draw plot")
  if (any(is.na(unlist(pvalues)))) stop("pvalue vector contains NA values, can't draw plot")
  if (already.transformed==FALSE) {
    if (any(unlist(pvalues)==0)) stop("pvalue vector contains zeros, can't draw plot")
  } else {
    if (any(unlist(pvalues)<0)) stop("-log10 pvalue vector contains negative values, can't draw plot")
  }
  
  
  grp<-NULL
  n<-1
  exp.x<-c()
  if(is.list(pvalues)) {
    nn<-sapply(pvalues, length)
    rs<-cumsum(nn)
    re<-rs-nn+1
    n<-min(nn)
    if (!is.null(names(pvalues))) {
      grp=factor(rep(names(pvalues), nn), levels=names(pvalues))
      names(pvalues)<-NULL
    } else {
      grp=factor(rep(1:length(pvalues), nn))
    }
    pvo<-pvalues
    pvalues<-numeric(sum(nn))
    exp.x<-numeric(sum(nn))
    for(i in 1:length(pvo)) {
      if (!already.transformed) {
        pvalues[rs[i]:re[i]] <- -log10(pvo[[i]])
        exp.x[rs[i]:re[i]] <- -log10((rank(pvo[[i]], ties.method="first")-.5)/nn[i])
      } else {
        pvalues[rs[i]:re[i]] <- pvo[[i]]
        exp.x[rs[i]:re[i]] <- -log10((nn[i]+1-rank(pvo[[i]], ties.method="first")-.5)/(nn[i]+1))
      }
    }
  } else {
    n <- length(pvalues)+1
    if (!already.transformed) {
      exp.x <- -log10((rank(pvalues, ties.method="first")-.5)/n)
      pvalues <- -log10(pvalues)
    } else {
      exp.x <- -log10((n-rank(pvalues, ties.method="first")-.5)/n)
    }
  }
  
  # AL GENOMIC CONTROL modification
  # we have pvalues wich are in fact -log10pvalues
  # and exp.x (expected -log10pvalues)
  # we estimate the regression of one on the other
  linReg=lm(pvalues ~ 0 + exp.x)
  print(" linear regression observed -log10pval on expected")
  print(summary(linReg))
  if(GC){
    # estimate coefficient under the null
    linReg=lm(pvalues[pvalues<=3] ~ 0 + exp.x[pvalues<=3])
    print(" linear regression observed -log10pval on expected ONLY for -log10pval <3")
    print(summary(linReg))
    b=linReg$coefficients[1]
    cat("b",b,"\n")
    #correct
    pvalues=pvalues/b
  }
  ## AL end of the modification
  
  #this is a helper function to draw the confidence interval
  panel.qqconf<-function(n, conf.points=1000, conf.col="gray", conf.alpha=.05, ...) {
    require(grid)
    conf.points = min(conf.points, n-1);
    mpts<-matrix(nrow=conf.points*2, ncol=2)
    for(i in seq(from=1, to=conf.points)) {
      mpts[i,1]<- -log10((i-.5)/n)
      mpts[i,2]<- -log10(qbeta(1-conf.alpha/2, i, n-i))
      mpts[conf.points*2+1-i,1]<- -log10((i-.5)/n)
      mpts[conf.points*2+1-i,2]<- -log10(qbeta(conf.alpha/2, i, n-i))
    }
    grid.polygon(x=mpts[,1],y=mpts[,2], gp=gpar(fill=conf.col, lty=0), default.units="native")
  }
  
  #reduce number of points to plot
  if (should.thin==T) {
    if (!is.null(grp)) {
      thin <- unique(data.frame(pvalues = round(pvalues, thin.obs.places),
                                exp.x = round(exp.x, thin.exp.places),
                                grp=grp))
      grp = thin$grp
    } else {
      thin <- unique(data.frame(pvalues = round(pvalues, thin.obs.places),
                                exp.x = round(exp.x, thin.exp.places)))
    }
    pvalues <- thin$pvalues
    exp.x <- thin$exp.x
  }
  gc()
  
  prepanel.qqunif= function(x,y,...) {
    A = list()
    A$xlim = range(x, y)*1.02
    A$xlim[1]=0
    A$ylim = A$xlim
    return(A)
  }
  
  #draw the plot
  xyplot(pvalues~exp.x, groups=grp, xlab=xlab, ylab=ylab, aspect=aspect,
         prepanel=prepanel, scales=list(axs="i"), pch=pch,
         panel = function(x, y, ...) {
           if (draw.conf) {
             panel.qqconf(n, conf.points=conf.points, 
                          conf.col=conf.col, conf.alpha=conf.alpha)
           };
           panel.xyplot(x,y, ...);
           panel.abline(0,1);
         }, par.settings=par.settings, ...
  )
}

# simulate p-values under the null
set.seed(1234)
pvalOK=runif(40000)
hist(pvalOK)
myQQplot(pvalOK)
qqunif.plot(pvalOK)

#simulate *wrong* p-values 
pvalNotOK=rbeta(40000,.5,2)
hist(pvalNotOK)
myQQplot(pvalNotOK)
qqunif.plot(pvalNotOK,GC=FALSE)
# genomic control on these p-values looks good !!
qqunif.plot(pvalNotOK,GC=TRUE)