Monday, October 9, 2017

Anatomy of a solved bug in blupf90

I had a request from Alban Bouquet to look into a strange Single Step GBLUP problem in multiple-trait pig breeding data. Using freely available blupf90 and the default iterative solver (PCG), a sequence of NaNs (not a number) is produced, with no solution. The intriguing aspect is that deleting a few records or genotypes at random produced a correct sequence of iterations leading to sensible solutions of the SSGBLUP equations. Also, the SSGBLUP run perfectly (and quicker) using blup90iod2.

Let's say that the problem is roughly equivalent in size to 1,000,000 animals in pedigree and 40,000 genotyped individuals.

Using debug options in compilation

optf90 = -g -traceback -check all -fpe0 
optf77 = -g -traceback -check all -fpe0 

showed that in the pcg.f90 subroutine there was a division by 0 in subroutine

subroutine solve_sparse_hashm_pcg 

a subroutine that worked perfectly for many many years. In this line:

alpha=tau/dot_product(p,w)

it turned out that w (a vector) had all values of 0. However, this vector is computed as

w=mult(a,p)

where a is a sparse hash matrix as described in the course notes. This mult is carried out in file fspak90.f90 (which can be downloaded here). If a has elements and p has elements, how come that their product is 0? The reason is that a , which is basically a three-column array, contains  lots of lines: actually, 2147483648 elements. It turns out that this is beyond the highest integer in fortran, which is 2147483647. In order to store very large matrices, the hash structure uses "bigger" integers(i8) (8 bytes) up to a MUCH larger value. So, when a hash matrix arrives with its integer(i8) to the naif mult subroutine with its "normal" integers, there is a conversion problem - and the result is 0.

Fixing the problem is as simple as changing
integer::j
into
integer(i8)::j

and the program now works perfectly...