Coulomb energy sometimes wrong in complex/nbnxn_vsite
Found an issue while testing today on tcbs26 (2 titanx, 36 cores): only Coulomb (SR) had a wrong value, and only at step 20. Nobody had any bright ideas, so I did a
regression bisection. The first bad commit is 6106367bd902e228c32cfcf7e079f3fa6493c15a (intra-gpu load balancing).
As a diagnostic, I was using complex/nbnxn_vsite. Step 0 is fine, but step 20 shows Coulomb (SR) about 4500 kJ/mol lower than the reference. This happens with both release and debug builds.
build=debug; export OMP_NUM_THREADS=5; ~/git/r51/build-cmake-gcc-gpu-$build-tcbs26/bin/gmx mdrun -s reference_s -ntmpi 6 -gpu_id 000111 -notunepme -nsteps 20 -v -nocopyright && gmx check -quiet -e reference_s -e2 ener
Using a single OpenMP thread always gives correct results.
This .tpr can only run with 1, 2, 3, 4 or 6 ranks. Only 6 ranks shows the issue, and does so seemingly regardless of how you fill -gpu_id (all 0 and a mix of 0 and 1 both show the problem).
Fix bug in GPU list balancing
The function split_sci_entry could produce empty lists, which can
cause illegal memory access or incorrect energies. Before commit
6106367b this bug was never triggered, since nsp_max was never smaller
than a full cj4 entry. But 6106367b introduced a bug that could
produce negative nsp_max.
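
To illustrate the failure mode described above, here is a minimal sketch of a splitting loop that cannot emit an empty list even when the target size is zero or negative. This is not the GROMACS implementation of split_sci_entry: the `Chunk` struct, the function name `splitWithoutEmptyLists` and the `nspMax` parameter are hypothetical stand-ins chosen for illustration, with `Chunk` loosely playing the role of a cj4 entry.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for one cj4 entry: it only records how many
// subcell pairs the entry contributes.
struct Chunk
{
    int nsp;
};

// Split a sequence of chunks into sub-lists of roughly at most nspMax
// pairs each. Assumption for this sketch: a chunk is never split, so a
// sub-list may exceed nspMax when a single chunk is larger than the target.
std::vector<std::vector<Chunk>> splitWithoutEmptyLists(const std::vector<Chunk>& chunks, int nspMax)
{
    // Guard 1: a zero or negative target must not be used as-is.
    nspMax = std::max(nspMax, 1);

    std::vector<std::vector<Chunk>> lists;
    std::vector<Chunk>              current;
    int                             nspCurrent = 0;

    for (const Chunk& c : chunks)
    {
        assert(c.nsp >= 0);
        // Guard 2: only close the current sub-list if it already holds
        // at least one chunk, so an empty sub-list is never emitted.
        if (!current.empty() && nspCurrent + c.nsp > nspMax)
        {
            lists.push_back(current);
            current.clear();
            nspCurrent = 0;
        }
        current.push_back(c);
        nspCurrent += c.nsp;
    }
    if (!current.empty())
    {
        lists.push_back(current);
    }
    return lists;
}
```

With either guard removed, a target smaller than one chunk could close a sub-list before anything has been added to it, which is the kind of empty list the commit message refers to.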
Fix bug in GPU list balancing
The function split_sci_entry could produce empty lists. This seems
not to have caused incorrect results, only slight extra processing
of empty workunits in the CUDA kernel. Incorrect Coulomb energies
could appear for empty lists with shift=CENTRAL, but that does not
seem to happen.
#5 Updated by Mark Abraham over 2 years ago
It does feel a bit like the "problem" commit is merely uncovering an existing issue - cherry picking https://gerrit.gromacs.org/#/c/4824/2 onto the above commit makes the issue go away.
My theory at the moment is that the modified lists are OK (the virial is right, so the forces must be right), but that something gets messed up when work happens on multiple OpenMP threads. Yet I don't think OpenMP does anything with energies once the D2H transfer is done. If it does, then perhaps I can hack the build system to enable GMX_EMULATE_GPU in a build that doesn't link CUDA, and then use some of the Sanitizer / valgrind / ddt tools to find an issue. (A previous attempt to use CUDA with our TSan build type failed to link, even after working around some issues. Googling doesn't suggest CUDA + sanitizer is ever a thing. I tried DDT memory debugging, but with CUDA it needs a driver version that matches it (5.5), the machines I've looked at have 7.0 installed, and that's a problem I didn't try to solve today.)
#9 Updated by Berk Hess over 2 years ago
I reproduced the empty list issue in 5.0, which means it's probably also present in 4.6. But the results are all correct. This is probably (nearly) always the case, otherwise someone would have caught the bug. Still, the fix should be backported to 4.6 and 5.0.