Project

General

Profile

Bug #1767

Coulomb energy sometimes wrong in complex/nbnxn_vsite

Added by Mark Abraham almost 2 years ago. Updated over 1 year ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized
Close

Description

Found an issue while testing today on tcbs26 (2 titanx, 36 core) - coulomb(sr) only had a wrong value at step 20. Nobody had any bright ideas, so I did a regression bisection. The first bad commit is 6106367bd902e228c32cfcf7e079f3fa6493c15a (intra-gpu load balancing)

For a diagnostic, I was using complex/nbnxn_vsite. Step 0 is fine, step 20 shows Coulomb (SR) about 4500kj/mol lower than reference. This happens with both release and debug build.

build=debug; export OMP_NUM_THREADS=5; ~/git/r51/build-cmake-gcc-gpu-$build-tcbs26/bin/gmx mdrun -s reference_s -ntmpi 6 -gpu_id 000111 -notunepme -nsteps 20 -v -nocopyright && gmx check -quiet -e reference_s -e2 ener

Using a single OpenMP thread is OK always.

This .tpr can run with only 1,2,3,4,6 ranks. Only 6 ranks shows the issue, and does so seemingly regardless of how you fill -gpu_id (all 0 and mix of 01 both show the problem).

Associated revisions

Revision 466e3633 (diff)
Added by Berk Hess almost 2 years ago

Fix bug in GPU list balancing

The function split_sci_entry could produce empty lists, which can
cause illegal memory access or incorrect energies. Before commit
6106367b this bug was never triggered, since nsp_max was never smaller
than a full cj4 entry. But 6106367b introduced a but that could
produce negative nsp_max.

Fixes #1767.

Change-Id: I2007cf6851f94f4f2ca62f609a0628725014dbe7

Revision eea54d01 (diff)
Added by Berk Hess almost 2 years ago

Fix bug in GPU list balancing

The function split_sci_entry could produce empty lists. This seems
not to have caused incorrect results, only slight extra processing
of empty workunits in the CUDA kernel. Incorrect Coulomb energies
could appear for empty lists with shift=CENTRAL, but that does not
seem to happen.

Refs #1767.

Change-Id: I0b0ff0a450734d4863f1e9636ff5741d4f1a68da

History

#1 Updated by Szilárd Páll almost 2 years ago

Mark Abraham wrote:

Found an issue while testing today on tcbs26 (2 titanx, 36 core) - coulomb(sr) only had a wrong value at step 20. Nobody had any bright ideas, so I did a regression.

You mean bisection, right?

#2 Updated by Mark Abraham almost 2 years ago

  • Description updated (diff)

#3 Updated by Mark Abraham almost 2 years ago

changing gpu_min_ci_balanced_factor back to 40 doesn't help, as expected.

#4 Updated by Mark Abraham almost 2 years ago

export GMX_NB_MIN_CI=1 seems to fix it

#5 Updated by Mark Abraham almost 2 years ago

It does feel a bit like the "problem" commit is merely uncovering an existing issue - cherry picking https://gerrit.gromacs.org/#/c/4824/2 onto the above commit makes the issue go away.

My theory at the moment is that the modified lists are ok (since the forces are right because the virial is right), but somehow getting messed up when something happens on multiple OpenMP threads. Yet I don't think OpenMP does anything with energies once the D2H transfer is done. If it does, then perhaps if I hack the build system to enable GMX_EMULATE_GPU in a build that doesn't link CUDA then I can use some of the Sanitizer / valgrind / ddt tools to find an issue. (Previous attempt to use CUDA with our TSan build type failed to link, even after working around some issues. Googling doesn't suggest CUDA + sanitizer is ever a thing. I tried DDT memory debugging, but with CUDA it needs an driver version that matches it (5.5) and machines I've looked at have 7.0 installed, and that's a problem I didn't try to solve today.)

#6 Updated by Gerrit Code Review Bot almost 2 years ago

Gerrit received a related patchset '1' for Issue #1767.
Uploader: Berk Hess ()
Change-Id: I2007cf6851f94f4f2ca62f609a0628725014dbe7
Gerrit URL: https://gerrit.gromacs.org/4840

#7 Updated by Berk Hess almost 2 years ago

  • Status changed from New to Fix uploaded
  • Assignee set to Berk Hess

#8 Updated by Mark Abraham almost 2 years ago

So it wasn't an existing issue - we found a new way to generate an empty list, which is invalid. We might assert on that if we did a special loop before exiting the search code (in debug mode), or maybe in the C reference GPU kernels?

#9 Updated by Berk Hess almost 2 years ago

I reproduced the empty list issue in 5.0, which means it's probably also present in 4.6. But the results are all correct. This is probably (nearly) always the case, otherwise someone would have caught the bug. Still the fix should be backported to 4.6 and 5.0.

#10 Updated by Mark Abraham almost 2 years ago

OK. Michael has reportedly finished a bug fix for md-vv / fe or something, so we can ship the last 4.6 with this fix also. I can probably work out how to do it if you don't have time.

#11 Updated by Gerrit Code Review Bot almost 2 years ago

Gerrit received a related patchset '1' for Issue #1767.
Uploader: Berk Hess ()
Change-Id: I0b0ff0a450734d4863f1e9636ff5741d4f1a68da
Gerrit URL: https://gerrit.gromacs.org/4842

#12 Updated by Erik Lindahl almost 2 years ago

  • Status changed from Fix uploaded to Resolved

#13 Updated by Erik Lindahl almost 2 years ago

  • Status changed from Resolved to Closed

#14 Updated by Erik Lindahl almost 2 years ago

  • Status changed from Closed to Fix uploaded

#15 Updated by Mark Abraham almost 2 years ago

Pre-fix, I could indeed observe empty lists in nbnxn_vsite, but they occurred with both shift 22 (central) and 7. They went away with the fix.

#16 Updated by Mark Abraham almost 2 years ago

  • Target version changed from 5.1 to 4.6.8

This is fixed already in 5.1-rc1

#17 Updated by Gerrit Code Review Bot almost 2 years ago

Incorrect linking of draft (why the hell are we linking drafts anyway?!)

#18 Updated by Mark Abraham almost 2 years ago

Gerrit Code Review Bot wrote:

Incorrect linking of draft (why the hell are we linking drafts anyway?!)

We link drafts so that people know some relevant work may have happened and we don't waste each other's time.

#19 Updated by Mark Abraham over 1 year ago

  • Status changed from Fix uploaded to Resolved

#20 Updated by Mark Abraham over 1 year ago

  • Status changed from Resolved to Closed

Also available in: Atom PDF