Bug #1734

CUDA launch failure with empty domains

Added by Berk Hess almost 2 years ago. Updated almost 2 years ago.

Status: Closed
Priority: High
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

The fix for #1721 introduced CUDA kernel calls with no work, which seems to lead to mdrun termination with a CUDA launch failure. Obviously, had I known this wasn't allowed, I wouldn't have done this.
Somehow our release-5-0 Jenkins test didn't catch this, whereas we now get many errors with Jenkins master.
This should be fixed first in release-4-6.
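
For readers unfamiliar with the failure mode, here is a minimal host-side sketch of the kind of guard that avoids launching a kernel with no work. It is not the actual GROMACS fix: `launchIfNonEmpty` and its callback are hypothetical names, and the callback stands in for a real `kernel<<<grid, block>>>(...)` launch so that the sketch compiles without the CUDA toolkit.

```cpp
#include <functional>

// Hypothetical host-side guard (not GROMACS code): invoke the launch
// callback only when there is work to do. In real CUDA code the callback
// would wrap the kernel<<<grid, block>>>(...) launch.
// Returns true if the launch callback was invoked.
bool launchIfNonEmpty(int numWorkUnits, const std::function<void(int)>& launch)
{
    if (numWorkUnits <= 0)
    {
        // Empty domain: skip the launch instead of requesting a grid of size 0.
        return false;
    }
    launch(numWorkUnits);
    return true;
}
```

Calling `launchIfNonEmpty(0, f)` returns `false` without invoking `f`, while `launchIfNonEmpty(128, f)` invokes `f(128)` and returns `true`.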

Associated revisions

Revision 56a67d7c (diff)
Added by Berk Hess almost 2 years ago

Fixed CUDA error with empty domains

Recent commit fc8a5624 introduced empty CUDA kernel calls when there
are empty domains. This seems not to be allowed by CUDA (we get errors).
Fixed #1734. Refs #1721.

Change-Id: Ifd32a55c8d6756c93a0fcaba29983ae326abc569

Revision aff6e44b (diff)
Added by Szilárd Páll almost 2 years ago

Assert the size of non-bonded GPU work units

Grid blocks of size <= 0 lead to kernel launch failures, which have
caused issues before.

Refs #1734

Change-Id: I4e914bcf3168f7268dab64b69d25bf34fb6c85c9
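
As an illustration of the idea behind this revision, the following sketch checks the work size before computing the grid size. It is an assumption-laden example rather than the committed code: `checkedNumGridBlocks` and its parameters are hypothetical, and it throws an exception instead of using an assertion macro so that the failure path can be exercised in a test.

```cpp
#include <stdexcept>

// Hypothetical helper (not GROMACS code): number of grid blocks needed to
// cover numWorkUnits when each block handles unitsPerBlock work units.
// Rejects non-positive sizes up front instead of producing a grid size <= 0
// that would only fail later, at kernel launch.
int checkedNumGridBlocks(int numWorkUnits, int unitsPerBlock)
{
    if (numWorkUnits <= 0 || unitsPerBlock <= 0)
    {
        throw std::invalid_argument("GPU work size must be positive");
    }
    // Integer division rounded up, so a partial block still gets launched.
    return (numWorkUnits + unitsPerBlock - 1) / unitsPerBlock;
}
```

With these definitions, `checkedNumGridBlocks(65, 64)` returns 2, and `checkedNumGridBlocks(0, 64)` throws, which surfaces the problem at the call site rather than as a launch failure.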

Revision 011ec764 (diff)
Added by Mark Abraham almost 2 years ago

Add integration test for empty domain

This box is large enough that the default 2D DD will contain an
empty domain (at least initially).

Current Jenkins will run this with two domains when run with real MPI.
Extended the CMakery to make this also work with thread-MPI.

Refs #1734

Change-Id: I7b7269cdd1faaa562afeb7a5dd3f75fc19ceff85

History

#1 Updated by Mark Abraham almost 2 years ago

Yeah, in the car this afternoon Szilard thought of much the same thing as a theory about the reason for the instability of the CUDA slaves. :-(

#2 Updated by Gerrit Code Review Bot almost 2 years ago

Gerrit received a related patchset '1' for Issue #1734.
Uploader: Berk Hess
Change-Id: Ifd32a55c8d6756c93a0fcaba29983ae326abc569
Gerrit URL: https://gerrit.gromacs.org/4583

#3 Updated by Mark Abraham almost 2 years ago

I double-checked, and both the 4.6->5.0 and 5.0->master merges passed verify without triggering the issue, which was somewhat unlucky, because an error on those commits would have been a red flag.

It does show that there's no assertion or test too stupid to have... Relying on calling code not to call with zero, or assuming that the kernel will work when called with zero, is too risky. For example, it is now obvious that we should have a test that deliberately runs an empty box...

#4 Updated by Berk Hess almost 2 years ago

  • Status changed from New to Fix uploaded
  • Target version changed from 5.0.6 to 4.6.8

#5 Updated by Gerrit Code Review Bot almost 2 years ago

Gerrit received a related patchset '2' for Issue #1734.
Uploader: Szilárd Páll
Change-Id: I4e914bcf3168f7268dab64b69d25bf34fb6c85c9
Gerrit URL: https://gerrit.gromacs.org/4592

#6 Updated by Szilárd Páll almost 2 years ago

I've checked why we did not trigger the bug in 4.6/5.0. The LJPME test in both 5.0 and master contains a lipid and a water molecule in a big box, and it is the one that mostly triggered it on master.

While I could not find the cause of the different behavior on master vs 5.0, what I realized is that, yet again, the ridiculously limited range of parallelization setups tested is one of the major contributors to not triggering the bug. We run with at most 2-3-way DD, and with so few domains the long lipid molecule will probably span all of them. Both 2D and 3D decomposition tests (the lack of which has already caused pain before) would have caught the bug.
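
To make the point about decomposition width concrete, here is a toy sketch that counts empty domains for a few atoms in a large box. It assumes a uniform 2D grid of cells, which is a simplification of what mdrun does; `Point2D` and `countEmptyDomains` are hypothetical names introduced only for this example.

```cpp
#include <vector>

struct Point2D
{
    double x;
    double y;
};

// Toy model (not the GROMACS DD algorithm): split the box [0, boxX) x [0, boxY)
// into a uniform nx-by-ny grid and count the cells that contain no atoms.
int countEmptyDomains(const std::vector<Point2D>& atoms,
                      double boxX, double boxY, int nx, int ny)
{
    std::vector<int> counts(static_cast<std::size_t>(nx) * ny, 0);
    for (const Point2D& a : atoms)
    {
        int ix = static_cast<int>(a.x / boxX * nx);
        int iy = static_cast<int>(a.y / boxY * ny);
        // Clamp so that atoms on or outside the box edge land in a boundary cell.
        if (ix < 0) { ix = 0; }
        if (ix >= nx) { ix = nx - 1; }
        if (iy < 0) { iy = 0; }
        if (iy >= ny) { iy = ny - 1; }
        ++counts[static_cast<std::size_t>(iy) * nx + ix];
    }
    int numEmpty = 0;
    for (int c : counts)
    {
        if (c == 0)
        {
            ++numEmpty;
        }
    }
    return numEmpty;
}
```

For three atoms at (4, 5), (5, 5) and (6, 5) in a 10 x 10 box, a 2 x 1 grid has no empty domains, whereas a 4 x 4 grid leaves 14 of its 16 domains empty. In this toy model, only the wider decomposition would exercise the empty-domain code path.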

This problem will not solve itself, so at some point setting up new tests and revamping the Gerrit configs/trigger script to test parallel runs needs to become a priority. :-/

#7 Updated by Gerrit Code Review Bot almost 2 years ago

Gerrit received a related patchset '1' for Issue #1734.
Uploader: Mark Abraham
Change-Id: I7b7269cdd1faaa562afeb7a5dd3f75fc19ceff85
Gerrit URL: https://gerrit.gromacs.org/4597

#8 Updated by Berk Hess almost 2 years ago

  • Status changed from Fix uploaded to Closed

Fixed for 4.6.8, merged into release-5.0 (for 5.0.6) and master.
