CUDA launch failure with empty domains
The fix for #1721 introduced CUDA kernel calls with no work, which seems to lead to mdrun terminating with a CUDA launch failure. Obviously, if I had known this wasn't allowed, I wouldn't have done this.
Somehow our release-5-0 Jenkins test didn't catch this, whereas we now get many errors with Jenkins master.
This should be fixed first in release-4-6.
Assert the size of non-bonded GPU work units
Grid blocks of size <= 0 lead to kernel launch failures, which have
caused issues before.
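
As a rough host-side illustration of the kind of check this commit message describes, the sketch below computes a grid size from a work count, asserts that the inputs are sane, and skips the launch when there is no work. The names `WorkUnit`, `numGridBlocks` and `launchNonbonded` are hypothetical and are not taken from the GROMACS source; the actual kernel launch is left as a comment.

```cpp
#include <cassert>

// Hypothetical description of one non-bonded GPU work unit
// (not a GROMACS type; used here for illustration only).
struct WorkUnit
{
    int numItems;      // work items to process; 0 for an empty domain
    int itemsPerBlock; // work items handled by one thread block
};

// Number of grid blocks needed to cover the work, rounding up.
int numGridBlocks(const WorkUnit& work)
{
    assert(work.numItems >= 0);
    assert(work.itemsPerBlock > 0);
    return (work.numItems + work.itemsPerBlock - 1) / work.itemsPerBlock;
}

// Returns true if a kernel would be launched, false if the launch is skipped.
bool launchNonbonded(const WorkUnit& work)
{
    const int numBlocks = numGridBlocks(work);
    if (numBlocks == 0)
    {
        // Empty domain: a zero-sized grid is an invalid launch
        // configuration in CUDA, so skip the launch entirely.
        return false;
    }
    assert(numBlocks > 0);
    // A real implementation would launch here, for example:
    // kernel<<<numBlocks, threadsPerBlock>>>(...);
    return true;
}
```

Whether an empty work unit should be skipped silently or treated as a caller error is a design choice; the sketch skips it and keeps the assertion as a guard on the value that reaches the launch.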
Add integration test for empty domain
This box is large enough that the default 2D DD will contain an
empty domain (at least initially).
Current Jenkins will run this with two domains when run with real MPI.
Extended the CMakery to make this also work with thread-MPI.
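
To illustrate why a small system in a large box yields an empty domain, the sketch below bins atoms into a uniform 2D grid and counts the cells that receive none. It is a simplified picture only: the uniform grid, the `Atom` struct and both function names are assumptions for illustration, and this is not how GROMACS assigns atoms to domains.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical 2D atom position (illustration only).
struct Atom
{
    double x;
    double y;
};

// Counts atoms per cell of a uniform nx-by-ny grid covering
// [0, boxX) x [0, boxY). Cell (ix, iy) is stored at index ix * ny + iy.
std::vector<int> countAtomsPerDomain(const std::vector<Atom>& atoms,
                                     double boxX, double boxY, int nx, int ny)
{
    std::vector<int> counts(nx * ny, 0);
    for (const Atom& a : atoms)
    {
        // Clamp so that positions on or outside the box edge stay in range.
        const int ix = std::min(nx - 1, std::max(0, static_cast<int>(a.x / boxX * nx)));
        const int iy = std::min(ny - 1, std::max(0, static_cast<int>(a.y / boxY * ny)));
        counts[ix * ny + iy]++;
    }
    return counts;
}

// Number of domains that received no atoms.
int numEmptyDomains(const std::vector<int>& counts)
{
    return static_cast<int>(std::count(counts.begin(), counts.end(), 0));
}
```

With a handful of atoms clustered in one corner of a 10 x 10 box and a 2 x 2 grid, every atom lands in the same cell and the other three domains are empty, which is the situation the test is meant to exercise.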
#3 Updated by Mark Abraham about 2 years ago
I double-checked, and both the 4.6->5.0 and 5.0->master merges passed verify without triggering the issue, which was somewhat unlucky, because an error on those commits would have been a red flag.
It does show that there's no assertion or test too stupid to have... Both relying on calling code not to call with zero and assuming that the kernel will work when called with zero are too risky. For example, it is now obvious that we should have a test that deliberately runs an empty box...
#6 Updated by Szilárd Páll about 2 years ago
I've checked why we did not trigger the bug in 4.6/5.0. The LJPME test in both 5.0 and master contains a lipid and a water molecule in a big box, and it is the test that mostly triggered the bug on master.
While I could not find the cause of the different behavior on master vs 5.0, what I realized is that, yet again, the ridiculously limited range of parallelization setups tested is one of the major contributors to not triggering the bug. We run at most with 2-3-way DD, and the long lipid molecule will probably span across such a small number of domains. Both 2D and 3D decomposition tests (the lack of which has already caused pain before) would have caught the bug.
This problem will not solve itself, so at some point setting up new tests and revamping the Gerrit configs/trigger script to test parallel runs needs to become a priority. :-/