Bug #3246
GPU code misses settle error check, simulation crashes with segfault without any further output
Description
Running the 216 SPC water box on a single GPU with cut-off electrostatics segfaults without any other warning.
Find attached a stack-trace of the core dump.
History
#1 Updated by Berk Hess about 1 year ago
- Status changed from New to Rejected
You are running NVE with a plain Coulomb cut-off at 0.4 nm, so of course the energy goes up and the system explodes.
grompp prints a note about using plain cut-off. Maybe this should be a warning?
This runs fine with reaction-field.
#2 Updated by Christian Blau about 1 year ago
The issue here is that it just crashes without any information, whereas in GROMACS 2018.6, for example, I get
step 30096: One or more water molecules can not be settled.
Check for bad contacts and/or reduce the timestep if appropriate.
Wrote pdb files with previous and current coordinates
and at least get the pdb files with the respective coordinates.
#3 Updated by Berk Hess about 1 year ago
- Category set to mdrun
- Status changed from Rejected to Accepted
- Assignee set to Artem Zhmurov
Ah, so the settle error check is missing in the GPU code.
#4 Updated by Artem Zhmurov about 1 year ago
Does anyone know a standard solution for checking for errors inside a CUDA kernel? I can do a conditional atomicSet(..), or conditionally set a variable in host memory directly from the kernel so that it can be checked in the host code. But I don't think that this is a proper solution.
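For illustration, the "set a flag conditionally, check it on the host" idea can be sketched on the CPU. This is a minimal host-side analogue in plain C++, not the GROMACS implementation and not the fix discussed in this thread: `std::thread` workers stand in for GPU threads, `std::atomic<int>` stands in for a device-side atomic (in CUDA this would presumably be something like `atomicExch` or `atomicOr` on a flag in device memory), and `settleOneMolecule`, `settleWorker` and `settleAll` are hypothetical names made up for this example.

```cpp
#include <algorithm>
#include <atomic>
#include <cmath>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a per-molecule constraint update. It returns
// false for non-finite input, mimicking a molecule that cannot be settled.
static bool settleOneMolecule(double& x)
{
    if (!std::isfinite(x))
    {
        return false;
    }
    x *= 0.5; // placeholder for the real constraint math
    return true;
}

// Worker standing in for a group of GPU threads: processes [begin, end) and
// raises the shared flag on failure. The flag is only ever set, never
// cleared, so concurrent writers cannot lose an error.
static void settleWorker(std::vector<double>& coords,
                         std::size_t          begin,
                         std::size_t          end,
                         std::atomic<int>&    errorFlag)
{
    for (std::size_t i = begin; i < end; ++i)
    {
        if (!settleOneMolecule(coords[i]))
        {
            errorFlag.store(1, std::memory_order_relaxed);
        }
    }
}

// Returns true when every entry was settled without error.
static bool settleAll(std::vector<double>& coords, unsigned numWorkers)
{
    if (numWorkers == 0)
    {
        numWorkers = 1;
    }
    std::atomic<int>         errorFlag{ 0 };
    std::vector<std::thread> workers;
    const std::size_t        chunk = (coords.size() + numWorkers - 1) / numWorkers;
    for (unsigned w = 0; w < numWorkers; ++w)
    {
        const std::size_t begin = std::min(coords.size(), w * chunk);
        const std::size_t end   = std::min(coords.size(), begin + chunk);
        workers.emplace_back(settleWorker, std::ref(coords), begin, end, std::ref(errorFlag));
    }
    // The flag is only meaningful after all workers have finished; on a GPU
    // this corresponds to synchronizing with the kernel before reading it.
    for (auto& t : workers)
    {
        t.join();
    }
    return errorFlag.load() == 0;
}
```

A caller would invoke `settleAll(coords, n)` once per step and report an error when it returns false. Note that the flag can only be trusted after the join, so the host has to wait for the workers before it can react.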
#5 Updated by Christian Blau about 1 year ago
- Subject changed from Small water box simulation crashes with segfault to GPU code misses settle error check, simulation crashes with segfault without any further output
#6 Updated by Artem Zhmurov about 1 year ago
@Alan, do you know of a way to inform the CPU code that there was an exception in one of the threads?
#7 Updated by Artem Zhmurov about 1 year ago
Christian, can you check whether this fix works for you: https://gerrit.gromacs.org/#/c/gromacs/+/14879/
#8 Updated by Paul Bauer about 1 year ago
@Artem and @Christian, will this go in for the rc or should I bump?
#9 Updated by Artem Zhmurov about 1 year ago
- Target version changed from 2020-rc1 to 2021-infrastructure-stable
I bumped it. Although there is a fix uploaded, it is not reliable enough. I don't think there is a reliable solution to this problem, though, because of the asynchronicity.