Bug #3246

GPU code misses settle error check, simulation crashes with segfault without any further output

Added by Christian Blau 7 months ago. Updated 7 months ago.

Status: Accepted
Priority: Normal
Assignee:
Category: mdrun
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Running the 216 SPC water box on a single GPU with cut-off electrostatics segfaults without any other warning.

A stack trace of the core dump is attached.

grompp.mdp (42 Bytes), Christian Blau, 12/17/2019 10:51 AM
ener.edr (12.8 KB), Christian Blau, 12/17/2019 10:51 AM
md.log (24.3 KB), Christian Blau, 12/17/2019 10:51 AM
spc216.gro (28.6 KB), Christian Blau, 12/17/2019 10:51 AM
state.cpt (16.4 KB), Christian Blau, 12/17/2019 10:51 AM
topol.top (1016 Bytes), Christian Blau, 12/17/2019 10:51 AM
topol.tpr (74.4 KB), Christian Blau, 12/17/2019 10:51 AM
stacktrace.png (52.6 KB), Christian Blau, 12/17/2019 10:55 AM
compile_commands.json (1.06 MB), Christian Blau, 12/17/2019 10:56 AM

History

#1 Updated by Berk Hess 7 months ago

  • Status changed from New to Rejected

You are running NVE with a plain Coulomb cut-off at 0.4 nm, so of course the energy goes up and the system explodes.
grompp prints a note about using plain cut-off. Maybe this should be a warning?
This runs fine with reaction-field.

#2 Updated by Christian Blau 7 months ago

The issue here is that it crashes without any information, whereas in GROMACS 2018.6, for example, I get

step 30096: One or more water molecules can not be settled.
Check for bad contacts and/or reduce the timestep if appropriate.
Wrote pdb files with previous and current coordinates

so at least the pdb files with the respective coordinates are written.

#3 Updated by Berk Hess 7 months ago

  • Category set to mdrun
  • Status changed from Rejected to Accepted
  • Assignee set to Artem Zhmurov

Ah, so the settle error check is missing in the GPU code.

#4 Updated by Artem Zhmurov 7 months ago

Does anyone know a standard solution for checking for errors inside a CUDA kernel? I can do a conditional atomicSet(..), or conditionally set a variable in host memory directly from the kernel, so that it can be checked in the host code. But I don't think that this is a proper solution.
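
The first option mentioned in this comment, a flag set conditionally from inside the kernel and checked on the host, could look roughly like the sketch below. It is an illustration only, not the change later uploaded to Gerrit. The names `checkValuesKernel` and `valuesAreValid`, and the "value must be positive" test, are hypothetical stand-ins for the SETTLE kernel and its failure condition. CUDA has no `atomicSet`; the sketch assumes `atomicExch` is an acceptable substitute.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for a constraint kernel: each thread inspects one
// value and raises a shared flag in device memory if the value is unusable.
__global__ void checkValuesKernel(const float* values, int n, int* errorFlag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // !(x > 0) is also true for NaN, so exploded coordinates are caught too.
    if (i < n && !(values[i] > 0.0f))
    {
        // Every failing thread stores the same value, so the order is irrelevant.
        atomicExch(errorFlag, 1);
    }
}

// Returns true only if all CUDA calls succeeded and no thread raised the flag.
bool valuesAreValid(const std::vector<float>& hostValues)
{
    const int n = static_cast<int>(hostValues.size());
    if (n == 0)
    {
        return true;
    }
    float*       dValues = nullptr;
    int*         dFlag   = nullptr;
    int          hFlag   = 0;
    const size_t bytes   = n * sizeof(float);

    bool ok = cudaMalloc(&dValues, bytes) == cudaSuccess
              && cudaMalloc(&dFlag, sizeof(int)) == cudaSuccess
              && cudaMemcpy(dValues, hostValues.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
              && cudaMemset(dFlag, 0, sizeof(int)) == cudaSuccess;
    if (ok)
    {
        const int threads = 128;
        const int blocks  = (n + threads - 1) / threads;
        checkValuesKernel<<<blocks, threads>>>(dValues, n, dFlag);
        // This copy runs on the default stream and blocks the host until the
        // kernel has finished, so hFlag reflects the completed kernel.
        ok = cudaMemcpy(&hFlag, dFlag, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dValues); // cudaFree(nullptr) is a no-op
    cudaFree(dFlag);
    return ok && hFlag == 0;
}
```

On a machine with a working CUDA device, `valuesAreValid({1.0f, 0.5f})` should return true and `valuesAreValid({1.0f, -1.0f})` false. The blocking copy is what makes the flag trustworthy, and it also stalls the host until the kernel is done.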

#5 Updated by Christian Blau 7 months ago

  • Subject changed from Small water box simulation crashes with segfault to GPU code misses settle error check, simulation crashes with segfault without any further output

#6 Updated by Artem Zhmurov 7 months ago

@Alan, do you know of a way to inform the CPU code that there was an exception in one of the threads?

#7 Updated by Artem Zhmurov 7 months ago

Christian, can you check whether this fix works for you: https://gerrit.gromacs.org/#/c/gromacs/+/14879/

#8 Updated by Paul Bauer 7 months ago

@Artem and @Christian, will this go in for the rc or should I bump?

#9 Updated by Artem Zhmurov 7 months ago

  • Target version changed from 2020-rc1 to 2021-infrastructure-stable

I bumped it. Although a fix has been uploaded, it is not reliable enough. I don't think there is a reliable solution to this problem, though, because of the asynchronicity.
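
One possible reading of the asynchronicity concern is that a kernel launch returns to the host immediately, so a flag written by the kernel is only meaningful after the host has synchronized with the stream. The sketch below illustrates this with a flag in mapped pinned host memory, the second option raised earlier in the thread. It is not the uploaded fix: `raiseFlagKernel` and `flagAfterSync` are hypothetical names, and the code assumes a device that supports mapped host memory.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one thread writes directly into mapped host memory.
__global__ void raiseFlagKernel(volatile int* mappedFlag, int shouldFail)
{
    if (blockIdx.x == 0 && threadIdx.x == 0 && shouldFail)
    {
        *mappedFlag = 1;
        __threadfence_system(); // order the write with respect to the host
    }
}

// Returns the flag value observed after synchronization, or -1 if a CUDA
// call failed.
int flagAfterSync(bool shouldFail)
{
    int*         hFlag  = nullptr;
    int*         dFlag  = nullptr;
    cudaStream_t stream = nullptr;
    int          result = -1;

    if (cudaHostAlloc(reinterpret_cast<void**>(&hFlag), sizeof(int), cudaHostAllocMapped)
        != cudaSuccess)
    {
        return -1;
    }
    *hFlag = 0;
    if (cudaHostGetDevicePointer(reinterpret_cast<void**>(&dFlag), hFlag, 0) == cudaSuccess
        && cudaStreamCreate(&stream) == cudaSuccess)
    {
        raiseFlagKernel<<<1, 32, 0, stream>>>(dFlag, shouldFail ? 1 : 0);
        // The launch has returned, but the kernel may not have run yet.
        // Reading *hFlag at this point would race with the kernel.
        if (cudaStreamSynchronize(stream) == cudaSuccess)
        {
            result = *hFlag; // safe: the kernel has completed
        }
        cudaStreamDestroy(stream);
    }
    cudaFreeHost(hFlag);
    return result;
}
```

The trade-off this makes visible: checking the flag every step requires a synchronization every step, while checking it without synchronization can miss an error that has not been written yet.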
