Bug #2333

mdrun crash with high density of particles and SD integrator

Added by Vedran Miletic about 2 years ago. Updated almost 2 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: mdrun
Target version: -
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

See the attached tpr (split in two parts because it's 90 MB compressed).

sd.tar.bz2.aa (45.8 MB) sd.tar.bz2.aa Vedran Miletic, 12/06/2017 01:37 PM
sd.tar.bz2.ab (40.1 MB) sd.tar.bz2.ab Vedran Miletic, 12/06/2017 01:38 PM

Associated revisions

Revision ea45ba98 (diff)
Added by Berk Hess almost 2 years ago

Check for large energy at first step

Also added step number to fatal error message.

Fixes #2333

Change-Id: I6e8aa1fac3a3c9a358b4046de5c8a3547ae14b15
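The fix described above can be sketched as a simple threshold check on the step-0 potential energy. This is a hedged illustration, not the actual GROMACS code: the function name and the per-atom threshold are hypothetical, chosen only to show the idea of failing fast with a step number in the message instead of tripping an assertion later.

```cpp
#include <cmath>
#include <stdexcept>
#include <string>

// Hypothetical sketch: abort with a clear fatal error (including the step
// number) when the potential energy is absurdly large, instead of letting
// the integrator produce NaNs or hit an assertion failure later on.
void checkPotentialEnergy(double epot, int numAtoms, long step)
{
    // Illustrative threshold: a physically sane system stays within a few
    // hundred kJ/mol per atom, so 1e6 kJ/mol per atom signals overlapping
    // atoms or a broken setup.
    const double maxEpotPerAtom = 1.0e6; // kJ/mol, illustrative value

    if (!std::isfinite(epot) || std::fabs(epot) > maxEpotPerAtom * numAtoms)
    {
        throw std::runtime_error(
            "Step " + std::to_string(step) +
            ": potential energy " + std::to_string(epot) +
            " kJ/mol is extremely large; check for overlapping atoms");
    }
}
```

With the energies reported later in this issue (on the order of 1e19 kJ/mol at step 0), such a check would fire immediately at the first step.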

History

#1 Updated by Paul Bauer about 2 years ago

Hello, could you provide a shorter example, as well as the details on how you ran the simulation (e.g. number of ranks, GPU usage, ...).
I'll try to reproduce this in the meantime, but more information would definitely help.
Thank you!

#2 Updated by Vedran Miletic about 2 years ago

Thank you for the quick response. Unfortunately, a halved example doesn't crash. I don't use MPI or GPUs, and this is reproducible on multiple machines. One example:

Running on 1 node with total 8 cores, 8 logical cores, 0 compatible GPUs
Hardware detected:
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
SIMD instructions most likely to fit this hardware: AVX2_256
SIMD instructions selected at GROMACS compile time: AVX2_256

#3 Updated by Paul Bauer about 2 years ago

Confirmed with gcc-4.8 and cuda-7.5. Looks like an integer overflow to me.

#4 Updated by Paul Bauer about 2 years ago

Some more questions here. Does the bug happen with different combinations of integrator/time step/thermostat?
Also, could you provide me with the files needed to generate the tpr file? So I can test the different combinations?
Thanks!

#5 Updated by Berk Hess about 2 years ago

  • Status changed from New to Feedback wanted
  • Assignee set to Berk Hess

This could indeed be an integer overflow in the pair list.
The system will then likely run with domain decomposition, which is probably also faster because the reordering of particles improves cache hits. Could you try with -ntmpi 2? You can also try -ntmpi 4 and 8 and see which is fastest.

#6 Updated by Berk Hess about 2 years ago

  • Status changed from Feedback wanted to In Progress

I ran -ntmpi 2 and 4 myself. All runs crash with an atom flying away:
Atom 3595214 moved more than the distance allowed by the domain decomposition (125.000000) in direction X
distance out of cell 403.997559
New coordinates: 528.998 495.989 98.298

CPU runs hang at step 40, the second domain decomposition step.
So my first guess is that your setup is unstable.

#7 Updated by Berk Hess about 2 years ago

  • Status changed from In Progress to Rejected

Have you even looked at the energy output at step 0? I get:
Large VCM: 505.20956, -0.00001, -0.00002, Temp-cm: 1.65737e+07
Energies (kJ/mol)
Bond Angle LJ (SR) Coulomb (SR) Potential
9.91842e+05 8.50307e+06 1.09500e+19 0.00000e+00 1.09500e+19
Kinetic En. Total Energy Temperature Pressure (bar)
2.22746e+35 2.22746e+35 3.19730e+30 1.97268e+28

So your initial setup seems to have atom overlap.

#8 Updated by Berk Hess almost 2 years ago

  • Status changed from Rejected to Fix uploaded

Reopened because I uploaded a "fix" that checks for large energies at step 0 and gives a fatal error on this system instead of an assertion failure.

#9 Updated by Gerrit Code Review Bot almost 2 years ago

Gerrit received a related patchset '1' for Issue #2333.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2018~I6e8aa1fac3a3c9a358b4046de5c8a3547ae14b15
Gerrit URL: https://gerrit.gromacs.org/7325

#10 Updated by Berk Hess almost 2 years ago

  • Status changed from Fix uploaded to Resolved

#11 Updated by Erik Lindahl almost 2 years ago

  • Status changed from Resolved to Closed
