mdrun crash with high density of particles and SD integrator
See the attached tpr (split in two parts because it's 90 MB compressed).
#2 Updated by Vedran Miletic over 2 years ago
Thank you for the quick response. Unfortunately, the halved example doesn't crash. I don't use MPI or a GPU, and the crash is reproducible on multiple machines. One example:
Running on 1 node with total 8 cores, 8 logical cores, 0 compatible GPUs
Brand: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
SIMD instructions most likely to fit this hardware: AVX2_256
SIMD instructions selected at GROMACS compile time: AVX2_256
#5 Updated by Berk Hess over 2 years ago
- Status changed from New to Feedback wanted
- Assignee set to Berk Hess
This could indeed be an integer overflow in the pair list.
If so, the system will likely run with domain decomposition, which is probably also faster because the ordering of particles improves cache hits. Could you try with -ntmpi 2? You can also try -ntmpi 4 and 8 to see which is fastest.
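To put a rough number on the overflow suspicion, here is a minimal sketch that estimates how many pairs fall within the cut-off for a uniform system and compares that with the largest signed 32-bit integer. It assumes the pair list uses a 32-bit counter or index, which is only a guess for illustration and not checked against the GROMACS source. The atom count, number density and cut-off in the usage lines are hypothetical values, not taken from the attached tpr.

```python
import math

# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1


def estimated_pair_count(n_atoms, number_density, cutoff):
    """Approximate number of unique pairs within `cutoff` in a uniform system.

    number_density is in atoms per nm^3 and cutoff in nm. Each atom sees
    roughly density * (4/3) * pi * cutoff^3 neighbours; dividing by two
    avoids counting every pair twice.
    """
    neighbours_per_atom = number_density * (4.0 / 3.0) * math.pi * cutoff**3
    return int(n_atoms * neighbours_per_atom / 2)


def overflows_int32(count):
    """True if `count` does not fit in a signed 32-bit integer."""
    return count > INT32_MAX


if __name__ == "__main__":
    # Hypothetical inputs: 4 million atoms, 1.0 nm cut-off, two densities.
    for density in (100.0, 1000.0):
        pairs = estimated_pair_count(4_000_000, density, 1.0)
        print(density, pairs, overflows_int32(pairs))
```

With domain decomposition each rank builds a list for its own domain only, so the per-list count drops roughly with the number of ranks. That is why -ntmpi 2 or more would sidestep an overflow of this kind.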
#6 Updated by Berk Hess over 2 years ago
- Status changed from Feedback wanted to In Progress
I ran -ntmpi 2 and 4 myself. All runs crash with an atom flying away:
Atom 3595214 moved more than the distance allowed by the domain decomposition (125.000000) in direction X
distance out of cell 403.997559
New coordinates: 528.998 495.989 98.298
CPU runs hang at step 40, the second domain decomposition step.
So my first guess is that your setup is unstable.
#7 Updated by Berk Hess over 2 years ago
- Status changed from In Progress to Rejected
Have you even looked at the energy output at step 0? I get:
Large VCM: 505.20956, -0.00001, -0.00002, Temp-cm: 1.657
           Bond          Angle        LJ (SR)   Coulomb (SR)      Potential
    9.91842e+05    8.50307e+06    1.09500e+19    0.00000e+00    1.09500e+19
    Kinetic En.   Total Energy    Temperature Pressure (bar)
    2.22746e+35    2.22746e+35    3.19730e+30    1.97268e+28
So your initial setup seems to have atom overlap.
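One way to check a starting structure for overlap before running mdrun is to look for atom pairs that sit closer than some minimum distance. The following is a minimal sketch only: it takes plain coordinate tuples in nm, ignores periodic boundaries, and uses a hypothetical function name and threshold. It is not a GROMACS tool, and reading coordinates out of the structure file is left out.

```python
import itertools
import math
from collections import defaultdict


def find_close_pairs(coords, min_dist):
    """Return sorted (i, j, distance) tuples for pairs closer than min_dist.

    Atoms are binned into cubic cells with edge length min_dist, so only
    the 27 surrounding cells have to be searched for each atom.
    Periodic boundary conditions are NOT applied (simplifying assumption).
    """
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(coords):
        key = (
            math.floor(x / min_dist),
            math.floor(y / min_dist),
            math.floor(z / min_dist),
        )
        cells[key].append(idx)

    offsets = list(itertools.product((-1, 0, 1), repeat=3))
    close = []
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in offsets:
            # .get() avoids creating empty cells while iterating.
            neighbours = cells.get((cx + dx, cy + dy, cz + dz))
            if not neighbours:
                continue
            for i in members:
                for j in neighbours:
                    if i < j:  # report each pair once
                        d = math.dist(coords[i], coords[j])
                        if d < min_dist:
                            close.append((i, j, d))
    return sorted(close)


if __name__ == "__main__":
    # Hypothetical coordinates in nm; atoms 0 and 1 nearly coincide.
    demo = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (1.0, 1.0, 1.0)]
    for i, j, d in find_close_pairs(demo, 0.1):
        print(f"atoms {i} and {j} are {d:.3f} nm apart")
```

Any pair reported at a small fraction of the LJ diameter would be consistent with an LJ (SR) energy of the order seen at step 0, and fixing the packing or energy-minimizing first would be the place to start.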