Bug #3412
Domain decomposition problems with Gromacs >2018
Description
I am trying to benchmark a number of systems with different Gromacs versions, but encounter problems with the two latest major Gromacs versions (2019.x and 2020.x). All systems in question are equilibrated and have run as production simulations for several microseconds each (with version 2018 or lower). I am encountering two kinds of errors:
Error 1
[...]
Not all bonded interactions have been properly assigned
to the domain decomposition cells

A list of missing interactions:
              Bond of  246240 missing   1286
             Angle of  914085 missing   6160
       Proper Dih. of 1428030 missing  14240
     Improper Dih. of  114615 missing    901
             LJ-14 of 1321920 missing   9255
[...]
Error 2
[...]
NOTE: DLB will not turn on during the first phase of PME tuning

starting mdrun 'Protein in water'
25000000 steps,  50000.0 ps.

-------------------------------------------------------
Program:     gmx mdrun, version 2019.5
Source file: src/gromacs/ewald/pme-redistribute.cpp (line 282)
MPI rank:    17 (out of 20)

Fatal error:
64 particles communicated to PME rank 17 are more than 2/3 times the cut-off
out of the domain decomposition cell of their charge group in dimension y.
This usually means that your system is not well equilibrated.

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Abort(1) on node 17 (rank 17 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 17
In: PMI_Abort(1, application called MPI_Abort(MPI_COMM_WORLD, 1) - process 17)
[...]
The problem appears for systems of various sizes (100k, 200k, 300k, 3.6m atoms), force fields (charmm36, charmm36m, amber99sb-star-ildn-q) and system compositions (a single protein in solution, dense protein solutions, and protein-membrane systems). It does not appear for a number of other systems I am testing that vary the same parameters.
Attached are log files for Gromacs 2018.8 (works), 2019.6 (does not work) and 2020.1 (does not work), as well as an example submission file. 2019.5 and 2020 do not work either. Comparing the logs of 2018.8 and 2019.6 shows that the newer Gromacs versions decrease the domain size and increase the number of domains. In addition, the reported maximum distances for bonded interactions are increased.
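For anyone wanting to repeat the comparison, something along these lines pulls the relevant domain decomposition information out of the two logs (the file names are placeholders and the exact log wording may differ slightly between versions):

grep -i -A 4 "domain decomposition grid" md_2018.8.log md_2019.6.log
grep -i "bonded interactions" md_2018.8.log md_2019.6.log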
I am happy to share a 3.65m atom system that experiences these problems privately via e-mail using Nextcloud.
Things I tried
All tests were run on an in-house cluster using SLURM. The same problems can be reproduced on a different machine (MPCDF cobra cluster).
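For reference, a minimal sketch along the lines of the attached submission file (job name, node count, threads per rank, module name and binary/launcher are placeholders; the 20 MPI ranks match the rank count visible in the error above):

#!/bin/bash
#SBATCH --job-name=md_bench        # placeholder job name
#SBATCH --nodes=1                  # placeholder; actual node count varies per system
#SBATCH --ntasks=20                # 20 MPI ranks, matching "rank 17 (out of 20)" in the error
#SBATCH --cpus-per-task=2          # OpenMP threads per rank; placeholder value
#SBATCH --time=24:00:00            # placeholder walltime

module load gromacs/2019.6         # placeholder module name

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun gmx_mpi mdrun -v -deffnm md   # binary name and launcher depend on the installation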
Use original TPR files
The "original" TPR files were created with either 2016.x or 2018.x and were used to run the actual production simulations. With these TPRs, the simulations run on 2018.x, but fail to start on 2019.x and 2020.x.
Generate new TPRs
I generated a new TPR with each Gromacs version in question (2018.8, 2019.5, 2019.6, 2020, 2020.1) using:
gmx grompp -f md.mdp -c ref.gro -r ref.gro -n index.ndx -p topol.top -o md.tpr
Again, 2018.8 works, but more recent versions do not.
Decrease max. distance of bonded interactions with -rdd
I tried using -rdd to set the maximum distance for bonded interactions in 2019.6 to the value reported in the 2018.8 md.log file. The system crashes with the same error.
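For illustration, the invocation was along these lines (the 1.4 nm value is only a placeholder; the actual number was taken from the maximum bonded-interaction distance reported in the 2018.8 md.log, and the launcher/binary are as in the submission sketch above):

srun gmx_mpi mdrun -v -deffnm md -rdd 1.4   # 1.4 nm is a placeholder value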
Run on a login node
Running interactively on a login node (without queuing system) works!
gmx mdrun -v -deffnm md
Run with 1 MPI process / 1 OpenMP thread
Submission to the queue with only one MPI rank works!
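Concretely, something like this (the exact launcher depends on the build; the thread-MPI equivalent is shown on the second line):

srun -n 1 gmx_mpi mdrun -v -deffnm md -ntomp 1   # single MPI rank, single OpenMP thread
gmx mdrun -v -deffnm md -ntmpi 1 -ntomp 1        # equivalent with a thread-MPI build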
I hope this is the correct place to report/discuss such an issue!
History
#1 Updated by Michael Gecht 11 months ago
Forgot to mention: this looks like it is related to #3204. However, not a single ITP file of any system contains the "[intermolecular_interactions]" header, so I figured it has nothing to do with it? Also, that issue is supposed to be fixed in 2019.5, which still produces the error here.