Bug #3412

Domain decomposition problems with Gromacs >2018

Added by Michael Gecht 8 months ago. Updated 8 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
Affected version - extra info: 2019.6, 2020, 2020.1
Affected version: -
Difficulty: uncategorized

Description

I am trying to benchmark a number of systems with different Gromacs versions, but encounter problems with the two latest major Gromacs versions (2019.x and 2020.x). All systems in question are equilibrated and have run as production simulations for several microseconds each (with version 2018 or lower). I am encountering two kinds of errors:

Error 1

[...]
Not all bonded interactions have been properly assigned to the domain decomposition cells
A list of missing interactions:
                Bond of 246240 missing   1286
               Angle of 914085 missing   6160
         Proper Dih. of 1428030 missing  14240
       Improper Dih. of 114615 missing    901
               LJ-14 of 1321920 missing   9255
[...]

Error 2

[...]
NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'Protein in water'
25000000 steps,  50000.0 ps.

-------------------------------------------------------
Program:     gmx mdrun, version 2019.5
Source file: src/gromacs/ewald/pme-redistribute.cpp (line 282)
MPI rank:    17 (out of 20)

Fatal error:
64 particles communicated to PME rank 17 are more than 2/3 times the cut-off
out of the domain decomposition cell of their charge group in dimension y.
This usually means that your system is not well equilibrated.

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
Abort(1) on node 17 (rank 17 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 17
In: PMI_Abort(1, application called MPI_Abort(MPI_COMM_WORLD, 1) - process 17)
[...]

The problem appears for systems of various sizes (100k, 200k, 300k, 3.6m atoms), force fields (charmm36, charmm36m, amber99sb-star-ildn-q) and system compositions (single protein in solution, dense protein solutions and protein-membrane systems). It does not appear for a number of other systems that I am testing with the same range of sizes, force fields and compositions.

Attached are log files for Gromacs 2018.8 (works), 2019.6 (does not work) and 2020.1 (does not work), as well as an example submission file. 2019.5 and 2020 do not work either. Comparing the 2018.8 and 2019.6 logs, it becomes apparent that newer Gromacs versions decrease the domain size and increase the number of domains. In addition, the maximum distances for bonded interactions are increased.
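For reference, the relevant lines can be pulled out of the two logs with something like the commands below; the exact wording of the log lines may differ between versions, so the search patterns are only a rough guide:

grep -i "domain decomposition grid" 2018.8_md.log 2019.6_md.log
grep -i "bonded interactions" 2018.8_md.log 2019.6_md.log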

I am happy to share a 3.65m atom system that experiences these problems privately via e-mail using Nextcloud.

Things I tried

All tests were run on an in-house cluster using SLURM. The same problems can be reproduced on a different machine (MPCDF cobra cluster).
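For completeness, the submission scripts all follow roughly the pattern below; the partition name, module name and rank/thread counts are placeholders, the actual script is attached as bench.job:

#!/bin/bash
#SBATCH --job-name=gmx-bench
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
#SBATCH --cpus-per-task=2
#SBATCH --time=02:00:00
#SBATCH --partition=general

module purge
module load gromacs/2019.6   # placeholder module name, one job per Gromacs version

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun gmx_mpi mdrun -v -deffnm md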

Use original TPR file

The "original" TPR files were created either with 2016.x or 2018.x and were used to run the actual production simulations. Using these TPRs, the simulations run on 2018.x, but fail to start on 2019.x and 2020x.

Generate new TPRs

I generated a new TPR with each Gromacs version under test (2018.8, 2019.5, 2019.6, 2020, 2020.1) using:

gmx grompp -f md.mdp -c ref.gro -r ref.gro -n index.ndx -p topol.top -o md.tpr

Again, 2018.8 works, but more recent versions do not.
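To make sure each TPR really comes from the version under test, I switch versions before calling grompp, roughly like this (the environment module names are specific to our cluster and only illustrative):

for version in 2018.8 2019.5 2019.6 2020 2020.1; do
    module purge
    module load gromacs/${version}
    gmx grompp -f md.mdp -c ref.gro -r ref.gro -n index.ndx -p topol.top -o md_${version}.tpr
done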

Decrease max. distance of bonded interactions with -rdd

I tried using -rdd to set the maximum distance for bonded interactions in version 2019.6 to the value that 2018.8 reports in its md.log file. The system crashes with the same error.
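Concretely, the 2019.6 run then looked roughly like this; the 1.4 nm value is only a placeholder, the actual number was taken from the 2018.8 md.log:

srun gmx_mpi mdrun -rdd 1.4 -v -deffnm md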

Run on a login node

Running interactively on a login node (without queuing system) works!

gmx mdrun -v -deffnm md

Run with 1 MPI process / 1 OpenMP thread

Submission to the queue with only one MPI rank works!
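Depending on whether the MPI binary or the thread-MPI binary is used, that corresponds to roughly one of the following (shown for illustration only):

srun -n 1 gmx_mpi mdrun -ntomp 1 -v -deffnm md
gmx mdrun -ntmpi 1 -ntomp 1 -v -deffnm md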

I hope this is the correct place to report/discuss such an issue!

2019.6_md.log (17.1 KB), Gromacs 2019.6 log file: simulations crash (Michael Gecht, 03/06/2020 03:26 PM)
2018.8_md.log (39.9 KB), Gromacs 2018.8 log file: simulations run without problems (Michael Gecht, 03/06/2020 03:26 PM)
2020.1_md.log (25 KB), Gromacs 2020.1 log file: simulations crash (Michael Gecht, 03/06/2020 03:26 PM)
2019.6_tjob.err (3.15 KB), Gromacs 2019.6 tjob.err (Michael Gecht, 03/06/2020 03:51 PM)
2018.8_tjob.err (13.6 KB), Gromacs 2018.8 tjob.err (Michael Gecht, 03/06/2020 03:51 PM)
2020.1_tjob.err (9.3 KB), Gromacs 2020.1 tjob.err (Michael Gecht, 03/06/2020 03:51 PM)
bench.job (599 Bytes), SLURM submission file (Michael Gecht, 03/06/2020 03:54 PM)

History

#1 Updated by Michael Gecht 8 months ago

Forgot to mention: this looks related to #3204. However, none of the ITP files of any system contains an "[intermolecular_interactions]" header, so I figured it is unrelated. Also, that issue is supposed to be fixed in 2019.5, which still produces the error here.
