Bug #2298

Dynamic Load Balancing Crash: GROMACS 2016.3

Added by Daniel Kozuch over 2 years ago. Updated over 2 years ago.




Mark asked me to file this bug. From the original email:

I recently started experiencing an error with GROMACS 2016.3 during a replica exchange simulation with 80 replicas, 480 CPUs, and 40 GPUs:

Assertion failed:
Condition: comm->cycl_n[ddCyclStep] > 0
When we turned on DLB, we should have measured cycles

The simulation then crashes. I turned off DLB with the flag "-dlb no" and the error did not resurface, so I have to assume it really is DLB causing the issue.

I have attached the output of grompp, the .tpr file, and a sample log file.

My general REMD setup was 80 folders, each holding a simulation at a different temperature. The execution command was:
srun $gmx mdrun -deffnm ${protein}_sim -multidir $dirs -replex 500 > mdrun_sim.txt 2>&1

For cluster-specific reasons, I was running on 40 nodes, with 1 GPU and 12 CPUs per node.
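
For reference, the workaround mentioned above amounts to appending "-dlb no" to the same command. The sketch below only assembles and prints that command line; the default values for $gmx and $dirs are assumptions for illustration (the real job script sets them), and the protein name is taken from the attached 1tsk_sim.tpr.

```shell
# Assumed defaults; the original job script defines these variables itself.
gmx=${gmx:-gmx_mpi}          # assumption: an MPI-enabled GROMACS binary
protein=${protein:-1tsk}     # matches the attached 1tsk_sim.tpr
dirs=${dirs:-"rep_0 rep_1"}  # hypothetical replica directory names

# Same command as in the report, with dynamic load balancing disabled.
cmd="srun $gmx mdrun -deffnm ${protein}_sim -multidir $dirs -replex 500 -dlb no"

# Print the full line, including the output redirection used in the report.
echo "$cmd > mdrun_sim.txt 2>&1"
```

Printing rather than executing keeps the sketch harmless outside the cluster; in a job script the printed line would be run directly.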

Let me know if you need more info.

grompp_sim.txt (3.55 KB) Daniel Kozuch, 11/19/2017 11:06 PM
#1tsk_sim.log.1# (300 KB) Daniel Kozuch, 11/19/2017 11:06 PM
1tsk_sim.tpr (515 KB) Daniel Kozuch, 11/19/2017 11:06 PM

Associated revisions

Revision aa8b4721 (diff)
Added by Berk Hess over 2 years ago

Do not turn on DLB at replica exchange

Turning on DLB right after exchanging replicas caused an assertion
failure and is also useless.

Fixes #2298

Change-Id: I20c3cb6ef3d74907d53d447fa9b0c9168f03c769


#1 Updated by Berk Hess over 2 years ago

  • Status changed from New to Accepted
  • Assignee set to Berk Hess

I don't see how this bug could occur.
But why are you using domain decomposition? I think you would get much better performance using only 1 MPI rank per replica.
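
One way to read this suggestion for the reported hardware (40 nodes, 12 CPUs and 1 GPU per node, 80 replicas) is sketched below. The even split of cores, the srun options, and the use of -ntomp are assumptions for illustration, not settings taken from this thread; GPU assignment is left to mdrun or site configuration.

```shell
# Hardware and replica counts as stated in the report.
nodes=40
cpus_per_node=12
replicas=80

# Derived layout, assuming cores are split evenly across replicas.
total_cpus=$((nodes * cpus_per_node))            # 480 cores in total
threads_per_replica=$((total_cpus / replicas))   # cores available to each replica
replicas_per_node=$((replicas / nodes))          # replicas sharing one node and its GPU

# One MPI rank per replica, so no domain decomposition within a replica.
# \$gmx, \$protein and \$dirs are printed literally; the job script fills them in.
echo "srun --ntasks=$replicas --cpus-per-task=$threads_per_replica \$gmx mdrun -ntomp $threads_per_replica -deffnm \${protein}_sim -multidir \$dirs -replex 500"
echo "replicas per node: $replicas_per_node"
```

With a single rank per replica there is no domain decomposition to balance, so DLB would not come into play for such a run.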

#2 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '1' for Issue #2298.
Uploader: Berk Hess
Change-Id: gromacs~release-2016~I20c3cb6ef3d74907d53d447fa9b0c9168f03c769

#3 Updated by Berk Hess over 2 years ago

  • Status changed from Accepted to Fix uploaded
  • Target version set to 2016.5

The assertion failure is likely caused by DLB being turned on right after replica exchange. I uploaded a fix for the 2016 release.

#4 Updated by Berk Hess over 2 years ago

  • Status changed from Fix uploaded to Resolved

#5 Updated by Berk Hess over 2 years ago

#6 Updated by Mark Abraham over 2 years ago

  • Status changed from Resolved to Closed
