Bug #2440

Multidir simulations can stop at different times when using mdrun -maxh

Added by Viveca Lindahl about 2 years ago. Updated about 1 month ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
mdrun
Target version:
Affected version - extra info:
GROMACS: gmx mdrun, version 2018.1-dev-20180306-33093601f
Affected version:
Difficulty:
uncategorized

Description

I am running mdrun -multidir dir-{1..4} (together with awh-share-multisim, but I think it's irrelevant). I'm running on a cluster, submitting many jobs with -maxh 0.25 -cpi. At some point I get the following (see attached slurm-last-ok.out):


Step 57822720: Run time exceeded 0.247 hours, will terminate the run

Step 57822760: Run time exceeded 0.247 hours, will terminate the run

Step 57822760: Run time exceeded 0.247 hours, will terminate the run

Step 57822760: Run time exceeded 0.247 hours, will terminate the run

At the next restart I get a fatal error (see slurm-first-error.out):

simulation part is not equal for all subsystems
  subsystem 0: 52
  subsystem 1: 51
  subsystem 2: 51
  subsystem 3: 51

-------------------------------------------------------
Program:     gmx mdrun, version 2018.1-dev-20180306-33093601f
Source file: src/gromacs/mdlib/main.cpp (line 115)
MPI rank:    0 (out of 128)

Fatal error:
The 4 subsystems are not compatible

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors


What is the intended behavior here? I think it should either be OK for the different simulations to start at different steps, or they should never terminate at different steps.

slurm-first-error.out (3.35 KB) slurm-first-error.out Viveca Lindahl, 03/10/2018 12:52 PM
slurm-last-ok.out (19.9 KB) slurm-last-ok.out Viveca Lindahl, 03/10/2018 12:52 PM
md-0.log (495 KB) md-0.log Viveca Lindahl, 03/10/2018 12:52 PM

Related issues

Related to GROMACS - Bug #2447: "gmx mdrun -maxh" not working with AWH Closed
Related to GROMACS - Bug #3448: GMX 2020.1 - Multidir simulations can stop at different times when killed by job manager New

Associated revisions

Revision 717fd5b8 (diff)
Added by Berk Hess about 2 years ago

Fix AWH ensemble checkpointing

AWH multi-runs with shared bias could terminate at different steps
leading to checkpointing from which continuation was impossible.

Fixes #2440.

Change-Id: I728c6aea9d030fcfad04efc6082309dafe3408bd

Revision b1d72b28 (diff)
Added by Berk Hess about 2 years ago

Fix mdrun signalling with AWH bias sharing

The recent commit 717fd5b8, which should have fixed issue #2440,
disabled checkpointing and termination by disabling signalling
between AWH runs that share a bias.

Fixes #2447
Refs #2440

Change-Id: Ic80ea07cfa1bde1b31fa9d660f00492b794b000f

Revision e4deba12 (diff)
Added by Berk Hess about 2 years ago

Improve inter-simulation signaling

Use a single boolean for controlling inter-simulation signaling.
This would have prevented issue #2447.
Introduce an inter-simulation signaling frequency. This both
simplifies the code and significantly reduces communication overhead
for many simulations and/or many ranks per simulation.
Correct the termination step print with replica exchange.

Refs #2440 and #2447

Change-Id: I8ed50c5b11564d9a967bcacb15c4fcd3be05caea
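
In outline, the scheme described above amounts to gating the inter-simulation reduction on a single flag and a fixed step interval. A minimal sketch follows; the identifiers (doInterSimSignal, c_interSimSignalFrequency, communicateSignals) are illustrative assumptions, not the actual GROMACS names.

#include <mpi.h>

// Hypothetical signaling interval: communicating only every N-th step cuts
// the reduction overhead when there are many simulations and/or ranks.
constexpr int c_interSimSignalFrequency = 10;

struct SignalState
{
    bool doInterSimSignal; // single boolean controlling inter-sim signaling
    int  stopSignal;       // accumulated stop/checkpoint request
};

// Called on the master rank of each simulation, with a communicator over
// all simulation masters.
void communicateSignals(SignalState* s, long step, MPI_Comm mastersComm)
{
    if (!s->doInterSimSignal || step % c_interSimSignalFrequency != 0)
    {
        return; // no inter-simulation communication this step
    }
    // Take the maximum over all simulations, so a stop signal raised in one
    // simulation reaches all of them and they terminate at the same step.
    int localSignal = s->stopSignal;
    MPI_Allreduce(&localSignal, &s->stopSignal, 1, MPI_INT, MPI_MAX, mastersComm);
}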

Revision 0fa62d3e (diff)
Added by Berk Hess about 1 month ago

Add fatal error when multisim runs sharing state have different init_step

Refs #2440
Fixes #3990

Change-Id: I052cd53dc9517a3df53663e52720ee5a80ea65c0
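
A sketch of what such a consistency check could look like; the function name and the exact message are assumptions, and in GROMACS the error would go through gmx_fatal() rather than plain stderr.

#include <mpi.h>
#include <cstdio>
#include <vector>

// Run on the master rank of each of the numSims simulations, with a
// communicator over all simulation masters.
void checkInitStepsMatch(long initStep, int numSims, MPI_Comm mastersComm)
{
    std::vector<long> initSteps(numSims);
    // Gather init_step from every simulation.
    MPI_Allgather(&initStep, 1, MPI_LONG, initSteps.data(), 1, MPI_LONG, mastersComm);
    for (int sim = 1; sim < numSims; sim++)
    {
        if (initSteps[sim] != initSteps[0])
        {
            std::fprintf(stderr,
                         "Fatal error: simulations sharing state have different "
                         "init_step values (sim 0: %ld, sim %d: %ld)\n",
                         initSteps[0], sim, initSteps[sim]);
            MPI_Abort(mastersComm, 1);
        }
    }
}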

Revision ade8e131 (diff)
Added by Magnus Lundborg about 1 month ago

Fix out of sync checkpoint files in simulations sharing state

When multidir simulations share the state, the checkpoint files
of the different simulations should all be from the same step.
To ensure this, MPI barriers have been added before renaming
the checkpoint files from their temporary to their final names.
So now the contents can never be out of sync. In the worst, and
rather unlikely, case that something goes wrong during renaming,
some checkpoint files could have temporary names and some final
names.

Refs #2440.

Change-Id: I88088abb726a36dbf9a9db2fa2eb4a46c3bf2cd7
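
The barrier-before-rename pattern can be sketched as follows; the names are illustrative, not the actual GROMACS code.

#include <mpi.h>
#include <cstdio>

// Each simulation first writes its checkpoint to a temporary file; before
// any of them renames it to the final name, all simulations sharing state
// wait for each other, so the final checkpoints are all from the same step.
void finalizeCheckpoint(const char* tmpName, const char* finalName,
                        bool simulationsShareState, MPI_Comm mastersComm)
{
    if (simulationsShareState)
    {
        MPI_Barrier(mastersComm); // all temporary files are complete here
    }
    // If the job dies between the barrier and this rename, some files keep
    // their temporary names: the unlikely worst case noted above.
    std::rename(tmpName, finalName);
}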

History

#1 Updated by Mark Abraham about 2 years ago

Possibly same issue as #2233

#2 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '1' for Issue #2440.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2018~I728c6aea9d030fcfad04efc6082309dafe3408bd
Gerrit URL: https://gerrit.gromacs.org/7664

#3 Updated by Berk Hess about 2 years ago

  • Status changed from New to Fix uploaded
  • Assignee set to Berk Hess
  • Target version set to 2018.1

#4 Updated by Berk Hess about 2 years ago

  • Status changed from Fix uploaded to Resolved

#5 Updated by Viveca Lindahl about 2 years ago

  • Related to Bug #2447: "gmx mdrun -maxh" not working with AWH added

#6 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '1' for Issue #2440.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2018~Ic80ea07cfa1bde1b31fa9d660f00492b794b000f
Gerrit URL: https://gerrit.gromacs.org/7675

#7 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '1' for Issue #2440.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2018~I8ed50c5b11564d9a967bcacb15c4fcd3be05caea
Gerrit URL: https://gerrit.gromacs.org/7679

#8 Updated by Mark Abraham about 2 years ago

  • Status changed from Resolved to Closed

#9 Updated by Magnus Lundborg about 2 months ago

  • Status changed from Closed to Accepted
  • Target version changed from 2018.1 to 2019.6
  • Affected version changed from 2018.1 to 2019.5

This still happens in 2019.5, so I am reopening this. I am not sure whether it is related to -maxh or just to jobs getting terminated, e.g., by a Slurm timeout. I have only suffered from it when running AWH, but since that is the only way I use -multidir simulations, I cannot say whether it is related to AWH or not.

#10 Updated by Magnus Lundborg about 2 months ago

  • Target version changed from 2019.6 to 2020.2

#11 Updated by Magnus Lundborg about 2 months ago

  • Target version changed from 2020.2 to 2020.1

#12 Updated by Daniel Kozuch about 2 months ago

Magnus Lundborg wrote:

This still happens in 2019.5, so I am reopening this. I am not sure whether it is related to -maxh or just to jobs getting terminated, e.g., by a Slurm timeout. I have only suffered from it when running AWH, but since that is the only way I use -multidir simulations, I cannot say whether it is related to AWH or not.

Apologies if this is not the right place to post this, but I can confirm that the issue appears in 2019.5 during normal replica exchange without -maxh (terminated by Slurm).

#13 Updated by Mark Abraham about 1 month ago

Fundamentally, there is no promise of coherency between MPI operations and filesystem state, which is particularly problematic when multiple output files are involved. The best one can do is try to read the output file back to observe that its state is consistent, and communicate about that. But even that is not a promise that the file was flushed to physical disk.
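
For illustration, the read-back-and-communicate idea could look like the sketch below (all names are hypothetical); even when it succeeds, it proves only that the filesystem can serve the data back, not that it reached physical disk.

#include <mpi.h>
#include <cstdio>

// Each simulation master checks that its own checkpoint file can be opened,
// then all of them agree on the outcome before proceeding.
bool checkpointReadableEverywhere(const char* fileName, MPI_Comm mastersComm)
{
    std::FILE* fp      = std::fopen(fileName, "rb");
    int        localOk = (fp != nullptr) ? 1 : 0;
    if (fp)
    {
        std::fclose(fp);
    }
    int allOk = 0;
    // MPI_MIN: proceed only if every simulation found its file readable.
    MPI_Allreduce(&localOk, &allOk, 1, MPI_INT, MPI_MIN, mastersComm);
    return allOk == 1;
}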

#14 Updated by Paul Bauer about 1 month ago

  • Status changed from Accepted to Resolved

The main issue here should be fixed by the related patches, but it again shows that more work is needed here.

#15 Updated by Paul Bauer about 1 month ago

  • Status changed from Resolved to Closed

#16 Updated by Szilárd Páll 14 days ago

  • Related to Bug #3448: GMX 2020.1 - Multidir simulations can stop at different times when killed by job manager added
