Bug #860

Checkpoint not created upon reaching time given in maxh

Added by Ben Reynwar almost 8 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

I'm having a problem with GROMACS not terminating as expected when using the -maxh option.
It occurs when doing a REMD simulation with infinite cutoffs.
It does not occur for the first run, but only for the second run that was started from the first run checkpoints.
I have been using 2 processors for each replica.

The version used is 4.5.5 with a bug fix applied from
http://lists.gromacs.org/pipermail/gmx-developers/2011-October/005405.html

I'm specifying -maxh 24 and as expected see the following in the stderr output.

Step 773882: Run time exceeded 23.760 hours, will terminate the run
Step 773876: Run time exceeded 23.760 hours, will terminate the run
Step 773880: Run time exceeded 23.760 hours, will terminate the run
etc

However I can see that the output files continued to be written for
another hour until at 25 hours the simulation was terminated by the
queueing system.
No checkpoint files were produced. The output files show that the
simulation continued until about step 797000.
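The 23.760-hour figure in the messages above is consistent with mdrun requesting a stop at 0.99 of the -maxh value. A minimal sketch of that arithmetic, assuming the 0.99 factor and using the step numbers quoted in this report (the function names are illustrative, not part of GROMACS):

```python
# Assumption: mdrun requests a stop at 0.99 of the -maxh value, which
# matches the 23.760 hours printed in the log for -maxh 24.
MAXH_FRACTION = 0.99


def maxh_threshold_hours(maxh: float) -> float:
    """Wall-clock time (hours) after which a stop is requested."""
    return MAXH_FRACTION * maxh


def overshoot_steps(first_stop_step: int, last_written_step: int) -> int:
    """Number of steps run after the stop was first requested."""
    return last_written_step - first_stop_step


threshold = maxh_threshold_hours(24.0)
# 773876 is the lowest step in the quoted messages; 797000 is the
# approximate last step seen in the output files.
extra = overshoot_steps(773876, 797000)
print(f"stop requested after {threshold:.3f} h")
print(f"about {extra} steps ran after the first stop message")
```

In other words, roughly 23,000 further steps were run after the stop had been announced, without a checkpoint being written.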

I attach the cpt and tpr files for starting a 2 replica simulation that exhibits this problem.

run0_0.cpt (174 KB), Ben Reynwar, 01/10/2012 05:58 PM
run0_0.tpr (924 KB), Ben Reynwar, 01/10/2012 05:58 PM
run0_1.cpt (174 KB), Ben Reynwar, 01/10/2012 05:58 PM
run0_1.tpr (924 KB), Ben Reynwar, 01/10/2012 05:58 PM
tahsp_amber03_chirre.itp (18.3 KB), Ben Reynwar, 01/11/2012 06:34 PM
run0_0.mdp (4.04 KB), Ben Reynwar, 01/11/2012 06:34 PM
run0_1.mdp (4.04 KB), Ben Reynwar, 01/11/2012 06:34 PM
tahsp_starting.pdb (366 KB), Ben Reynwar, 01/11/2012 06:34 PM
tahsp_amber03.top (1.39 KB), Ben Reynwar, 01/11/2012 06:34 PM
tahsp_amber03_alphacryst_posre.itp (3.97 KB), Ben Reynwar, 01/11/2012 06:34 PM
run0_0_p.top (739 KB), Ben Reynwar, 01/12/2012 09:27 PM
run0_1_p.top (739 KB), Ben Reynwar, 01/12/2012 09:27 PM
tahsp_amber03_Protein.itp (674 KB), Ben Reynwar, 01/12/2012 09:27 PM

Related issues

Related to GROMACS - Bug #692: Frequency of checking for inter-simulation signalling is too high for large-scale parallel REMD (Closed)
Related to GROMACS - Feature #1500: Post-5.0 feature clean-up plan (New, 01/20/2014)

Associated revisions

Revision d5bd278b
Added by Mark Abraham over 3 years ago

Removed unnecessary inter-simulation signalling

Generally, multi-simulation runs do not need to couple the simulations
(discussion at #692). Individual algorithms implemented with
multi-simulations might need to do so, but should take care of their
own details, and now do. Scaling should improve in the cases where
simulations are now decoupled.

It is unclear what the expected behaviour of a multi-simulation should
be if the user supplies any of the possible non-uniform distributions
of init_step and nsteps, sourced from any of .mdp, .cpt or command
line. Instead, we report on the non-uniformity and proceed. It's
always possible that the user knows what they are doing. In
particular, now that multi-simulations are no longer explicitly
coupled, any heterogeneity in the execution environment will lead to
checkpoints and -maxh acting at different time steps, unless a
user-selected algorithm requires that the simulations stay coordinated
(e.g. REMD or ensemble restraints).
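The "report on the non-uniformity and proceed" policy described above can be illustrated with a small sketch. This is not the GROMACS implementation; the function name, message wording and example values are hypothetical, and only the parameter names init_step and nsteps come from the commit message.

```python
from typing import Optional, Sequence


def report_nonuniform(name: str, values: Sequence[int]) -> Optional[str]:
    """Return a note if `values` (one per simulation) differ, else None.

    The caller prints the note and carries on; nothing is treated as an error.
    """
    if len(set(values)) <= 1:
        return None
    listing = ", ".join(f"sim {i}: {v}" for i, v in enumerate(values))
    return (f"Note: {name} is not uniform across simulations "
            f"({listing}); proceeding anyway")


# Hypothetical values for a two-replica run.
for name, values in (("init_step", [0, 0]), ("nsteps", [500000, 400000])):
    note = report_nonuniform(name, values)
    if note is not None:
        print(note)
```

The design choice is that heterogeneous inputs are surfaced to the user rather than rejected, since the user may have chosen them deliberately.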

In the implementation of signalling, we have stopped checking gs for
NULL as a proxy for whether we should be doing signalling at that
communication phase. Replaced with a helper object in which explicit
flags are set. Added unit tests of that functionality.
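As a rough illustration of replacing a NULL check with explicit flags, the sketch below uses a small helper object that states which kinds of signalling should happen at a given step. All names here are hypothetical and the scheduling rule is an assumption made for the example; it is not the code from the revision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignallingPlan:
    """Hypothetical helper: states explicitly which signalling to do."""
    do_intra_sim: bool
    do_inter_sim: bool


def plan_for_step(step: int, nstglobalcomm: int, is_multi_sim: bool,
                  algorithm_needs_coupling: bool) -> SignallingPlan:
    # Assumed rule: intra-simulation signalling happens on
    # global-communication steps only.
    do_intra = nstglobalcomm > 0 and step % nstglobalcomm == 0
    # Assumed rule: inter-simulation signalling happens only when an
    # algorithm (e.g. REMD) requires the simulations to stay coordinated.
    do_inter = do_intra and is_multi_sim and algorithm_needs_coupling
    return SignallingPlan(do_intra, do_inter)


print(plan_for_step(10, 10, is_multi_sim=True, algorithm_needs_coupling=False))
print(plan_for_step(10, 10, is_multi_sim=True, algorithm_needs_coupling=True))
```

Callers then branch on named flags instead of inferring intent from whether a pointer happens to be set, which also makes the behaviour straightforward to unit-test.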

Improved documentation of check_nstglobalcomm. mdrun now reports the
number of steps between intra-simulation communication to the
log file.

Noted minor TODOs for future cleanup.

Added some trivial test cases for termination by maxh in normal-MD,
multi-sim and REMD cases. Refactored multi-sim tests to make this
possible without duplication. This is complicated by the way filenames
get changed by mdrun -multi by the former par_fn, so cleaned up the
way that is handled so it can work and be re-used better. Introduced
mdrun integration-test object library to make that build system work a
little better. Made some minor improvements to Doxygen setup for
integration tests.

Fixes #860, #692, #1857, #1942.

Change-Id: I5f7b98f331db801b058ae2b196d79716b5912b09

History

#1 Updated by Mark Abraham almost 8 years ago

I think this issue comes from the lack of neighbour-search steps with infinite cutoffs and the fact that GROMACS has a confused implementation of signalling that doesn't properly separate inter- and intra-simulation communication. I touch on some of those issues in #692.

I believe I have a fix in a private development branch, but it is a mess to extract it from the REMD upgrade in that branch.

In the meantime, if you could supply the .mdp, .top and .gro files from which the above .tpr files are made, then I can test whether I do fix this issue. If so, then I'll be able to publish a general fix.
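The hypothesis in this comment can be pictured with a toy model in which a stop request is only acted upon at a neighbour-search step. This is a deliberately simplified sketch, not mdrun's actual logic; it assumes that infinite cutoffs correspond to nstlist being 0, so that no periodic neighbour-search step ever occurs.

```python
from typing import Optional


def step_at_which_run_stops(stop_requested_at: int, nstlist: int,
                            last_step: int) -> Optional[int]:
    """Toy model: a stop request is honoured only at a neighbour-search step.

    Returns the step at which the run stops, or None if no qualifying
    step occurs up to and including last_step.
    """
    if nstlist <= 0:
        # No periodic neighbour search, so in this toy model no step
        # ever qualifies and the request is never acted upon.
        return None
    for step in range(stop_requested_at, last_step + 1):
        if step % nstlist == 0:
            return step
    return None


print(step_at_which_run_stops(773876, 10, 797000))  # → 773880
print(step_at_which_run_stops(773876, 0, 797000))   # → None
```

With a periodic neighbour search the run stops within a few steps of the request; without one, the model never stops, which matches the symptom reported in the description.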

#3 Updated by Mark Abraham almost 8 years ago

Thanks, but it seems tahsp_amber03_Protein.itp is needed.

If you could re-run grompp using the -pp option, that will generate a stand-alone .top that is easy to distribute. Then I can use that with my customized grompp.

#4 Updated by Ben Reynwar almost 8 years ago

Woops. I'm adding the extra itp file, and also the stand-alone tops.

Thanks a bunch for looking at this.

#5 Updated by Rossen Apostolov over 7 years ago

  • Target version set to 4.5.6

Mark, do you want to apply your patch for 4.5.6?

#6 Updated by Roland Schulz over 7 years ago

What's the status on this? Mark, do you have a bugfix? If so, is it already on Gerrit?

#7 Updated by Roland Schulz about 7 years ago

  • Assignee set to Mark Abraham

#8 Updated by Mark Abraham almost 7 years ago

  • Target version changed from 4.5.6 to 4.6.1

https://gerrit.gromacs.org/416 should solve this, but as Berk says there, the code needs some battle testing to be sure it's worthy. I have some simulations to run with 4.6, so I'll try it out there and report back for 4.6.1.

#9 Updated by Mark Abraham over 6 years ago

  • Status changed from New to Accepted
  • Affected version set to 4.5.5

#10 Updated by Mark Abraham over 6 years ago

  • Status changed from Accepted to In Progress

#11 Updated by Mark Abraham over 6 years ago

  • Target version deleted (4.6.1)

#12 Updated by Rossen Apostolov over 5 years ago

  • Related to Bug #692: Frequency of checking for inter-simulation signalling is too high for large-scale parallel REMD added

#13 Updated by Rossen Apostolov over 5 years ago

  • Target version set to 4.6.x

#14 Updated by Mark Abraham over 5 years ago

#15 Updated by Mark Abraham over 5 years ago

  • Category set to mdrun
  • Status changed from In Progress to Accepted
  • Target version changed from 4.6.x to 5.x

Once we have cleaned out the heuristic group-scheme neighbourlist updates as part of #1500, it will be easy to ensure this behaviour is corrected.

#16 Updated by Gerrit Code Review Bot almost 5 years ago

Gerrit received a related patchset '1' for Issue #860.
Uploader: Mark Abraham
Change-Id: I5f7b98f331db801b058ae2b196d79716b5912b09
Gerrit URL: https://gerrit.gromacs.org/4312

#17 Updated by Mark Abraham over 3 years ago

  • Status changed from Accepted to Fix uploaded

#18 Updated by Mark Abraham over 3 years ago

https://gerrit.gromacs.org/#/c/5899/11 resolves this issue: there are now tests showing that mdrun multi-simulations write a checkpoint after maxh. It won't be possible to fix any previous release of GROMACS, because of earlier design mistakes.

#19 Updated by Mark Abraham over 3 years ago

  • Target version changed from 5.x to 2016

#20 Updated by Mark Abraham over 3 years ago

  • Status changed from Fix uploaded to Resolved

#21 Updated by Mark Abraham over 3 years ago

  • Status changed from Resolved to Closed
