Checkpoint not created upon reaching time given in maxh
I'm having a problem with GROMACS not terminating as expected when using the -maxh option.
It occurs when doing a REMD simulation with infinite cutoffs.
It does not occur for the first run, but only for the second run that was started from the first run checkpoints.
I have been using 2 processors for each replica.
The version used is 4.5.5 with a bug fix applied from
I'm specifying -maxh 24 and, as expected, see the following in the stderr output:
Step 773882: Run time exceeded 23.760 hours, will terminate the run
Step 773876: Run time exceeded 23.760 hours, will terminate the run
Step 773880: Run time exceeded 23.760 hours, will terminate the run
However, I can see that the output files continued to be written for
another hour until at 25 hours the simulation was terminated by the
No checkpoint files were produced. The output files show that the
simulation continued until about step 797000.
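
For reference, the 23.760 hours in the messages above matches the documented mdrun behaviour of terminating after 0.99 times the -maxh value. The following is only a small arithmetic sketch; the function name is hypothetical, and the 25-hour figure is the approximate kill time from this report, not something computed by GROMACS.

```python
# Arithmetic sketch only; not GROMACS code.
# mdrun documents -maxh as "terminate after 0.99 times this time (hours)".
MAXH_FRACTION = 0.99


def termination_threshold_hours(maxh):
    """Hypothetical helper: wall time at which mdrun should start terminating."""
    return MAXH_FRACTION * maxh


threshold = termination_threshold_hours(24.0)
print(f"{threshold:.3f}")  # 23.760, as in the stderr messages

# The run in this report was killed at roughly 25 hours instead,
# so it overran the threshold by a bit more than an hour.
overrun_hours = 25.0 - threshold
print(f"{overrun_hours:.2f}")
```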
I attach the cpt and tpr files for starting a 2 replica simulation that exhibits this problem.
Removed unnecessary inter-simulation signalling
Generally, multi-simulation runs do not need to couple the simulations
(discussion at #692). Individual algorithms implemented with
multi-simulations might need to do so, but should take care of their
own details, and now do. Scaling should improve in the cases where
simulations are now decoupled.
It is unclear what the expected behaviour of a multi-simulation should
be if the user supplies any of the possible non-uniform distributions
of init_step and nsteps, sourced from any of .mdp, .cpt or command
line. Instead, we report on the non-uniformity and proceed. It's
always possible that the user knows what they are doing. In
particular, now that multi-simulations are no longer explicitly
coupled, any heterogeneity in the execution environment will lead to
checkpoints and -maxh acting at different time steps, unless a
user-selected algorithm requires that the simulations stay coordinated
(e.g. REMD or ensemble restraints).
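
To illustrate the point about heterogeneity, here is a toy sketch (not GROMACS code) of two uncoupled simulations that run at different speeds and therefore cross the same -maxh limit at different steps. The function name and the per-step timings are made up for illustration; only the 0.99 factor comes from the mdrun documentation.

```python
# Toy model only: two uncoupled simulations with different speeds.
def first_step_past_limit(ms_per_step, maxh_hours):
    """Hypothetical helper: first step whose completion time exceeds 0.99 * maxh."""
    limit_ms = round(0.99 * maxh_hours * 3_600_000)
    # Step n completes at n * ms_per_step; find the smallest n past the limit.
    return limit_ms // ms_per_step + 1


# Assumed timings: one replica on a faster node, one on a slower node.
fast = first_step_past_limit(ms_per_step=500, maxh_hours=1)
slow = first_step_past_limit(ms_per_step=600, maxh_hours=1)
print(fast, slow)  # the two simulations stop at different steps
```

An algorithm such as REMD that needs the simulations to stop together has to arrange that coordination itself.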
In the implementation of signalling, we have stopped checking gs for
NULL as a proxy for whether we should be doing signalling at that
communication phase. Replaced with a helper object in which explicit
flags are set. Added unit tests of that functionality.
Improved documentation of check_nstglobalcomm. mdrun now reports the
number of steps between intra-simulation communication to the
Noted minor TODOs for future cleanup.
Added some trivial test cases for termination by maxh in normal-MD,
multi-sim and REMD cases. Refactored multi-sim tests to make this
possible without duplication. This is complicated by the way filenames
get changed by mdrun -multi by the former par_fn, so cleaned up the
way that is handled so it can work and be re-used better. Introduced
mdrun integration-test object library to make that build system work a
little better. Made some minor improvements to Doxygen setup for
#1 Updated by Mark Abraham over 8 years ago
I think this issue comes from the lack of neighbour-search steps with infinite cutoffs and the fact that GROMACS has a confused implementation of signalling that doesn't properly separate inter- and intra-simulation communication. I touch on some of those issues in #692.
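
The hypothesis above can be illustrated with a toy model, assuming (as the comment suggests, but not confirmed here) that a stop request is only acted on at neighbour-search steps. This is not GROMACS code; the function and its parameters are hypothetical, and nstlist = 0 stands in for a run with infinite cutoffs and no repeated neighbour searching.

```python
# Toy model of the suspected failure mode; not the GROMACS implementation.
def run(nstlist, total_steps, signal_step):
    """Return the step at which the toy run stops.

    A stop request raised at signal_step is only handled on a
    neighbour-search step. With nstlist == 0 there are no such steps.
    """
    stop_requested = False
    for step in range(total_steps):
        if step >= signal_step:
            stop_requested = True  # e.g. the -maxh threshold was exceeded
        is_ns_step = nstlist > 0 and step % nstlist == 0
        if stop_requested and is_ns_step:
            return step  # stop handled here; a checkpoint would be written
    return total_steps  # ran to the end without ever handling the request


print(run(nstlist=10, total_steps=100, signal_step=42))  # 50
print(run(nstlist=0, total_steps=100, signal_step=42))   # 100
```

In the second call the request is raised but never handled, which mirrors a run that keeps going past the -maxh threshold until something external kills it.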
I believe I have a fix in a private development branch, but it is a mess to extract it from the REMD upgrade in that branch.
In the meantime, if you could supply the .mdp, .top and .gro files from which the above .tpr files are made, then I can test whether my branch fixes this issue. If so, then I'll be able to publish a general fix.
#2 Updated by Ben Reynwar over 8 years ago
- File tahsp_amber03_chirre.itp added
- File run0_0.mdp added
- File run0_1.mdp added
- File tahsp_starting.pdb added
- File tahsp_amber03.top added
- File tahsp_amber03_alphacryst_posre.itp added
Here are the additional files requested.
#8 Updated by Mark Abraham over 7 years ago
- Target version changed from 4.5.6 to 4.6.1
https://gerrit.gromacs.org/416 should solve this, but as Berk says there, the code needs some battle testing to be sure it's worthy. I have some simulations to run with 4.6, so I'll try it out there and report back for 4.6.1.