Bug #2233

replica exchange and -append bugged?

Added by Arthur Voronin over 2 years ago. Updated over 1 year ago.

Status:
Accepted
Priority:
Normal
Assignee:
-
Category:
mdrun
Target version:
-
Affected version - extra info:
Affected version:
Difficulty:
simple

Description

2 out of 14 replica exchange simulations didn't finish on my cluster because some nodes had performance problems.
I wanted to restart those 2 simulations and append everything to the old files. My command was:

mpirun gmx_mpi mdrun -s rex_.tpr -cpi state_.cpt -append -replex 1000 -multi 100 ...

However, I observed that all old files were backed up as #filename# and it looked like a new simulation had started, since the step counter began at 0 again. Did I do something wrong, or is there a bug with the -append option?

Best regards,
Arthur

logs.tar.gz (17.6 MB), Arthur Voronin, 08/25/2017 10:43 AM

Related issues

Related to GROMACS - Bug #1889: mdrun -cpi file presence dilemma (Rejected)
Related to GROMACS - Bug #2173: checkpoint restart with missing restart file is too verbose (Closed)
Related to GROMACS - Bug #942: -deffnm with enforced rotation code writes files that break checkpoint resume (Closed)

History

#1 Updated by Arthur Voronin over 2 years ago

Edit:
I had overlooked the warning: "No checkpoint file found with -cpi option. Assuming this is a new run."
Can somebody point out the correct way to tell GROMACS how to find the checkpoint files for replica exchange in version 2016.3?

#2 Updated by Mark Abraham over 2 years ago

  • Related to Bug #1889: mdrun -cpi file presence dilemma added

#3 Updated by Mark Abraham over 2 years ago

People say they want to be able to write job scripts with

mpirun gmx_mpi mdrun -cpi

that start from the .tpr when there is no checkpoint file written (yet), and from the checkpoint otherwise. I think this is a case of the code trying to be too smart: if the file is missing, the code literally cannot know whether the problem is a missing file, or whether the instruction to do a restart was wrong. In the simple case of a single simulation the "convenient" logic makes sense, but it creates problems when people build more complex workflows.
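
For illustration, a minimal job-script sketch of that pattern; the file names are just the GROMACS defaults and are purely illustrative, not taken from this report:

# On the first submission state.cpt does not exist yet, so mdrun starts from
# topol.tpr; on every later resubmission the same command continues from the
# checkpoint and appends to the existing output files.
mpirun gmx_mpi mdrun -s topol.tpr -cpi state.cpt -append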

In particular, in your case, if there was no checkpoint present for any simulation, then that simulation will want to start again, but the consistency checks for replica exchange should prevent anything from starting. I can't tell what went wrong at the moment; can you share some relevant log files, please?

By construction, replica exchange only writes self-consistent sets of checkpoint files. However, parallel file systems often don't actually commit files to disk when told to flush (and GROMACS can't do anything more about that, unfortunately), so the set of .cpt files on disk can appear inconsistent. There is always a backup .cpt file present, though, so with judicious use of gmx check you can find a consistent set of files by comparing the listed "simulation part" numbers.
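
A possible way to do that comparison, sketched on the assumption that gmx check reads .cpt files via its -f option; the file names follow the state<N>.cpt pattern and are illustrative:

# Print checkpoint info for every replica's state file (the glob also picks
# up the _prev backups), then compare the reported simulation part and time
# values across replicas by hand.
for cpt in state*.cpt; do
    echo "== $cpt =="
    gmx check -f "$cpt"
done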

Note that mdrun -multidir is more robust than mdrun -multi, because it keeps each simulation's set of files together in its own directory, without requiring that both mdrun and users follow a filename-mangling scheme that works correctly in every combination of -deffnm, -cpi, -append, etc. (a hypothetical -multidir restart is sketched at the end of this comment).

Accordingly I intend to deprecate mdrun -multi in 2017.
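
For comparison, a hypothetical restart with -multidir might look like the sketch below; the directory names, replica count, and rank count are invented for illustration:

# Each replica lives in its own directory holding its own topol.tpr and
# state.cpt under the default names, so no per-replica filename mangling
# (state0.cpt, state1.cpt, ...) is needed when restarting.
mpirun -np 4 gmx_mpi mdrun -multidir sim0 sim1 sim2 sim3 \
    -s topol.tpr -cpi state.cpt -append -replex 1000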

#4 Updated by Mark Abraham over 2 years ago

Note to self: mdrun -append should also have stopped the simulation from proceeding, because without a checkpoint it is not meaningful to append. Why did it not work?

#5 Updated by Mark Abraham over 2 years ago

  • Related to Bug #2173: checkpoint restart with missing restart file is too verbose added

#6 Updated by Mark Abraham over 2 years ago

  • Related to Bug #942: -deffnm with enforced rotation code writes files that break checkpoint resume added

#7 Updated by Arthur Voronin about 2 years ago

Mark Abraham wrote:

In particular, in your case, if there was no checkpoint present for any simulation, then that simulation will want to start again, but the consistency checks for replica exchange should prevent anything from starting. I can't tell what went wrong at the moment; can you share some relevant log files, please?

I attached a few log files. Note that I do have all .cpt files (state0.cpt to state59.cpt) present, as well as the _prev.cpt files. After seeing that it started a new simulation instead of appending, I stopped the second run. Also note that I didn't use -deffnm for my first run; could I just have some sort of syntax problem when telling GROMACS to append?

By construction, replica exchange only writes self-consistent sets of checkpoint files. However, parallel file systems often don't actually commit files to disk when told to flush (and GROMACS can't do anything more about that, unfortunately), so the set of .cpt files on disk can appear inconsistent. There is always a backup .cpt file present, though, so with judicious use of gmx check you can find a consistent set of files by comparing the listed "simulation part" numbers.

I checked the .cpt files with gmx check, and the reported times were either 205456.047, 205456.016, or 205456.000.
I'm not sure whether the time has to be the same for all replicas, or whether this still counts as consistent.

Best regards,
Arthur

#8 Updated by Mark Abraham over 1 year ago

  • Status changed from New to Accepted

Thanks for the log files, and sorry for the long delay. AFAIK those times in the checkpoint files should all be identical, so I think we have at least one issue to fix.
