Bug #509

Last frame not written when re-starting from a checkpoint

Added by Justin Lemkul almost 10 years ago. Updated almost 10 years ago.

Erik Lindahl


I have run a series of simulations, each 100 ns in length. I do the runs 10 ns at a time, and restart from checkpoints after each segment, i.e.:

tpbconv_4.0.7_s -s ./0_10ns/md_0_10.tpr -o md_10_20.tpr -until 20000
mpirun -np 24 mdrun_4.0.7_gcc_mpi -cpi ./0_10ns/md_0_10.cpt -noaddpart -deffnm md_10_20

tpbconv_4.0.7_s -s ./0_10ns/md_0_10.tpr -o md_20_30.tpr -until 30000
mpirun -np 24 mdrun_4.0.7_gcc_mpi -cpi ./10_20ns/md_10_20.cpt -noaddpart -deffnm md_20_30

What I have noticed is that the trajectory file that is written (.xtc) does not contain the last frame. This is not a big deal for the most part, since, for instance, the first frame of md_10_20.xtc contains the frame at 10 ns. The problem comes at the end of the run. All of my trajectories now show only 10000 frames (instead of 10001, since I saved every 10 ps) and end at 99990 ps instead of 100000 ps. The final .gro file and the .edr file are written correctly. It seems that only the .xtc file is affected. I can add back this last frame by converting the .gro file to .xtc and using trjcat, but I would think it preferable for mdrun to write the complete trajectory, since overlapping frames within the trajectory (i.e., at 10000 ps, 20000 ps, etc.) are overwritten by default by trjcat.


#1 Updated by Justin Lemkul almost 10 years ago

Minor addendum to the above comment: the md_10_20.xtc trajectory is intact, with 1001 frames and ending at 20 ns. It is md_20_30.xtc (and all subsequent trajectories) that is affected by the behavior I described before.

#2 Updated by Berk Hess almost 10 years ago

Are you sure your tpr's actually have an nsteps that runs up to exactly a multiple
of 10000 ps?
In Gromacs 4 (and all versions before) the time step is stored as a real,
thus as a float when compiled in single precision. This might give rounding errors
around the times you are using, leading to a few steps more or less than what you want.
If that's the case, you need to use -nsteps for tpbconv.
In 4.5 I changed all time variables to doubles, which gets rid of this nasty issue.
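
The rounding effect described in this comment can be illustrated with a small sketch. It is not the tpbconv source: the helper names, the truncating integer conversion, and the 0.002 ps time step (inferred from 10000000 steps covering 20000 ps) are all assumptions made for illustration. Single precision is emulated with Python's struct module.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def nsteps_single(until_ps, dt_ps):
    """Step count with the time step and the quotient held in single precision.

    Hypothetical helper: truncating with int() is an assumption, not
    necessarily what tpbconv 4.0 does internally.
    """
    dt = to_float32(dt_ps)
    quotient = to_float32(to_float32(until_ps) / dt)
    return int(quotient)


def nsteps_double(until_ps, dt_ps):
    """Same calculation with everything kept in double precision."""
    return int(until_ps / dt_ps)


if __name__ == "__main__":
    dt = 0.002  # ps, assumed time step
    for until in (20000.0, 30000.0):
        print(until, nsteps_single(until, dt), nsteps_double(until, dt))
```

Under these assumptions the single-precision path gives 10000000 steps for 20000 ps but 14999999 steps for 30000 ps, while the double-precision path gives 15000000. The stored single-precision value of 0.002 is slightly larger than 0.002, so the quotient falls just short of the intended step count and may or may not round back up depending on the target time.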


#3 Updated by Justin Lemkul almost 10 years ago

Thanks, Berk. That appears to be what's going on. The md_10_20.tpr file specifies 10000000 steps, but md_20_30.tpr specifies 14999999 instead of 15000000. I've yet to run extensive simulations with the 4.5 beta series, but it sounds like you've already taken care of this.

#4 Updated by Berk Hess almost 10 years ago

This "bug" in 4.0 and all previous versions was fixed a long time ago
in git for the 4.5 release.

