Appended .edr files and NaN
I recently did a short simulation for 100 ps and extended it to 500 ps with
mdrun -cpi. The simulation ran perfectly fine, but when extracting energies from the .edr file, the RMSD values are all reported as
nan. If one runs
g_energy -b 100 or
g_energy -e 100, proper output is produced, indicating to me that the point at which the files was appended is the issue.
gmxcheck reports nothing odd about the .edr or .trr files from the run, so I suspect something funny with the appending mechanism. Example .edr file attached.
echo 6 0 | g_energy -f npt.edr -o energy.xvg ... Energy Average Err.Est. RMSD Tot-Drift ------------------------------------------------------------------------------- Potential -20287.3 17 -nan 77.0694 (kJ/mol)
echo 6 0 | g_energy -f npt.edr -o energy.xvg -b 100 ... Energy Average Err.Est. RMSD Tot-Drift ------------------------------------------------------------------------------- Potential -20275.9 12 145.41 35.5541 (kJ/mol)
Fix g_energy average/RMSD bug
Made g_energy produce correct output for energy files from continued
and appended runs with nstcalcenergy=nstenergy. In that case a count
in the energy file is/was incorrect, but that entry is actually not
necessary for determining the average and RMSD of energy terms. The
RMSD would be NaN and the average would be off in the last decimal.
Fix binary exact continuation for trajectories
The initial call to compute_globals() after
continuation would remove COM motion, which meant
trajectories would not be exactly (binary)
identical to a single trajectory - it was likely
introduced with the velocity verlet code.
Fix edr appending and exact continuation
The nsteps field was not written to checkpoint files when
nstcalcenergy=nstenergy. This caused differences in nsteps
in appended energy files, which in turn caused issues in averages
and RMSD in g_energy (which is now fixed by another patch).
Also added the nsteps_sim field to the checkpoint file for
#4 Updated by David van der Spoel over 3 years ago
- File redmine1342.tgz redmine1342.tgz added
- Category changed from analysis tools to mdrun
- Assignee changed from David van der Spoel to Berk Hess
The problem is in mdrun which replaces the last frame in the edr file with incorrect nsteps. The logic is apparently such that when the energy file is rewritten, the last frame is replaced by the new energies - even though they are the same. A solution would be to skip writing the first frames to energy files in case we're appending.
The problem can be reproduced using the tarball as follows:
gmxdump -e ener.edr > original
tpbconv -s topol.tpr -until 0.5 -o cont
mdrun -s cont -cpi state
gmxdump -e ener.edr > broken
And comparing the results in original and broken:
% grep nsteps broken
nsteps: 1 <<< should be 10 as well
Now the question is whether it is sufficient to fix mdrun or whether we should make a workaround in g_energy too?
The problem is still present in master, so all continuations using tpbconv will be broken.
#5 Updated by Erik Lindahl over 3 years ago
- Priority changed from High to Low
- Target version changed from 4.6.x to 5.x
While I do see the problem, this appears to be the result of first altering the TPR file (creating a new one) with tpbconv, and then still appending to the previous run as if it was a straight continuation.
The append functionality is primarily intended to be able to continue exactly the same simulation from exactly the same file, and we have never tested it to be a kitchen-sink to replace all post-simulation concatenation tools :-) In particular, there are a ton of things that can go wrong (like this one), so my main concern is that it works 100% for the vanilla case; it might be a good idea to simply disallow appending if the TPR file is not identical to avoid bugs.
Unless a volunteer steps up, I suspect this isn't going to make 5.0.
#6 Updated by David van der Spoel over 3 years ago
- Priority changed from Low to Normal
Nevertheless, the difference between the original and the continuation file is just the number of steps
% gmx check -s1 topol.tpr -s2 cont.tpr
inputrec->nsteps (50 - 250)
The other alternative to extending a simulation is
% gmx mdrun -nsteps 250 -cpi state -s topol
It has the same problem.
Finally, what the appending mechanism was made for, to extend a simulation that was stopped (e.g. by queueing system) ALSO has this problem. In other words any appended simulation has this error in the energy file.
#8 Updated by Roland Schulz over 3 years ago
What is the reason nsteps is written to the edr file? According to the t_enxframe documentation it is "The number of steps between frames". That is a quantity which is has to be wrong as soon as you append or concatenate. Why is that quantity written at all and not computed when reading the file?
#9 Updated by David van der Spoel over 3 years ago
In principle nsteps could be used for exact calculation of averages and variances. The issue is that (I think) mdrun replace the last frame of the energy file with one from the new run, however when writing it has then only done one step rather than what is was before. So in fact the continuation run should start at step 1 rather than 0. This may have other issues though.
The workaround in gmx energy will take care of this, but it mdrun needs to be fixed
#11 Updated by Erik Lindahl over 2 years ago
- Status changed from New to In Progress
- Target version changed from 5.x to 4.6.8
The continuation code is delicate, not because binary continuation itself is difficult, but because we have added way too many different ways to continue trajectories - most of these will have to go, or we will keep re-introducing bugs in working code when altering those options.
The above patch only does a first step and restores binary exact continuation for trajectories and all the energy data itself (it had been screwed up by adding a call to center-of-mass removal at the first continuation step).
The nsteps data is correctly written to the energy history (with nsteps==9 for the last checkpoint), but for some reason this is then reset after reading the checkpoint again. Just for the future: It is critical for the checkpointing to work this way and re-write the last energy frame, so this must NOT be changed. The most common setup is that we use checkpoints when things crashed, and then there might not have been an energy frame written at the same step as the checkpoint, or the crash might even have occurred while writing the energy file.
#12 Updated by Berk Hess over 2 years ago
This is an issue with the way the fluctuation formula is updated at continuation. This is done incorrectly. I looked at this some years ago, but couldn't quickly see what was wrong. This only affects the fluctuations.
I think this should be a one line fix in energy history setup at continuation. But someone needs to go through the math and the variables in the code.
#16 Updated by Mark Abraham over 2 years ago
As noted at https://gerrit.gromacs.org/#/c/4761/:
I ran some tests comparing release-4-6 HEAD to patch set 2 of https://gerrit.gromacs.org/#/c/4765/2 (output at /nethome/mark/redmines/redmine-1342/test-continuation-r46) with mdrun -reprod, even with fftpack. Both LF and VV write different virials to the .edr file at step 20 depending on whether they stop at step 20 or continue within the same run to 40. So questions of appending and continuations have not yet arisen.
In a continuation from step 20, LF then arrives identically at step 40, but the resulting concatenated .edr file differs at step 20 as above. Haven't looked yet to see whether this is a COMM or constraints issue. Formally this commit message seems correct for LF, but I think users might not believe us...
In a continuation from step 20, VV diverges from step 20.
Tarball of grompp inputs attached
#18 Updated by Mark Abraham over 2 years ago
Erik Lindahl wrote:
Mark, was the final issue you observed fixed by https://gerrit.gromacs.org/#/c/4778/ (which is now merged)?
I expect so, will try it out.
My observations of divergence were either wrong or confounded. eneconv concatenation can't work with VV because of backward compatibility hacks for GROMACS 3.x writing frames with init-step 0, or something like that, and LF writes duplicate frames whereas VV does not... this stuff is ridiculous!) Fortunately re-writing .edr output to go to TNG lets us get rid of this crap.