Project

General

Profile

Bug #1342

Appended .edr files and NaN

Added by Justin Lemkul about 4 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Normal
Category:
mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized
Close

Description

I recently did a short simulation for 100 ps and extended it to 500 ps with tpbconv and mdrun -cpi. The simulation ran perfectly fine, but when extracting energies from the .edr file, the RMSD values are all reported as nan. If one runs g_energy -b 100 or g_energy -e 100, proper output is produced, indicating to me that the point at which the files was appended is the issue. gmxcheck reports nothing odd about the .edr or .trr files from the run, so I suspect something funny with the appending mechanism. Example .edr file attached.

echo 6 0 | g_energy -f npt.edr -o energy.xvg
...
Energy                      Average   Err.Est.       RMSD  Tot-Drift
-------------------------------------------------------------------------------
Potential                  -20287.3         17       -nan    77.0694  (kJ/mol)
echo 6 0 | g_energy -f npt.edr -o energy.xvg -b 100
...
Energy                      Average   Err.Est.       RMSD  Tot-Drift
-------------------------------------------------------------------------------
Potential                  -20275.9         12     145.41    35.5541  (kJ/mol)
npt.edr (587 KB) npt.edr Justin Lemkul, 09/23/2013 07:35 PM
redmine1342.tgz (18.9 KB) redmine1342.tgz David van der Spoel, 06/19/2014 12:42 PM
repro-case.tgz (1.14 MB) repro-case.tgz Mark Abraham, 06/24/2015 11:56 AM

Associated revisions

Revision 83283df8 (diff)
Added by Berk Hess over 2 years ago

Fix g_energy average/RMSD bug

Made g_energy produce correct output for energy files from continued
and appended runs with nstcalcenergy=nstenergy. In that case a count
in the energy file is/was incorrect, but that entry is actually not
necessary for determining the average and RMSD of energy terms. The
RMSD would be NaN and the average would be off in the last decimal.

Fixes #1342.

Change-Id: I82007bfe508023e1c1e17366e10f76bc4470d238

Revision c623d326 (diff)
Added by Erik Lindahl over 2 years ago

Fix binary exact continuation for trajectories

The initial call to compute_globals() after
continuation would remove COM motion, which meant
trajectories would not be exactly (binary)
identical to a single trajectory - it was likely
introduced with the velocity verlet code.

Refs #1342.

Change-Id: I061648774c7625810a8b693c631c61ae377ba297

Revision 3abd8dfd (diff)
Added by Erik Lindahl over 2 years ago

Fix edr appending and exact continuation

The nsteps field was not written to checkpoint files when
nstcalcenergy=nstenergy. This caused differences in nsteps
in appended energy files, which in turn caused issues in averages
and RMSD in g_energy (which is now fixed by another patch).
Also added the nsteps_sim field to the checkpoint file for
consistency.

Fixes #1342.

Change-Id: Iff8bf51aaa307a379f0e7cdb7d76d9bafb13cf13

History

#1 Updated by David van der Spoel about 4 years ago

For some reason all the internal variables are not updated and hence for all time points we have:

time:   4.99800e+02         step:        249900
nsteps: 100
delta_t: 2.00000e-03 sum steps: 0

note: nsteps = 100, sum steps = 0.

Weird.

#2 Updated by Mark Abraham about 4 years ago

  • Target version changed from 4.6.4 to 4.6.x

#3 Updated by Rossen Apostolov over 3 years ago

  • Priority changed from Normal to High

#4 Updated by David van der Spoel over 3 years ago

The problem is in mdrun which replaces the last frame in the edr file with incorrect nsteps. The logic is apparently such that when the energy file is rewritten, the last frame is replaced by the new energies - even though they are the same. A solution would be to skip writing the first frames to energy files in case we're appending.

The problem can be reproduced using the tarball as follows:

grompp
mdrun
gmxdump -e ener.edr > original
tpbconv -s topol.tpr -until 0.5 -o cont
mdrun -s cont -cpi state
gmxdump -e ener.edr > broken

And comparing the results in original and broken:
% grep nsteps broken
nsteps: 1
nsteps: 10
nsteps: 10
nsteps: 10
nsteps: 10
nsteps: 1 <<< should be 10 as well
nsteps: 10

Now the question is whether it is sufficient to fix mdrun or whether we should make a workaround in g_energy too?
The problem is still present in master, so all continuations using tpbconv will be broken.

#5 Updated by Erik Lindahl over 3 years ago

  • Priority changed from High to Low
  • Target version changed from 4.6.x to 5.x

While I do see the problem, this appears to be the result of first altering the TPR file (creating a new one) with tpbconv, and then still appending to the previous run as if it was a straight continuation.

The append functionality is primarily intended to be able to continue exactly the same simulation from exactly the same file, and we have never tested it to be a kitchen-sink to replace all post-simulation concatenation tools :-) In particular, there are a ton of things that can go wrong (like this one), so my main concern is that it works 100% for the vanilla case; it might be a good idea to simply disallow appending if the TPR file is not identical to avoid bugs.

Unless a volunteer steps up, I suspect this isn't going to make 5.0.

#6 Updated by David van der Spoel over 3 years ago

  • Priority changed from Low to Normal

Nevertheless, the difference between the original and the continuation file is just the number of steps

% gmx check -s1 topol.tpr -s2 cont.tpr
inputrec->nsteps (50 - 250)

The other alternative to extending a simulation is

% gmx mdrun -nsteps 250 -cpi state -s topol

It has the same problem.

Finally, what the appending mechanism was made for, to extend a simulation that was stopped (e.g. by queueing system) ALSO has this problem. In other words any appended simulation has this error in the energy file.

#7 Updated by David van der Spoel over 3 years ago

  • Assignee changed from Berk Hess to David van der Spoel

#8 Updated by Roland Schulz over 3 years ago

What is the reason nsteps is written to the edr file? According to the t_enxframe documentation it is "The number of steps between frames". That is a quantity which is has to be wrong as soon as you append or concatenate. Why is that quantity written at all and not computed when reading the file?

#9 Updated by David van der Spoel over 3 years ago

In principle nsteps could be used for exact calculation of averages and variances. The issue is that (I think) mdrun replace the last frame of the energy file with one from the new run, however when writing it has then only done one step rather than what is was before. So in fact the continuation run should start at step 1 rather than 0. This may have other issues though.

The workaround in gmx energy will take care of this, but it mdrun needs to be fixed
too.

#10 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '1' for Issue #1342.
Uploader: Erik Lindahl ()
Change-Id: I061648774c7625810a8b693c631c61ae377ba297
Gerrit URL: https://gerrit.gromacs.org/4761

#11 Updated by Erik Lindahl over 2 years ago

  • Status changed from New to In Progress
  • Target version changed from 5.x to 4.6.8

The continuation code is delicate, not because binary continuation itself is difficult, but because we have added way too many different ways to continue trajectories - most of these will have to go, or we will keep re-introducing bugs in working code when altering those options.

The above patch only does a first step and restores binary exact continuation for trajectories and all the energy data itself (it had been screwed up by adding a call to center-of-mass removal at the first continuation step).

The nsteps data is correctly written to the energy history (with nsteps==9 for the last checkpoint), but for some reason this is then reset after reading the checkpoint again. Just for the future: It is critical for the checkpointing to work this way and re-write the last energy frame, so this must NOT be changed. The most common setup is that we use checkpoints when things crashed, and then there might not have been an energy frame written at the same step as the checkpoint, or the crash might even have occurred while writing the energy file.

#12 Updated by Berk Hess over 2 years ago

This is an issue with the way the fluctuation formula is updated at continuation. This is done incorrectly. I looked at this some years ago, but couldn't quickly see what was wrong. This only affects the fluctuations.
I think this should be a one line fix in energy history setup at continuation. But someone needs to go through the math and the variables in the code.

#13 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '1' for Issue #1342.
Uploader: Erik Lindahl ()
Change-Id: Iff8bf51aaa307a379f0e7cdb7d76d9bafb13cf13
Gerrit URL: https://gerrit.gromacs.org/4765

#14 Updated by Erik Lindahl over 2 years ago

  • Status changed from In Progress to Fix uploaded

#15 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '1' for Issue #1342.
Uploader: Berk Hess ()
Change-Id: I82007bfe508023e1c1e17366e10f76bc4470d238
Gerrit URL: https://gerrit.gromacs.org/4767

#16 Updated by Mark Abraham over 2 years ago

As noted at https://gerrit.gromacs.org/#/c/4761/:

I ran some tests comparing release-4-6 HEAD to patch set 2 of https://gerrit.gromacs.org/#/c/4765/2 (output at /nethome/mark/redmines/redmine-1342/test-continuation-r46) with mdrun -reprod, even with fftpack. Both LF and VV write different virials to the .edr file at step 20 depending on whether they stop at step 20 or continue within the same run to 40. So questions of appending and continuations have not yet arisen.

In a continuation from step 20, LF then arrives identically at step 40, but the resulting concatenated .edr file differs at step 20 as above. Haven't looked yet to see whether this is a COMM or constraints issue. Formally this commit message seems correct for LF, but I think users might not believe us...

In a continuation from step 20, VV diverges from step 20.

Tarball of grompp inputs attached

#17 Updated by Erik Lindahl over 2 years ago

Mark, was the final issue you observed fixed by https://gerrit.gromacs.org/#/c/4778/ (which is now merged)?

#18 Updated by Mark Abraham over 2 years ago

Erik Lindahl wrote:

Mark, was the final issue you observed fixed by https://gerrit.gromacs.org/#/c/4778/ (which is now merged)?

I expect so, will try it out.

My observations of divergence were either wrong or confounded. eneconv concatenation can't work with VV because of backward compatibility hacks for GROMACS 3.x writing frames with init-step 0, or something like that, and LF writes duplicate frames whereas VV does not... this stuff is ridiculous!) Fortunately re-writing .edr output to go to TNG lets us get rid of this crap.

#19 Updated by Berk Hess over 2 years ago

  • Status changed from Fix uploaded to Resolved
  • % Done changed from 0 to 100

#20 Updated by David van der Spoel about 2 years ago

  • Status changed from Resolved to Closed

Also available in: Atom PDF