Project

General

Profile

Bug #1883

many velocity-verlet combinations do not produce exact restarts

Added by Mark Abraham almost 4 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
mdrun
Target version:
-
Affected version - extra info:
4.6 through to master HEAD
Affected version:
Difficulty:
uncategorized
Close

Description

NVE and Nose-Hoover+MTTK work correctly with an argon box, but all other combinations have some issue. Often the restart step + 1 has KE that is clearly wrong. It seems there are issues with trying to save a KE quantity to the checkpoint, restart, and then do some (unnecessary?) computation before we enter the loop, then skip the first velocity update, then continue normally. It ought to be straightforward to compare two runs in a debugger, but I can't do that right now.

https://gerrit.gromacs.org/#/c/5472 was created to probe this question, and this issue has resulted.

Attached is a tarball that permits reproduction at least as far back as release-4-6 HEAD. Haven't tried 4.5, but my suspicion is that the issue has been there since inception.

testing-continuations.tgz (10.6 MB) testing-continuations.tgz Mark Abraham, 12/24/2015 10:30 AM

Associated revisions

Revision bc0538dc (diff)
Added by Michael Shirts almost 4 years ago

Fixes two problems related to restarts for vv.

The first problem is more serious; in addition to causing problems
with restarts in most cases for vv + either berendsen or v-rescale,
temperature coupling was called twice, which makes the distribution of
kinetic energies too broad (but with the correct average). Affected
vv + either v-rescale berendsen T-controls, other choices were unaffected.

In the second problem, the initial step after restarts with vv +
v-rescale and berendsen temperature control had too high a pressure
because an empty virial matrix that was only filled with MTTK pressure
control. The effects of this bug are very small; it only affected the
volume integration for one step on restarts.

Refs #1883

Change-Id: I8d807c8430bf4c18b0a4800af62d7f10839b7d59

Revision f5061cec (diff)
Added by Mark Abraham almost 4 years ago

Update md-vv tests

This bug fix requires transformAtoB be updated, so also updated the
other test cases that use vv and either v-rescale or Berendsen, in the
interests of ensuring they remain stable.

Refs #1883

Change-Id: I2f57ff35f7790e40131e82bf63359119362a4e25

Revision 849aee34 (diff)
Added by Mark Abraham almost 4 years ago

Backport md-vv fixes

This corrects non-Trotter md-vv temperature-coupling behaviour, and
fixes the first step of md-vv pressure-coupling volume integration
after restarts.

Refs #1883

Change-Id: I8d807c8430bf4c18b0a4800af62d7f10839b7d59

Revision df7f2716 (diff)
Added by Mark Abraham almost 4 years ago

Backport updates for md-vv fixes

Refs #1883

Change-Id: I2f57ff35f7790e40131e82bf63359119362a4e25

Revision efa13a69 (diff)
Added by Mark Abraham about 2 months ago

Add integration tests for exact restarts

These tests demonstrates the extent to which mdrun checkpoint restarts
reproduce the same run that would have taken place without the
restart.

I've been working on these, and the bugs they exposed, for a few
years, but the code has been fixed for a few years now.

The tests don't run with OpenCL because they have caused driver out of
memory issues.

Refs #1137, #1793, #1882, #1883

Change-Id: I8bc441d945f13158bbe10f097e772ea87cc6a559

History

#1 Updated by Michael Shirts almost 4 years ago

OK, I think I found the difference in the restart vs. nonrestart. When restarting from checkpoint, it sets bInitStep = True, even though it's not step=0. This may or may not be correct behavior, but will make a difference between two runs when executing the code (around line 1350 in commit 607b74bfd7, can be found by searching for the comment otherwise).

/* at the start of step, randomize or scale the velocities (trotter done elsewhere) */
if (EI_VV(ir->eI)) {
if (!bInitStep) {
update_tcouple(step, ir, state, ekind, &MassQ, mdatoms);
}

Clearly, this will not be executed during a checkpoint restart, and will be executed otherwise.

Looking further, it appears that on non-restarted steps, md-vv is calling v-rescale twice (here, and in the same place that leapfrog does), which seems problematic, it's certainly not intended. I'm not sure if applying it twice gives incorrect results, or just a different effective tau.

I'll have to think a little about the correct fix. It likely should be implemented immediately after Mark's kinetic energy integrator cleanup.

#2 Updated by Michael Shirts almost 4 years ago

Michael Shirts wrote:

Issues with Berendsen restart seem to come from the fact that it's using the previous step's pressure, which with md-vv, is not correctly defined at step restart+1, since at the restart step, the pressure is calculated with force_vir=0 -- strangely, although it correctly calculates the force vir, when restarting from checkpoint, it uploads and incorrectly save state->fvir_prev. The solution appears to be replacing the line before copying svir_prev over to shake_vir

if (startingFromCheckpoint && (ir->eI=ieVV))

with

if (startingFromCheckpoint && bTrotter)

Technically, md-vv should be able to use the CURRENT pressure, rather than the pressure from one step back. Since berendsen isn't correct anyway, this is likely to be not important enough to address now (it's should be the same amount wrong that berendsen with md is), and should instead be addressed in the main framework.

I will check this code on top of the clean kinetic energy code.

#3 Updated by Michael Shirts almost 4 years ago

Looking further, it appears that on non-restarted steps, md-vv is calling v-rescale twice (here, and in the same place that leapfrog does), which seems problematic, it's certainly not intended. I'm not sure if applying it twice gives incorrect results, or just a different effective tau.

I'll have to think a little about the correct fix. It likely should be implemented immediately after Mark's kinetic energy integrator cleanup.

It appears the correct fix end up simply commenting out the first tcouple. That addresses both the restart issue and the double call.

Incidentally, the double call to v-rescale with md-vv does indeed appear to be somewhat insidious. Running some automated check-ensemble validation, it appears to make the distribution too broad (the average is correct). My checkensemble paper was run on a beta version of 4.5 (I should have included the hash for that commit in the paper so I knew for sure!) and v-rescale was fine; however, the bug appears to have crept in soon after that (4.6.7 appears to be incorrect).

This strongly suggests using checkensemble runs as part of the automated process. With a box of argon, this was easily detectable with a 30 second test.

#4 Updated by Michael Shirts almost 4 years ago

One more bug that probably isn't worth dealing with now -- in all combinations md-vv is not printing out the first restart step to the edr data structure. I seem to recall a discussion of this before; it ends up being a real pain to get this right because of the bookkeeping. All the points match -- it's simply an output quirk. Something, again, to fix with the new integrator structure.

#5 Updated by Mark Abraham almost 4 years ago

Those code fixes seem useful. Please upload when/if ready.

I agree we want some checkensemble in the testing mix. I plan to deploy it after I get some Docker toys running, since such testing requires a release-mode build that has saved the artefacts.

NB I'm not sure when we introduced the behaviour where the log file reports the git hash, but you may get some joy from that in your pre-4.5 log files.

If I get bored on my flight, I'll poke the history to see when the double call to update_tcouple appeared, in case something is instructive.

Yes, I'm working around that .edr quirk in my new infrastructure. There's a lot worse quirks than that around!

#6 Updated by Gerrit Code Review Bot almost 4 years ago

Gerrit received a related patchset '1' for Issue #1883.
Uploader: Michael Shirts ()
Change-Id: I8d807c8430bf4c18b0a4800af62d7f10839b7d59
Gerrit URL: https://gerrit.gromacs.org/5506

#7 Updated by Gerrit Code Review Bot almost 4 years ago

Gerrit received a related patchset '1' for Issue #1883.
Uploader: Mark Abraham ()
Change-Id: I2f57ff35f7790e40131e82bf63359119362a4e25
Gerrit URL: https://gerrit.gromacs.org/5520

#8 Updated by Mark Abraham almost 4 years ago

  • Status changed from New to In Progress
  • Target version set to 5.1.2

Shortly, these fixes will have been applied to all active branches. Am still working on testing coverage, not sure yet if I'm happy with the state of things.

#9 Updated by Mark Abraham over 3 years ago

  • Status changed from In Progress to Resolved
  • Target version deleted (5.1.2)

Things have improved a lot, but there's no known issue to target (for the moment).

#10 Updated by Erik Lindahl over 3 years ago

  • Status changed from Resolved to Closed

Closed, since there is no known issue to target anymore.

Also available in: Atom PDF