Project

General

Profile

Bug #2318

Exact checkpoint restarts are not exact in 2018 beta1

Added by Erik Lindahl about 2 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
High
Assignee:
Category:
mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized
Close

Description

Commit 76920a4f breaks binary exact restarts, which in turn prevents debugging of other patches.

I presume this is because the dynamic pruning keeps a state across the checkpointing that is not saved. To reproduce, make two copies of the attacted tpr, use the -reprod flag, run one of them to completion (2000 steps) and interrupt/continue the other one. Prior to 76920a4f, gmx check will confirm the contents is identical, but not after.

There are three obvious options:

1) Reset the pruning so no state is kept at checkpoint steps
2) Save the state to the checkpoint
3) Disable dynamic pruning on CPUs

Hopefully (1) should not be too difficult, since I assume (2) is expensive.

However... it is quite embarrassing that we have now broken exact restarts in every single beta the last ~3 versions. This really shows the importance of testing checkpointing restarts for binary identity ANY time we do anything related either to any state or checkpoint I/O.

run.tpr (354 KB) run.tpr Erik Lindahl, 12/01/2017 08:22 PM
short.tpr (10.4 MB) short.tpr Erik Lindahl, 12/02/2017 11:21 AM
short.log (24.7 KB) short.log Erik Lindahl, 12/02/2017 11:22 AM

Associated revisions

Revision 50a7265f (diff)
Added by Berk Hess about 2 years ago

Only stop at nstlist steps with -reprod

Stopping mdrun with two INT or TERM signals would always happen right
after the first global communication step. But this breaks exact
continuation. Now with mdrun -reprod a second signal will still stop
at a pair-list generation step, like with the first signal, so we can
still have exact continuation.

Refs #2318

Change-Id: If65c1215d2509d60c1c5a6444769e7809288e798

Revision 700d6f38 (diff)
Added by Berk Hess about 2 years ago

Fix DD exact continuation bug

With domain decomposition the local atom density, used for setting
the search grid for sorting particles, was based on the local atom
count including atoms/charge groups that would be moved to
neighboring cells. This lead do a different density value, which in turn
could result in a different number of search grid cells and thus
a different summation order during a run versus when continuing a run
from checkpoint, when no atoms would be moved. Now exact continuation
is guaranteed for the domdec module with the mdrun -reprod option.

Refs #2318

Change-Id: I78452c7dfcf3ca6bdee63ece3795efc7e4ac467f

History

#1 Updated by Berk Hess about 2 years ago

How do you conclude this is the dynamic pruning and change 76920a4f? Checkpointing happens at search steps, do dynamic pruning should not affect exact continuation.
I get exact continuation with a single thread-mpi rank, -reprod and no GPU. I do not get exact continuation when changing to -ntmpi 6. I also don't get reproducibility with c5923ece, one change before 76920a4f. I don't know why we don't have reproducibility with DD, but it doesn't seem to be related with the dynamic pruning.
I we think exact reproducibility is important, we should have a regression test for that.

#2 Updated by Berk Hess about 2 years ago

For a water box I get exact continuation with c5923ece with RF, but not with PME. This indicates the issue might be in PME, in which case FFTW (optimization) is the likely culprit.

#3 Updated by Berk Hess about 2 years ago

Now I do seem to have a run with RF (and DD) with reproducibility before dynamic pruning and not after. I'll investigate.
Note that there still seems to be a FFTW reproducibility issue.

#4 Updated by Berk Hess about 2 years ago

I also get non-exact continuation with the release-2016 branch on a water box with RF and -ntmpi 6. So it already is broken there. The dynamic pruning change only seems to increase the chance of differences occurring.

#5 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '1' for Issue #2318.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2018~I78452c7dfcf3ca6bdee63ece3795efc7e4ac467f
Gerrit URL: https://gerrit.gromacs.org/7266

#6 Updated by Berk Hess about 2 years ago

  • Subject changed from Dynamic pruning breaks exact checkpoint restarts to Domain decomposition breaks exact checkpoint restarts
  • Category set to mdrun
  • Status changed from New to Fix uploaded
  • Affected version - extra info set to probably any version with DD
  • Affected version changed from 2018 to 2016

This seems to be a general issue in the domain decomposition that has always been present. Because dynamic pruning increases the cut-off, it is somewhat more likely to trigger the issue.
I uploaded a fix to release-2018, but we could also consider this for release-2016.

#7 Updated by Erik Lindahl about 2 years ago

Thanks Berk - sorry, I had to go offline to prepare a PRACE scientific case presentation yesterday evening. My only reason for blaming dynamic pruning is that it bisected to that change, and the comment didn't include anything else :-)

It would indeed be good to test continuation more, but to do it properly we should interrupt a simulation rather than quit gracefully, and I'm not sure if that is possible in the present unit tests. There is also the issue that we have a ton of different setups that should be tested. For the future, I think the only way to handle that is to require that all data structures are classes where we can test invariance by saving/restoring the state and testing the class for identity.

#8 Updated by Erik Lindahl about 2 years ago

I still don't get binary reproducibility even for a single thread when testing on dev-purley02. I've uploaded a tpr for a shorter run, just so you have exactly the same file as I tried, and the complation settings are in the corresponding log.

Steps to repeat:

  1. Copy short.tpr to cont.tpr
  2. Run one gmx mdrun -v -deffnm short -nt 1 -reprod // let this finish
  3. gmx mdrun -v -deffnm cont -nt 1 -reprod -cpi cont // interrupt and restart a few times
  4. gmx check -f short.trr -f2 cont.trr

#9 Updated by Erik Lindahl about 2 years ago

Update: Did more testing with single-threaded runs, and it appears earlier changes are affected there too. Will do a new bisect, but for now: Don't focus on 76920a4f.

#10 Updated by Erik Lindahl about 2 years ago

  • Subject changed from Domain decomposition breaks exact checkpoint restarts to Dynamic pruning change breaks exact checkpoint restarts for non-parallel runs
  • Affected version - extra info deleted (probably any version with DD)

There seem to be several (separate?) issues.

I did the original bisect on my Mac, and that was correct, but I should have clarified that it was intentionally single-threaded to avoid MPI issues.

I just reproduced this on dev-purley02, using gcc-7.1 and compiling with "cmake -DGMX_MPI=OFF -DGMX_OPENMP=OFF -DGMX_FFT_LIBRARY=FFTPACK". The change just before the dynamic pruning works perfectly fine and produces binary exact checkpoint restarts.

This breaks with the dynamic pruning change (76920a4f), and should not have anything to do with domain decomposition since there is no parallelism whatsoever involved.

Let's focus on that first, and then worry about the reproducibility issues caused by MPI/OpenMP/FFTW.

#11 Updated by Erik Lindahl about 2 years ago

For earlier changes running single-threaded, the issue seems to be FFTW. I'll try to look into that tonight!

#12 Updated by Berk Hess about 2 years ago

I fixed #2016, which was vsite related. I think that bug should not have affected reproducibility, but I'm not 100% sure.

I think we only need to guarantee exact continuation with -reprod with clean termination. But we now only ever terminate at nstlist steps, IIRC. So we should always get exact continuation (unless there are bugs, of course).

#13 Updated by Erik Lindahl about 2 years ago

  • Status changed from Fix uploaded to In Progress

Again: This whole bug report was NOT related to domain decomposition or any parallel run, so let's leave that aside for now.

The whole point of binary exact continuation was to enable restarts no matter how the run was interrupted. Later we have extended things so we try and write an extra checkpoint if we catch a signal, but that is merely a convenience feature, not a reason why extra states thingies can be kept - any checkpoint continuation should be exact with -reprod, unless there are fundamental algorithm reasons that make it impossible.

Have you been able to confirm the bug appearing with this patch for the example tpr file when only using a single thread and FFTPACK?

#14 Updated by Berk Hess about 2 years ago

I tried your recipe on Haswell with the vsite fix, FFTW and a single thread and there I can't reproduce it. Later today I will try to use exactly your setup on the purley machine.

#15 Updated by Berk Hess about 2 years ago

Tried on purley01 no wtih fftpack, gcc7.1, but with tMPI and OpenMP (which shouldn't matter with -nt 1), interrupted the run 6 times, but always get identical results.
But what do you do to interrupt the run?
I press crtl-C once, which makes the run stop at an nstlist step. If you press ctrl-C twice, it might stop at a non-nstlist step and you will then not get exact continuation.

#16 Updated by Berk Hess about 2 years ago

I just realized we should disable quitting and checkpointing before the next nstlist step (when pressing ctrl-C twice) with -reprod.

#17 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '1' for Issue #2318.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2018~If65c1215d2509d60c1c5a6444769e7809288e798
Gerrit URL: https://gerrit.gromacs.org/7270

#18 Updated by Berk Hess about 2 years ago

Erik, is this still an issue, or was it the double signal issue?

#19 Updated by Berk Hess about 2 years ago

  • Subject changed from Dynamic pruning change breaks exact checkpoint restarts for non-parallel runs to Exact checkpoint restarts are not exact in 2018 beta1
  • Status changed from In Progress to Resolved

#20 Updated by Mark Abraham about 2 years ago

  • Target version changed from 2018 to 2018-beta2

#21 Updated by Erik Lindahl about 2 years ago

  • Status changed from Resolved to Closed

Berk & I already talked offline, but just for the record: The cause here was likely the much longer neighbor list intervals introduced with the dynamic pruning patch (or something else enabled there), which made exact restarts fail. It is now fixed in release-2018.

It is not entirely trivial to introduce a unit test for this, since we would need signals or process interruption, but one way we could do it is to enable a debug-build only feature where we check for specific environment variables, and just call exit/abort at a specific step to emulate a crash or signal being caught.

Also available in: Atom PDF