Bug #300

continuation from the checkpoint is not binary identical

Added by Guang-Jun Guo over 10 years ago. Updated over 10 years ago.

Status: Closed
Priority: Normal
Assignee: Erik Lindahl
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Created an attachment (id=356)
compressed file on linux

I expected a continuation run restarted from a checkpoint file to be binary identical to an uninterrupted run, but this is not the case. The attachment contains my test runs. A 6-step run is performed first, and then it is repeated in two parts. The first part runs 3 steps from the same initial point, and the second part runs the next 3 steps, restarted from the checkpoint at the third step. I find that the trajectories at the 3rd, 4th, 5th, and 6th steps are not binary identical.

My computational environment is Gromacs 4.0.4, LAM/MPI 7.1.4, and Red Hat Linux on two dual-core AMD Opteron CPUs. The test system is 500 SPC water molecules in the NPT ensemble. During these tests, I turned off gen_vel, optimize_fft, and dynamic load balancing (dlb).

The related discussion on gmx-users was titled "About the binary identical continuation by restarting from the checkpoint file".

test-continue.tar.gz (1.1 MB), compressed file on linux, Guang-Jun Guo, 02/25/2009 10:40 AM

History

#1 Updated by Erik Lindahl over 10 years ago

I don't remember if we actually use the optimize_fft option internally anymore (that was important when it took a lot of time to optimize ffts in FFTW2). Did you use the -reprod option to mdrun?

Cheers,

Erik

#2 Updated by Berk Hess over 10 years ago

You have nstlist set to 10.
You cannot get exact continuation when you terminate a simulation
at a step that is not a neighbor-search step.
mdrun by itself will always terminate at a neighbor-search step.
Only setting nsteps to something not divisible by nstlist
or typing control-c will get you a checkpoint file at a non-ns step.

Also, there is a checkpointing bug with NPT serial runs in versions 4.0 - 4.0.3.
And I wrote to you earlier on the mailing list about fftw irreproducibility.

For me this continuation is exact with version 4.0.4.
I made a run of 20 steps, then:
mdrun -cpt 0
cp md.log orig.log
mdrun -cpi state_prev.cpt -append
diff orig.log md.log
This gives no differences in the energies.
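
A stricter check than diffing the energies in the logs is a byte-for-byte comparison of two output files. The following is only a minimal sketch of such a comparison: the helper name first_difference and the temporary file names are hypothetical, and nothing here is part of Gromacs. It reports the offset of the first differing byte, or None when the files are identical.

```python
import os
import tempfile


def first_difference(path_a, path_b, chunk_size=65536):
    """Return the offset of the first differing byte, or None if identical.

    Hypothetical helper for illustration; it is not a Gromacs tool.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Locate the first mismatching byte inside this chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # One chunk is a prefix of the other: the files differ in length.
                return offset + min(len(a), len(b))
            if not a:
                # Both files ended at the same point with no mismatch.
                return None
            offset += len(a)


# Demonstration on two small temporary files (stand-ins for real outputs).
with tempfile.TemporaryDirectory() as tmp:
    full = os.path.join(tmp, "full.bin")
    restarted = os.path.join(tmp, "restarted.bin")
    with open(full, "wb") as f:
        f.write(b"step3step4")
    with open(restarted, "wb") as f:
        f.write(b"step3stepX")
    print(first_difference(full, full))       # None
    print(first_difference(full, restarted))  # 9
```

For real runs, the two paths would be the output of the uninterrupted run and the corresponding output of the restarted run.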

I guess your problem was caused by nsteps not being a multiple of nstlist.
I'll have to think if mdrun could properly warn for this.
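
The kind of check such a warning would need can be sketched as follows. This is only an illustration, not mdrun code: the function name ends_on_ns_step is hypothetical, and it assumes nstlist is positive and that neighbor searching happens at every step that is a multiple of nstlist.

```python
def ends_on_ns_step(nsteps: int, nstlist: int) -> bool:
    """Return True if a run of nsteps steps ends on a neighbor-search step.

    Assumption for this sketch: neighbor search happens at steps
    0, nstlist, 2*nstlist, ... and nstlist is a positive integer.
    """
    if nstlist <= 0:
        raise ValueError("this sketch only handles nstlist > 0")
    return nsteps % nstlist == 0


# The cases discussed in this report: a 3-step first part with
# nstlist = 10, a 20-step run with nstlist = 10, and 3 steps
# with nstlist = 3 or 1.
for nsteps, nstlist in [(3, 10), (20, 10), (3, 3), (3, 1)]:
    if ends_on_ns_step(nsteps, nstlist):
        print(f"nsteps={nsteps}, nstlist={nstlist}: ends on an ns step")
    else:
        print(f"nsteps={nsteps}, nstlist={nstlist}: WARNING, "
              "final checkpoint is at a non-ns step")
```

Only the first case (3 steps with nstlist = 10) triggers the warning, which matches the setup in the original report.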

Berk

#3 Updated by Guang-Jun Guo over 10 years ago

Berk is right. When I set nstlist=1 or 3, the coordinates of all atoms become identical in my test. However, the velocities always show slight differences. These differences cause small differences in the coordinates of a few atoms from the 4th step onward (here, I extended the continuation from 3 steps to 10 steps). This is presumably due to the chaotic nature of MD. When redoing these runs, I recompiled Gromacs 4.0.4 according to Berk's suggestion and also used -reprod. Could this problem be attributed to my AMD Opteron CPUs? I once heard that floating-point calculation on AMD CPUs is not as good as on Intel CPUs. Is that the case?

Guang-Jun

#4 Updated by Berk Hess over 10 years ago

No, Intel or AMD does not matter.

But there is the other issue of fftw that I mailed about.
mdrun of Gromacs 4.0.4 with the option -reprod
should provide binary identical continuation.

Berk
