continuation from the checkpoint is not binary identical
Created an attachment (id=356)
compressed file on linux
I expected a continuation run restarted from the checkpoint file to be binary identical to the uninterrupted run, but that is not the case. The attachment contains my test runs. A 6-step run is performed first, and then it is repeated in two parts: the first part is 3 steps starting from the same initial point, and the second part is 3 steps restarted from the checkpoint written at the third step. I find that the trajectories of the 3rd, 4th, 5th, and 6th steps are not binary identical.
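A minimal sketch of one way to check whether two output files are binary identical, assuming both runs write the same frames to files that can be compared byte for byte. The function name first_difference and the file names in the usage note are hypothetical and not part of Gromacs.

```python
def first_difference(path_a, path_b, chunk_size=65536):
    """Return the byte offset of the first difference between two files.

    Returns None when the files are byte-for-byte identical.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Locate the first differing byte inside this chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No differing byte in the common part: one file is a
                # prefix of the other, so they differ where the shorter ends.
                return offset + min(len(a), len(b))
            if not a:
                # Both files ended at the same point with no difference.
                return None
            offset += len(a)
```

For example, first_difference("full_run.trr", "continued_run.trr") (hypothetical file names) would return None only if the two files match exactly. A byte offset alone does not say which step or atom differs; it only shows where the files start to diverge.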
My computational environment is Gromacs 4.0.4, LAM/MPI 7.1.4, Red Hat Linux, and two dual-core AMD Opteron CPUs; the system is 500 SPC water molecules in the NPT ensemble. For these tests I turned off gen_vel, optimize_fft, and dynamic load balancing (dlb).
The related discussion on gmx-users was titled "About the binary identical continuation by restarting from the checkpoint file".
#2 Updated by Berk Hess over 10 years ago
You have nstlist set to 10.
You cannot get exact continuation when you terminate a simulation
at a non-neighborlist step.
mdrun will by itself always terminate at a neighborsearch step.
Only setting nsteps to something not divisible by nstlist
or typing control-c will get you a checkpoint file at a non-ns step.
Also, there is a checkpointing bug with NPT serial runs in versions 4.0 - 4.0.3.
I also wrote to you earlier on the mailing list about fftw irreproducibility.
For me this continuation is exact with version 4.0.4.
I made a run of 20 steps, then:
mdrun -cpt 0
cp md.log orig.log
mdrun -cpi state_prev.cpt -append
diff orig.log md.log
This gives no differences in the energies.
I guess your problem was caused by nsteps not being a multiple of nstlist.
I'll have to think about whether mdrun could properly warn about this.
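The condition described above can be sketched as a small check. This is only an illustration, not mdrun code: the function names are hypothetical, and it assumes that neighbor searching happens on steps that are multiples of a positive nstlist, counted from step 0.

```python
def is_neighbor_search_step(step, nstlist):
    """Return True if `step` falls on a neighbor-search step.

    Assumption: neighbor search runs every `nstlist` steps starting at
    step 0, and `nstlist` is a positive integer.
    """
    if nstlist <= 0:
        raise ValueError("this sketch only handles nstlist > 0")
    return step % nstlist == 0


def run_ends_on_neighbor_search_step(nsteps, nstlist):
    """Return True if a run of `nsteps` steps ends on a neighbor-search step.

    Under the assumption above, this is the case in which a checkpoint
    written at the end of the run lands on a neighbor-search step.
    """
    return is_neighbor_search_step(nsteps, nstlist)


# The case in this report: a 3-step first part with nstlist = 10.
print(run_ends_on_neighbor_search_step(3, 10))   # False
# A 20-step run with nstlist = 10 ends on a neighbor-search step.
print(run_ends_on_neighbor_search_step(20, 10))  # True
```

With nstlist = 10, a first part of 3 steps ends between neighbor searches, which matches the case where exact continuation is not expected.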
#3 Updated by Guang-Jun Guo over 10 years ago
Berk is right. When I set nstlist=1 or 3, the coordinates of all atoms become identical in my test. However, the velocities always show slight differences. These differences cause small differences in the coordinates of a few atoms from the 4th step onward (here, I extended the continuation from 3 steps to 10 steps). This is probably due to the chaotic nature of MD. For these runs I recompiled Gromacs 4.0.4 according to Berk's suggestion and also used the -reprod option. Could this problem be attributed to my AMD Opteron CPUs? I once heard that floating-point arithmetic on AMD CPUs is not as good as on Intel CPUs. Is that the case?