Bug #306

Discrepancies between 3.3.X and 4.0.X with energy/temperature drifts and premature termination of MD runs in 4.0.X

Added by Pietro Amodeo about 11 years ago. Updated about 11 years ago.

Erik Lindahl


Gromacs 4.0.3 and 4.0.4 fail to reproduce MD equilibration runs on two different systems (proteins solvated in SPC water + counterions) on which 3.3.1 or 3.3.3 succeeded. The 3.3.3 build was compiled on the same cluster as 4.0.X, using the Intel compiler, double precision and OpenMPI parallel support.

On one system (System1) all 4.0.X simulations terminated after about 2,000 steps, showing large fluctuations in temperature and a drift in total energy until the system exploded (errors in LINCS or 1-4 energy calculation).
On the other system (System2) the simulation terminated without errors, but exhibited temperature fluctuations and an energy drift similar to those observed in System1, only less intense. Neither occurred in the corresponding runs performed with 3.3.1 or 3.3.3 on either system.
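
For anyone comparing the attached logs, the size of the energy drift can be put into a number by fitting a straight line to total energy against time. The snippet below is only a minimal sketch of that idea: the function name `linear_drift` is hypothetical, the sample values are made up for illustration and are NOT taken from the attached log files, and the energies are assumed to have been extracted into plain lists beforehand.

```python
def linear_drift(times, values):
    """Least-squares slope of values against times (drift per unit time)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need two or more points and equal-length inputs")
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    # Slope = covariance(t, v) / variance(t)
    num = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    den = sum((t - t_mean) ** 2 for t in times)
    if den == 0:
        raise ValueError("all time values are identical")
    return num / den


# Illustrative numbers only (not from the attached logs).
times_ps = [0.0, 1.0, 2.0, 3.0, 4.0]
total_energy = [-1000.0, -998.0, -996.0, -994.0, -992.0]
print(f"drift = {linear_drift(times_ps, total_energy):.3f} energy units per ps")
```

Comparing the slope obtained from a 3.3.X log with that from the matching 4.0.X log would give a rough measure of how much the two runs differ.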

The same problems were observed with:
a) serial or parallel (OpenMPI with Infiniband support) versions of 4.0.X;
b) single- or double-precision;
c) GCC or Intel compilers;
d) old- (generated with Gromacs 3.3.1 for 8 cores) or new-version tpr topology files.

Only the exact number of time steps before termination changes among the different cases. Parallel calculations were run on 8 cores (corresponding to a single cluster node).

The cluster on which both the 4.0.X and 3.3.3 Gromacs versions were compiled and run consists of dual-CPU quad-core Opteron nodes with Infiniband connectivity, and has the following configuration:

Cluster: NEW(Infiniband)
Gromacs 4.0.4 / 4.0.3
(CentOS 5)
kernel 2.6.18-53.el5
/ gcc 4.1.2 20070626 (Red Hat 4.1.2-14)
\ icc 10.1 (Build 20070913 Pack.ID: l_cc_p_10.1.008)
fftw 3.2.1
ofed131 - openmpi 1.2.6

The attached compressed tar file xbug.tar.gz contains:

a) System1_newform.tpr
New (4.0.3) format tpr file for system1
b) System1_oldform_8cores.tpr
Old (3.3.1) format tpr file for system1 prepared for 8 cores
c) System2_oldform_8cores.tpr
Old (3.3.1) format tpr file for system2 prepared for 8 cores
d) System1_4.0.4_newtpr_Intel_double_serial.log
Log file of a 4.0.4 failed run on System1 (Intel,serial,double-prec)
e) System1_4.0.4_newtpr_Intel_double_serial_nohup.log
nohup.log file corresponding to d), contains warnings and errors
f) System2_3.3.3_oldtpr_Intel_double_parallel8cores.log
Log file of a 3.3.3 successful run on System2 (Intel,parallel,double-prec)
g) System2_4.0.4_oldtpr_Intel_double_parallel8cores.log
Log file of a 4.0.4 completed run on System2 (Intel,parallel,double-prec)
h) System1_4.0.4_newtpr_Intel_double_serial.cpt
Checkpoint file produced when 4.0.4 run corresponding to d) log crashed
i) System2_4.0.4_oldtpr_Intel_double_parallel8cores.cpt
Final checkpoint file for 4.0.4 run corresponding to g) log
j) System2_3.3.3_oldtpr_Intel_double_parallel8cores_final.pdb
Final pdb file for 3.3.3 run corresponding to f) log
xbug.tar.gz (18.8 MB): Input (tpr) and output (log and coordinates) files to reproduce/evaluate the bug. Pietro Amodeo, 03/16/2009 03:04 PM


#1 Updated by Pietro Amodeo about 11 years ago

Created an attachment (id=357)
Input (tpr) and output (log and coordinates) files to reproduce/evaluate the bug

#2 Updated by Pietro Amodeo about 11 years ago

Additional Notes:
1) Single-precision versions were configured with --enable-sse --enable-shared, double-precision ones with --enable-sse2 --enable-shared.

2) Starting energies do not exhibit substantial differences between 3.3.X and 4.0.X runs in all simulations/systems.

#3 Updated by Berk Hess about 11 years ago

My guess would be that this is not a bug, but the result of a bug fix in 4.0.
In 3.3 and older versions tau_p was scaled with the pressure factor, which is 16.6.
I have corrected this in 4.0, which is also described in the release notes:

So to get the same results in 4.0 as in 3.3, you should multiply
tau_p by 16.6: 0.5*16.6 = 8.3 ps.

Please try this and report back if this solved the problem or not.
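
The conversion described above can be written as a small helper. This is only a sketch: the function name `tau_p_for_4_0` is hypothetical, and the factor 16.6 and the example value 0.5 ps are the ones quoted in this comment.

```python
# Pressure factor as quoted in this comment.
PRESSURE_FACTOR = 16.6


def tau_p_for_4_0(tau_p_3_3_ps, factor=PRESSURE_FACTOR):
    """Return the tau_p (ps) to use in 4.0 to mimic a 3.3 run.

    3.3 and older effectively scaled tau_p by the pressure factor,
    so the 3.3 input value is multiplied by the same factor here.
    """
    if tau_p_3_3_ps <= 0:
        raise ValueError("tau_p must be positive")
    return tau_p_3_3_ps * factor


print(f"tau_p = {tau_p_for_4_0(0.5):.1f} ps")  # prints "tau_p = 8.3 ps"
```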


#4 Updated by Pietro Amodeo about 11 years ago

As correctly guessed by Berk, the instabilities in the 4.0.X trajectories derived from the tau_p value, which must be scaled to recover the 3.3.3 behaviour.

I'm sorry: although I read about the change in the release notes when 4.0 was released, I had forgotten it months later when resubmitting a 3.3.X run!

Maybe a warning in grompp or mdrun, or something like a "Major changes from previous release" section somewhere in the manuals, could help prevent such problems.

