Project

General

Profile

Bug #148

Output control parameters have an effect on energies

Added by Janne Hirvi over 12 years ago. Updated over 12 years ago.

Status:
Closed
Priority:
High
Assignee:
Erik Lindahl
Category:
mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized
Close

Description

Distinct momentary increases in the total energy are observed at intervals which
are dependent on output frequency of coordinates for example.

Double precision simulations were performed for crystalline surface consisting
of infinite polymer chains. Positions restraints were used for some atoms (160
of 46080 atoms) to maintain "original crystal shape" and option pbc=full due to
the infinite chains. After equilibration simulation in NVT ensemble with 2
processors, NVE simulations with 4 processors show dependence of the total
energy on output frequency.

Multiple simulations with different values of nstxout (nstxout=nstvout=nstlog =
10, 100 or 1000 and nstenergy = 1) give different kind of energy curves. Energy
values are equal at the beginning but deviated just after the time step for
which output is written to .trr or .log file. Total energy always increases
sharply at that moment and equilibrates to the corresponding value until next
sharp increase is observed when output is written to .trr or .log file next
time. Finally these increases in the total energy will lead to the
termination/explosion of the simulation.

Actually drift in total energy in simulation with 4 processors, even without
output to .trr or .log files, is already at the beginning of the simulation
(<1ps) larger than corresponding drift in 50ps simulation with 2 processors.

A little bit more about the problem can be found from discussion with Berk:
1. http://www.gromacs.org/pipermail/gmx-users/2007-May/027227.html
2. http://www.gromacs.org/pipermail/gmx-users/2007-May/027233.html

This is the first time I am using this and I suppose that I can add my files
later but if not I really would like to send the files which can be used to
reproduce this anomalous behavior.

Commands.txt (2.27 KB) Commands.txt Commands to reproduce the problem Janne Hirvi, 05/08/2007 01:03 PM
CrystalPVCSurface.top (2.08 KB) CrystalPVCSurface.top topology for crystal surface Janne Hirvi, 05/08/2007 01:05 PM
mdpFiles.tar.Z (1.58 KB) mdpFiles.tar.Z mdp files for simulations Janne Hirvi, 05/08/2007 01:16 PM
mdpFiles.tar.Z (628 Bytes) mdpFiles.tar.Z mdp files for simulations Janne Hirvi, 05/08/2007 01:26 PM
StartConfigurationGRO.tar.Z (933 KB) StartConfigurationGRO.tar.Z Starting configuration gro-file Janne Hirvi, 05/08/2007 01:42 PM
StartConfigurationTRR.tar.Z (990 KB) StartConfigurationTRR.tar.Z Exact strating configuration trr-file Janne Hirvi, 05/08/2007 01:43 PM
PositionRestraintsGRO.tar.Z (268 KB) PositionRestraintsGRO.tar.Z Coordinates for position restraints Janne Hirvi, 05/08/2007 01:43 PM
CrystalPVCSurfacePART1.itp (3.23 MB) CrystalPVCSurfacePART1.itp Beginning of itp-file Janne Hirvi, 05/08/2007 01:55 PM
CrystalPVCSurfacePART2.itp (4.32 MB) CrystalPVCSurfacePART2.itp End of itp-file Janne Hirvi, 05/08/2007 01:56 PM

History

#1 Updated by Janne Hirvi over 12 years ago

Created an attachment (id=185)
Commands to reproduce the problem

#2 Updated by Janne Hirvi over 12 years ago

Created an attachment (id=186)
topology for crystal surface

#3 Updated by Janne Hirvi over 12 years ago

Created an attachment (id=187)
mdp files for simulations

#4 Updated by Janne Hirvi over 12 years ago

Created an attachment (id=188)
mdp files for simulations

#5 Updated by Janne Hirvi over 12 years ago

Created an attachment (id=189)
Starting configuration gro-file

#6 Updated by Janne Hirvi over 12 years ago

Created an attachment (id=190)
Exact strating configuration trr-file

#7 Updated by Janne Hirvi over 12 years ago

Created an attachment (id=191)
Coordinates for position restraints

#8 Updated by Janne Hirvi over 12 years ago

Created an attachment (id=192)
Beginning of itp-file

#9 Updated by Janne Hirvi over 12 years ago

Created an attachment (id=193)
End of itp-file

#10 Updated by Berk Hess over 12 years ago

I have reproduced the problem.

It is not related to the output control, but it indirectly shows up there.
The system runs correctly on 1, 2 or 3 nodes, but not on 4.

I suspect the problem is that you have molecules that are split over nodes
and that there are bonded interactions involving atoms that reside
on different nodes, which do not communicate for non-bonded interactions.
Apparently the person who coded the communication did not think of this
situation and therefore some coordinates for bonded interactions are not
correct. These "not communicated" coordinates are communicated at nstxout,
which results in the jumps.

We will try to solve this problem.
In the meantime you can run on 1, 2 or 3 nodes.

Berk.

#11 Updated by Berk Hess over 12 years ago

I noticed the atom numbering in your surface does not run along
the polymers, but perpendicalur.
Changing this to the "obvious" numbering probably solves the problem
and will also increase performance.

Berk.

#12 Updated by David van der Spoel over 12 years ago

Janne,

I can not run the examples you sent in gromacs 3.3 due to LJ14 out of range. It
seems though that we have found the source of the problem which has to do with
that not all coordinates necessary to calculate the bonded interactions were
passed around when on 4 processors. This is a general problem, which however
only will show up in special cases like this one where you have one single
molecule. A protein in water will not have this problem. The workaround that
Berk mentions works because it ensures that the atomnumbering is more sequential.

Also, your commands.txt is unintelligable. A binary file?

#13 Updated by David van der Spoel over 12 years ago

This has now been fixed in CVS (for 3.3 and 4.0) although your system still
crashes for me in 3.3 due to another problem that I noted before.

#14 Updated by Berk Hess over 12 years ago

tcsh% file Commands.txt
Commands.txt: MPEG ADTS, layer I, v1, 128 kBits, 44.1 kHz, Stereo

#15 Updated by Janne Hirvi over 12 years ago

I have constructed surface with different program by duplicating a unit cell or
actually a part of the surface and so atom numbering run in these parts and not
along the polymer chains. Only atoms in the same charge group (CH2 or CHCl) have
sequential atom numbers.

Do you think that renumbering along polymer chains would increase performance
significantly when I use also pretty time consuming PME3dc method for Coulombic
interactions. If so I have to think about that but actually in this test case I
can manage with 3 processors and accept that I don't have perfect performance.
By the way, is there any simple implemented way to rearrange atom numbers or
should I just code it if I need it?

However it seems that this has nothing to do with overall energy conservation
problems which I have to solve somehow...

#16 Updated by Janne Hirvi over 12 years ago

What is the difference between versions 3.3 and 3.3.1 because Berk also
reproduced the problem with version 3.3.1 but system crashes for you with
version 3.3?

I had some problems with attacments but I can open Commands.txt normally through
bugzilla - do you want me to send it to you?

#17 Updated by Berk Hess over 12 years ago

Why do you say that renumbering has nothing to do with energy conservation?

It should not, but unfortunately in 3.3.1 it does.
If you renumber, the energy will be conserved.

There is no tool that does this.
You will have to do it yourself.

Berk.

#18 Updated by Janne Hirvi over 12 years ago

Sorry, my expression was misleading. You have proved that renumbering will
remove these momentary increases in total energy when 4 processors are used, but
however regardless of number of processors I observe small drift which I want to
get rid of ("overall energy conservation"). I somehow "hoped" that also this
small drift was due to some kind of communication error in code but now I just
have to try to find out what causes this drift.

Thanks for your help Berk and David!

#19 Updated by Berk Hess over 12 years ago

The system you mailed still uses a plain cut-off for LJ
instead of a shift function.
This will normally lead to a small negative energy drift.

Berk.

#20 Updated by Janne Hirvi over 12 years ago

Actually I observe a small positive drift. I have also tried switch function but
system was at that time equilibriated in NVT ensemble with cut-offs so I still
have to try what hapens if I equilibriate it well with switch (or shift)
function...

#21 Updated by Berk Hess over 12 years ago

I think I have found another problem in your setup.

You are using position restraints, right?
Then you can not use comm removal, since your comm is not free to move.
This might explain your energy drift.

Berk.

#22 Updated by Janne Hirvi over 12 years ago

Thanks Berk!

I will try without COM motion removal.

Janne

#23 Updated by Janne Hirvi over 12 years ago

Now I have renumbered atoms along polymer chains and system employs switch
function for LJ interactions. To be able to understand the effect of center of
mass motion removal with position restraints I performed simulations with both
comm_mode options (linear and none).

I first equilibrated the system in NVT ensemble (100ps with single precision)
with both options and everything worked well without any notable difference
between the simulations. Then I tried to continue simulations in NVE ensemble
with double precision, but both simulations always crashed immediately due to
huge VCM. Actually it seems like some kind of overflow because the value has
length of many rows. Simulations with single precision start normally. Any ideas
what might be causing this?

#24 Updated by David van der Spoel over 12 years ago

Please post your new tpr file such that we can try to reproduce it.

#25 Updated by Janne Hirvi over 12 years ago

Sorry that I bothered you but its running now even with double precision.

I tried to produce .tpr file which you requested but with different computer
than earlier because there was already some other simulations running on those
two computers. I am not yet sure if energy is conserved because simulations are
only at the beginning, but I will report about that as soon as possible.

So the problem wasn't general but there is possible some problems with our
double precision installation. On these computers other energy components for
the starting configuration are realistic (equal to running simulation) but
energies for 1-4 vdW and 1-4 Coulombic interactions are in scale of e^(-300) and
sometimes before these energies there is this huge value for VCM. Do you have
any ideas what could be the problem?

Thanks, Janne

#26 Updated by Janne Hirvi over 12 years ago

The main problem seemed to be illogical atom numbering but also the center of
mass motion removal with position restraints affected the energy conservation.

System with atoms renumbered along polymer chains and COM motion removed
(comm_mode = linear) exhibits soon negative drift (total energy 145000 and drift
-10 in 50ps) but without COM motion removal (comm_mode = none) the total energy
is conserved.

Thanks for your valuable help!

Janne

#27 Updated by David van der Spoel over 12 years ago

Although one could want better error diagnostics, this could be resolved using
"better" input.

Also available in: Atom PDF