Output control parameters have an effect on energies
Distinct momentary increases in the total energy are observed at intervals which
are dependent on output frequency of coordinates for example.
Double precision simulations were performed for crystalline surface consisting
of infinite polymer chains. Positions restraints were used for some atoms (160
of 46080 atoms) to maintain "original crystal shape" and option pbc=full due to
the infinite chains. After equilibration simulation in NVT ensemble with 2
processors, NVE simulations with 4 processors show dependence of the total
energy on output frequency.
Multiple simulations with different values of nstxout (nstxout=nstvout=nstlog =
10, 100 or 1000 and nstenergy = 1) give different kind of energy curves. Energy
values are equal at the beginning but deviated just after the time step for
which output is written to .trr or .log file. Total energy always increases
sharply at that moment and equilibrates to the corresponding value until next
sharp increase is observed when output is written to .trr or .log file next
time. Finally these increases in the total energy will lead to the
termination/explosion of the simulation.
Actually drift in total energy in simulation with 4 processors, even without
output to .trr or .log files, is already at the beginning of the simulation
(<1ps) larger than corresponding drift in 50ps simulation with 2 processors.
A little bit more about the problem can be found from discussion with Berk:
This is the first time I am using this and I suppose that I can add my files
later but if not I really would like to send the files which can be used to
reproduce this anomalous behavior.
#10 Updated by Berk Hess over 12 years ago
I have reproduced the problem.
It is not related to the output control, but it indirectly shows up there.
The system runs correctly on 1, 2 or 3 nodes, but not on 4.
I suspect the problem is that you have molecules that are split over nodes
and that there are bonded interactions involving atoms that reside
on different nodes, which do not communicate for non-bonded interactions.
Apparently the person who coded the communication did not think of this
situation and therefore some coordinates for bonded interactions are not
correct. These "not communicated" coordinates are communicated at nstxout,
which results in the jumps.
We will try to solve this problem.
In the meantime you can run on 1, 2 or 3 nodes.
#12 Updated by David van der Spoel over 12 years ago
I can not run the examples you sent in gromacs 3.3 due to LJ14 out of range. It
seems though that we have found the source of the problem which has to do with
that not all coordinates necessary to calculate the bonded interactions were
passed around when on 4 processors. This is a general problem, which however
only will show up in special cases like this one where you have one single
molecule. A protein in water will not have this problem. The workaround that
Berk mentions works because it ensures that the atomnumbering is more sequential.
Also, your commands.txt is unintelligable. A binary file?
#15 Updated by Janne Hirvi over 12 years ago
I have constructed surface with different program by duplicating a unit cell or
actually a part of the surface and so atom numbering run in these parts and not
along the polymer chains. Only atoms in the same charge group (CH2 or CHCl) have
sequential atom numbers.
Do you think that renumbering along polymer chains would increase performance
significantly when I use also pretty time consuming PME3dc method for Coulombic
interactions. If so I have to think about that but actually in this test case I
can manage with 3 processors and accept that I don't have perfect performance.
By the way, is there any simple implemented way to rearrange atom numbers or
should I just code it if I need it?
However it seems that this has nothing to do with overall energy conservation
problems which I have to solve somehow...
#16 Updated by Janne Hirvi over 12 years ago
What is the difference between versions 3.3 and 3.3.1 because Berk also
reproduced the problem with version 3.3.1 but system crashes for you with
I had some problems with attacments but I can open Commands.txt normally through
bugzilla - do you want me to send it to you?
#18 Updated by Janne Hirvi over 12 years ago
Sorry, my expression was misleading. You have proved that renumbering will
remove these momentary increases in total energy when 4 processors are used, but
however regardless of number of processors I observe small drift which I want to
get rid of ("overall energy conservation"). I somehow "hoped" that also this
small drift was due to some kind of communication error in code but now I just
have to try to find out what causes this drift.
Thanks for your help Berk and David!
#23 Updated by Janne Hirvi over 12 years ago
Now I have renumbered atoms along polymer chains and system employs switch
function for LJ interactions. To be able to understand the effect of center of
mass motion removal with position restraints I performed simulations with both
comm_mode options (linear and none).
I first equilibrated the system in NVT ensemble (100ps with single precision)
with both options and everything worked well without any notable difference
between the simulations. Then I tried to continue simulations in NVE ensemble
with double precision, but both simulations always crashed immediately due to
huge VCM. Actually it seems like some kind of overflow because the value has
length of many rows. Simulations with single precision start normally. Any ideas
what might be causing this?
#25 Updated by Janne Hirvi over 12 years ago
Sorry that I bothered you but its running now even with double precision.
I tried to produce .tpr file which you requested but with different computer
than earlier because there was already some other simulations running on those
two computers. I am not yet sure if energy is conserved because simulations are
only at the beginning, but I will report about that as soon as possible.
So the problem wasn't general but there is possible some problems with our
double precision installation. On these computers other energy components for
the starting configuration are realistic (equal to running simulation) but
energies for 1-4 vdW and 1-4 Coulombic interactions are in scale of e^(-300) and
sometimes before these energies there is this huge value for VCM. Do you have
any ideas what could be the problem?
#26 Updated by Janne Hirvi over 12 years ago
The main problem seemed to be illogical atom numbering but also the center of
mass motion removal with position restraints affected the energy conservation.
System with atoms renumbered along polymer chains and COM motion removed
(comm_mode = linear) exhibits soon negative drift (total energy 145000 and drift
-10 in 50ps) but without COM motion removal (comm_mode = none) the total energy
Thanks for your valuable help!