Bug #109

artificial forces in membrane simulation, dependence on number of procs

Added by Rainer Böckmann about 13 years ago. Updated over 12 years ago.

Status: Closed
Priority: High
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

We found strange forces acting on one or more lipids in a lipid bilayer system
(~21k atoms) running on 8 processors (3.3.1, PME), resulting in one or more lipids
leaving the membrane on a timescale of 10-100 ps (otherwise stable simulation). The
force acts only on atoms at a specific position in the pdb-file, i.e. exchanging
two lipids results in a different lipid leaving the membrane. A restart on a
different number of processors (e.g. on 4) also avoids this artefact, as does
re-distributing the atoms over the processors using the -load keyword in grompp
(distribution of 'whole' lipids on the processors).

md.mdp (8.86 KB) md.mdp mdp-file Rainer Böckmann, 10/05/2006 04:51 PM
md_8cpu.tpr (3.3 MB) md_8cpu.tpr tpr-file without using -load ... Rainer Böckmann, 10/05/2006 04:52 PM
md_8cpu_load.tpr (3.3 MB) md_8cpu_load.tpr tpr-file using -load "2600 2600 2635 2637 2637 2637 2637 2640" Rainer Böckmann, 10/05/2006 04:55 PM
topol.top (609 Bytes) topol.top topology Rainer Böckmann, 10/05/2006 04:55 PM
popc_128.pdb (1.36 MB) popc_128.pdb initial pdb-file Rainer Böckmann, 10/05/2006 04:56 PM
popc.itp (13.1 KB) popc.itp itp file popc Rainer Böckmann, 10/05/2006 04:56 PM
ffgmx.itp (169 Bytes) ffgmx.itp force field Rainer Böckmann, 10/05/2006 04:57 PM
ffgmxbon.itp (30.2 KB) ffgmxbon.itp ffgmxbon.itp Rainer Böckmann, 10/05/2006 04:58 PM
ffgmxnb.itp (92.9 KB) ffgmxnb.itp ffgmxnb.itp Rainer Böckmann, 10/05/2006 04:58 PM
100f.pdb (1.34 MB) 100f.pdb snapshot with extracted lipid after 100ps Rainer Böckmann, 10/05/2006 05:01 PM
100.pdb (1.34 MB) 100.pdb snapshot after 100ps with modified load, no particularities Rainer Böckmann, 10/05/2006 05:01 PM
md.tpr (21.9 MB) md.tpr See comment #17 Matteo Guglielmi, 05/30/2007 12:08 AM
md.tpr (27.5 MB) md.tpr Linear 32CPUs tpr file Matteo Guglielmi, 06/01/2007 11:43 PM
pbc.c (19.4 KB) pbc.c pbc.c source code David van der Spoel, 06/03/2007 11:13 AM
WAT20439_frame1.jpg (144 KB) WAT20439_frame1.jpg Water molecule before its explosion. Matteo Guglielmi, 06/03/2007 09:52 PM
WAT20439_frame2.jpg (148 KB) WAT20439_frame2.jpg Exploded water molecule Matteo Guglielmi, 06/03/2007 09:55 PM
WAT20439_frame3.jpg (145 KB) WAT20439_frame3.jpg Neighboring water molecules before the explosion Matteo Guglielmi, 06/03/2007 09:59 PM
WAT20439_frame4.jpg (149 KB) WAT20439_frame4.jpg Neighboring water molecules after the explosion Matteo Guglielmi, 06/03/2007 10:02 PM
simple.tgz (45.3 KB) simple.tgz Simple test case David van der Spoel, 06/06/2007 12:18 PM

History

#1 Updated by Rainer Böckmann about 13 years ago

Created an attachment (id=78)
mdp-file

#2 Updated by Rainer Böckmann about 13 years ago

Created an attachment (id=79)
no load-command

#3 Updated by Rainer Böckmann about 13 years ago

Created an attachment (id=80)
grompp ... -load "2600 2600 2635 2637 2637 2637 2637 2640"

grompp -f md.mdp -c popc_128.pdb -p topol.top -np 8 -o md_8cpu_load.tpr -n -load "2600 2600 2635 2637 2637 2637 2637 2640"

with the tpr-file produced in this way no particularities observed

#4 Updated by Rainer Böckmann about 13 years ago

Created an attachment (id=81)
topology

#5 Updated by Rainer Böckmann about 13 years ago

Created an attachment (id=82)
initial pdb-file

#6 Updated by Rainer Böckmann about 13 years ago

Created an attachment (id=83)
itp file popc

#7 Updated by Rainer Böckmann about 13 years ago

Created an attachment (id=84)
force field

#8 Updated by Rainer Böckmann about 13 years ago

Created an attachment (id=85)
ff

#9 Updated by Rainer Böckmann about 13 years ago

Created an attachment (id=86)
ff

#10 Updated by Rainer Böckmann about 13 years ago

Created an attachment (id=87)
snapshot with extracted lipid after 100ps

#11 Updated by Rainer Böckmann about 13 years ago

Created an attachment (id=88)
snapshot after 100ps with modified load, no particularities

#12 Updated by Erik Lindahl about 13 years ago

Hi Rainer,

a) Are you using special communication hardware (myrinet/infiniband)?
b) What MPI implementation are you using?
c) What FFT library are you using, and did you use any special flags to compile if it is FFTW?
d) Which compiler & version did you use to build gromacs? Any special configure options?

If you used any special hardware and/or libraries, can you reproduce the error with gcc, LAM-MPI, and
standard communication (either ethernet, or using shared memory by running all 8 processes on a
single machine).

Cheers,

Erik

#13 Updated by Rainer Böckmann about 13 years ago

(In reply to comment #12)

Hi Rainer,

a) Are you using special communication hardware (myrinet/infiniband)?

Myrinet

b) What MPI implementation are you using?

mpich/gm/gcc/64/1.2.6..14b

c) What FFT library are you using, and did you use any special flags to compile if it is FFTW?

fftw-3.0.1
./configure --enable-float --prefix=/usr/local/fftw-3.0.1 --enable-mpi

d) Which compiler & version did you use to build gromacs? Any special configure options?

mpich/gm/gcc/64/1.2.6..14b//bin/mpicc
gcc (GCC) 4.0.2 20050901 (prerelease) (SUSE Linux)

no special configure options, standard compilation:
export CPPFLAGS=-I/usr/local/fftw-3.0.1/include
export LDFLAGS=-L/usr/local/fftw-3.0.1/lib
./configure --enable-float --prefix=/usr/local/gromacs-3.3.1 --enable-mpi

If you used any special hardware and/or libraries, can you reproduce the error with gcc, LAM-MPI, and standard communication (either ethernet, or using shared memory by running all 8 processes on a single machine)?

we tried it before with the mixed system on a dual-xeon cluster with gigabit
connection (lam-mpi) and ran into the same trouble. We'll check again with the
same popc bilayer system and report the results.

Best,
Rainer

#14 Updated by Rainer Böckmann about 13 years ago

If you used any special hardware and/or libraries, can you reproduce the error with gcc, LAM-MPI, and standard communication (either ethernet, or using shared memory by running all 8 processes on a single machine)?

- we reproduced the error now also on a gigabit dual-xeon cluster (lam-mpi)
cheers,
rainer

#15 Updated by Chris Neale almost 13 years ago

Did you try with a version of mdrun that was compiled with -DEBUG_PBC ?

I have some runs that reliably (but stochastically) give errors about an atom
being found in a grid just one block outside of the expected boundary only in
parallel runs, and often other nodes have log files that indicate that they have
just updated the grid size (constant pressure simulation). This error disappears
when I run with a -DEBUG_PBC version. My assumption here is that there is some
non-blocking MPI communication that is not getting through in time. The
-DEBUG_PBC version spends a lot of time checking some things and although it
never reports having found some problem, I assume that a side-effect of these
extra calculations is to slow things down enough at the proper stage so that the
MPI message gets through. I have solved my problem by adjusting my simulation
cell so that it doesn't fall close to the grid boundaries. Perhaps you are
experiencing some analogous problem?

#16 Updated by Rainer Böckmann almost 13 years ago

Thanks for the hint. Indeed, the grid size is sometimes updated on one or two
nodes. We'll check whether the debug option improves things.
cheers,
rainer

(In reply to comment #15)

Did you try with a version of mdrun that was compiled with -DEBUG_PBC ?

I have some runs that reliably (but stochastically) give errors about an atom
being found in a grid just one block outside of the expected boundary only in
parallel runs, and often other nodes have log files that indicate that they have
just updated the grid size (constant pressure simulation). This error disappears
when I run with a -DEBUG_PBC version. My assumption here is that there is some
non-blocking MPI communication that is not getting through in time. The
-DEBUG_PBC version spends a lot of time checking some things and although it
never reports having found some problem, I assume that a side-effect of these
extra calculations is to slow things down enough at the proper stage so that the
MPI message gets through. I have solved my problem by adjusting my simulation
cell so that it doesn't fall close to the grid boundaries. Perhaps you are
experiencing some analogous problem?

#17 Updated by Matteo Guglielmi over 12 years ago

Same problem here as pointed out by Chris Neale (comment #15).

Simulation:

- equilibration of a protein in water
- position restraints on
(protein kept frozen in the center of the box - fc = 1000)
- isotropic pressure (berendsen barostat - tau_p = 5 ps)
- room temperature (berendsen thermostat - tau_t 0.1 0.4 ps)

Note:

- initial density of water is wrong since the starting
system geometry has been prepared with the amber9 tools.

Results:

- box size gets smaller in x, y and z (decreasing volume size)
- as soon as the Grid gets updated (between 0.2 and 0.5 ns) the
whole simulation crashes because the ci variable has a value barely
out of bounds:

Inertia tensor (3x3):
Inertia tensor[ 0]={ 7.16050e+06, 7.43421e+04, 3.32575e+04}
Inertia tensor[ 1]={ 7.43421e+04, 5.83999e+06, -3.62208e+04}
Inertia tensor[ 2]={-3.32575e+04, -3.62208e+04, 6.14270e+06}
Grid: 16 x 14 x 14 cells
Grid: 16 x 14 x 15 cells
Grid: 16 x 14 x 14 cells
Grid: 16 x 14 x 15 cells
Grid: 16 x 14 x 14 cells
Grid: 16 x 14 x 15 cells
Grid: 16 x 14 x 14 cells
Grid: 16 x 14 x 15 cells
Grid: 16 x 14 x 14 cells
Grid: 16 x 14 x 15 cells
Grid: 16 x 14 x 14 cells
------------------------------------------------------

Program mdrun_mpi, VERSION 3.3.1
Source code file: nsgrid.c, line: 226

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value 3149. It should have been within [ 0 .. 3136 ]
Please report this to the mailing list ()
-------------------------------------------------------

"You Hear Footsteps Coming From Behind" (Colossal Cave)

### GROMACS COMPILATION ###

- #1 Full intel 9.1 series compilers (icc, icpc, ifort)

MACHINE CPU TYPE = Intel(R) Xeon(R) CPU 5140 @ 2.33GHz

FFTW-3.1.2
./configure \
CC=icc \
CXX=icpc \
F77=ifort \
FC=ifort \
CFLAGS="-axT -unroll -ip -O3" \
CXXFLAGS="$CFLAGS" \
FFLAGS="$CFLAGS" \
FCFLAGS="$CFLAGS" \
--enable-float \
--enable-threads \
--prefix=${HOME}/Software/fftw-3.1.2-intel
make
make install
make distclean
./configure \
CC=icc \
CXX=icpc \
F77=ifort \
FC=ifort \
CFLAGS="-axT -unroll -ip -O3" \
CXXFLAGS="$CFLAGS" \
FFLAGS="$CFLAGS" \
FCFLAGS="$CFLAGS" \
--enable-threads \
--prefix=${HOME}/Software/fftw-3.1.2-intel
make
make install

OPENMPI-1.2.2
./configure \
CC=icc \
CXX=icpc \
F77=ifort \
FC=ifort \
CFLAGS="-axT -unroll -ip -O3" \
CXXFLAGS="$CFLAGS" \
FFLAGS="$CFLAGS" \
FCFLAGS="$CFLAGS" \
--prefix=${HOME}/Software/openmpi-1.2.2-intel
make all
make install

GROMACS-3.3.1
export CC=icc
export CXX=icpc
export F77=ifort
export FC=ifort
export CFLAGS="-axT -unroll -ip -O3"
export CXXFLAGS="$CFLAGS"
export FFLAGS="$CFLAGS"
export FCFLAGS="$CFLAGS"
export CPPFLAGS="-I${HOME}/Software/fftw-3.1.2-intel/include"
export LDFLAGS="-L${HOME}/Software/fftw-3.1.2-intel/lib"
./configure \
--enable-double \
--with-fft=fftw3 \
--program-suffix='' \
--enable-fortran \
--prefix=${HOME}/Software/gromacs-3.3.1-intel
make
make install
make distclean
./configure \
--enable-mpi \
--enable-double \
--with-fft=fftw3 \
--program-suffix=_mpi \
--enable-fortran \
--prefix=${HOME}/Software/gromacs-3.3.1-intel
make mdrun
make install-mdrun

- #2 Full gcc 3.4.6 series (gcc, cc, f77)

MACHINE CPU TYPE = Intel(R) Xeon(R) CPU 5140 @ 2.33GHz

FFTW-3.1.2
./configure \
--enable-float \
--enable-threads \
--prefix=${HOME}/Software/fftw-3.1.2-gnu
make
make install
make distclean
./configure \
--enable-threads \
--enable-sse2 \
--prefix=${HOME}/Software/fftw-3.1.2-gnu
make
make install

OPENMPI-1.2.2
./configure \
--prefix=${HOME}/Software/openmpi-1.2.2-gnu
make all
make install

GROMACS-3.3.1
export CPPFLAGS="-I${HOME}/Software/fftw-3.1.2-intel/include"
export LDFLAGS="-L${HOME}/Software/fftw-3.1.2-intel/lib"
./configure \
--enable-double \
--with-fft=fftw3 \
--program-suffix='' \
--prefix=${HOME}/Software/gromacs-3.3.1-gnu
make
make install
make distclean
./configure \
--enable-mpi \
--enable-double \
--with-fft=fftw3 \
--program-suffix=_mpi \
--prefix=${HOME}/Software/gromacs-3.3.1-gnu
make mdrun
make install-mdrun

- #3 Full intel 8.0 series compilers (icc, icpc, ifort)

MACHINE CPU TYPE = Intel(R) Xeon(TM) CPU 3.06GHz

FFTW-3.1.2
./configure \
CC=icc \
CXX=icpc \
F77=ifort \
FC=ifort \
--enable-float \
--enable-threads \
--prefix=${HOME}/Software/fftw-3.1.2-intel
make
make install
make distclean
./configure \
CC=icc \
CXX=icpc \
F77=ifort \
FC=ifort \
--enable-threads \
--prefix=${HOME}/Software/fftw-3.1.2-intel
make
make install

MPICHGM-1.2.7..15
setenv CC icc
setenv FC ifort
setenv F90 ifort
setenv CXX icpc
setenv GM_HOME /opt/gm
setenv PREFIX /home/matteo/Software/mpichgm-1.2.7..15-intel
setenv RSHCOMMAND ssh
setenv CFLAGS "-I$GM_HOME/binary/include -I$GM_HOME/include"
./configure \
--with-device=ch_gm \
--enable-sharedlib \
-prefix=$PREFIX \
-opt="-O2" -c++=icc \
-lib="-Wl,-rpath,$GM_HOME/lib/ -L$GM_HOME/lib/ -lgm"
make
make install

GROMACS-3.3.1
export CC=icc
export CXX=icpc
export F77=ifort
export FC=ifort
export CPPFLAGS="-I${HOME}/Software/fftw-3.1.2-intel/include"
export LDFLAGS="-L${HOME}/Software/fftw-3.1.2-intel/lib"
./configure \
--enable-double \
--with-fft=fftw3 \
--program-suffix='' \
--enable-fortran \
--prefix=${HOME}/Software/gromacs-3.3.1-intel
make
make install
make distclean
./configure \
--enable-mpi \
--enable-double \
--with-fft=fftw3 \
--program-suffix=_mpi \
--enable-fortran \
--prefix=${HOME}/Software/gromacs-3.3.1-intel
make mdrun
make install-mdrun

- #4 Full gcc 3.2.3 series (gcc, cc, f77)

MACHINE CPU TYPE = Intel(R) Xeon(R) CPU 5140 @ 2.33GHz

FFTW-3.1.2
./configure \
--enable-float \
--enable-threads \
--prefix=${HOME}/Software/fftw-3.1.2-gnu
make
make install
make distclean
./configure \
--enable-threads \
--prefix=${HOME}/Software/fftw-3.1.2-gnu
make
make install

MPICHGM-1.2.7..15
setenv CC gcc
setenv FC g77
setenv GM_HOME /opt/gm
setenv PREFIX /home/matteo/Software/mpichgm-1.2.7..15-gnu
setenv RSHCOMMAND ssh
setenv CFLAGS "-I$GM_HOME/binary/include -I$GM_HOME/include"
./configure \
--with-device=ch_gm \
--enable-sharedlib \
-prefix=$PREFIX \
-opt="-O"\
-lib="-Wl,-rpath,$GM_HOME/lib -L$GM_HOME/lib/ -lgm"
make
make install

GROMACS-3.3.1
export CPPFLAGS="-I${HOME}/Software/fftw-3.1.2-intel/include"
export LDFLAGS="-L${HOME}/Software/fftw-3.1.2-intel/lib"
./configure \
--enable-double \
--with-fft=fftw3 \
--program-suffix='' \
--prefix=${HOME}/Software/gromacs-3.3.1-gnu
make
make install
make distclean
./configure \
--enable-mpi \
--enable-double \
--with-fft=fftw3 \
--program-suffix=_mpi \
--prefix=${HOME}/Software/gromacs-3.3.1-gnu
make mdrun
make install-mdrun

### CONCLUSIONS ###
Same problem on those architectures, using gromacs compiled
either with intel or gnu.

PS: Same initial geometry with "same" input file
runs smoothly using other MD parallel codes.

#18 Updated by Matteo Guglielmi over 12 years ago

Created an attachment (id=194)
See comment #17

ci variable barely out of bounds after 0.2 - 0.5 ns

- parallel run (on 4 and 8 CPUs)
- gromacs compiled either with the Intel compilers 9.1 & 8.0 or gcc 3.4.6 &
3.2.3
- pressure coupling on
- position restraints on

#19 Updated by David van der Spoel over 12 years ago

I suspect the problems that Matteo is experiencing can be remedied by changing
angular comm removal to linear. I'm not sure this is the same problem that
Rainer and Chris are experiencing though.

Matteo, how long does this simulation have to run before it crashes?

#20 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #19)

I suspect the problems that Matteo is experiencing can be remedied by changing
angular comm removal to linear. I'm not sure this is the same problem that
Rainer and Chris are experiencing though.

Matteo, how long does this simulation have to run before it crashes?

... ~0.2ns

Anyway, keeping those position restraints on along with angular comm removal,
ALL my simulations (eight) still crash even after 2 ns (same error message every
0.2 ns - 0.5 ns).

In one of them, right before the crash, I've noticed a single exploding water
molecule... in the last frame it was located something like ~10A away from the
box (parallel job on 32 CPUs).

#21 Updated by David van der Spoel over 12 years ago

If it is that reproducible, then please try without angular COM removal.

My simulation is at 335 ps, but still fine. It has run for 60 hours on 4
processors, which makes it less than ideal as a test simulation.

#22 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #21)

If it is that reproducible, then please try without angular COM removal.

My simulation is at 335 ps, but still fine. It has run for 60 hours on 4
processors, which makes it less than ideal as a test simulation.

I'm trying right now with Linear COMM removal...

#23 Updated by Chris Neale over 12 years ago

(In reply to comment #21)

If it is that reproducible, then please try without angular COM removal.

If it is useful to know, my previous problems were with comm_mode=linear
http://www.gromacs.org/pipermail/gmx-users/2006-October/024333.html

#24 Updated by David van der Spoel over 12 years ago

we clearly need more and better test systems...

#25 Updated by Chris Neale over 12 years ago

(In reply to comment #24)

we clearly need more and better test systems...

I'm working on that right now.

#26 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #24)

we clearly need more and better test systems...

It crashed just now after 0.1 ns; two water molecules went out of the box!
Their distance from the box is about 20 A.

The serial run is still fine after 0.15 ns... I'll keep it running.

So... not yet 100% sure, but it seems to be a problem which affects only
the parallel implementation of gromacs.

#27 Updated by David van der Spoel over 12 years ago

After how long is the crash? Plz upload tpr file. Did you specify hardware and
compiler details already?

#28 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #27)

After how long is the crash? Plz upload tpr file. Did you specify hardware and
compiler details already?

About "how I did compile gmx" please have a look at comment #17.

About the tpr file... I have 8 tpr files which do crash (100% of my simulations)
but the system size is quite big (120K atoms) and is almost the same for all
my systems.

One of them you have it already (21.87MB, see attachment).

Please let me know how I can compile gromacs with the "-DEBUG_PBC" keword.

Thanks,
MG.

#29 Updated by Chris Neale over 12 years ago

(In reply to comment #28)

Please let me know how I can compile gromacs with the "-DEBUG_PBC" keyword.

from what I can recall it's in the CPPFLAGS section, like this:

export FFTW_LOCATION=/projects/pomes/cneale/exe/fftw-3.1.2/exec
export GROMACS_LOCATION=/projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2
export CPPFLAGS="-I$FFTW_LOCATION/include -DDEBUG_PBC"
export LDFLAGS=-L$FFTW_LOCATION/lib

./configure --prefix=$GROMACS_LOCATION --program-suffix="_debugpbc"
make
make install

#30 Updated by David van der Spoel over 12 years ago

Matteo, can you upload the linear comm 32 cpu tpr as well? Or anything that
crashes fast in clock time. I have 32 CPUs available right now I think.
(in 4.0 we will at last get rid of tprs with predefined CPU number...)

#31 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #30)

Matteo, can you upload the linear comm 32 cpu tpr as well? Or anything that
crashes fast in clock time. I have 32 CPUs available right now I think.
(in 4.0 we will at last get rid of tprs with predefined CPU number...)

Sure.

#32 Updated by Matteo Guglielmi over 12 years ago

Created an attachment (id=195)
Linear 32CPUs tpr file

#33 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #29)

(In reply to comment #28)

Please let me know how I can compile gromacs with the "-DEBUG_PBC" keyword.

from what I can recall it's in the CPPFLAGS section, like this:

export FFTW_LOCATION=/projects/pomes/cneale/exe/fftw-3.1.2/exec
export GROMACS_LOCATION=/projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2
export CPPFLAGS=-I$FFTW_LOCATION/include -DDEBUG_PBC
export LDFLAGS=-L$FFTW_LOCATION/lib

./configure --prefix=$GROMACS_LOCATION --program-suffix="_debugpbc"
make
make install

Would it be enough to give a "make mdrun && make install-mdrun" (after having
properly set up the CPPFLAGS), or do I need to recompile grompp as well???

#34 Updated by Chris Neale over 12 years ago

(In reply to comment #33)

Would be enough to give a "make mdrun && make install-mdrun" (after having
properly set up the CPPFLAG) or do I need to recompile grompp as well???

I don't know. It's pretty important to get the test right so I recommend that
you recompile the whole thing (that's what I did).

#35 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #34)

(In reply to comment #33)

Would be enough to give a "make mdrun && make install-mdrun" (after having
properly set up the CPPFLAG) or do I need to recompile grompp as well???

I don't know. It's pretty important to get the test right so I recommend that
you recompile the whole thing (that's what I did).

Ok, got it.

I'll let you know soon how things develop.

#36 Updated by Chris Neale over 12 years ago

(In reply to comment #24)

we clearly need more and better test systems...

A box of water initially set up at half density does not reproduce the problem.
The system shrinks as follows in serial and in parallel (0-270ps, data every 10ps):

The average volume was 91.125000
The average volume was 83.607343
The average volume was 79.431308
The average volume was 75.783275
The average volume was 72.734277
The average volume was 69.411146
The average volume was 65.812329
The average volume was 61.980847
The average volume was 59.010001
The average volume was 55.604337
The average volume was 51.586525
The average volume was 49.174332
The average volume was 48.879807
The average volume was 48.735659
The average volume was 48.651260
The average volume was 48.745670
The average volume was 48.711757
The average volume was 48.552841
The average volume was 48.672634
The average volume was 48.610112
The average volume was 48.672263
The average volume was 48.872321
The average volume was 48.655607
The average volume was 48.863955
The average volume was 48.665827
The average volume was 48.788901
The average volume was 48.554459
The average volume was 48.725477
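
For reference, a hedged sketch of how per-window averages like the ones above could be pulled from an energy file with the 3.3-era tools; the file name ener.edr and the 10 ps window width are illustrative assumptions, not details taken from the run above:

# print the average volume over successive 10 ps windows of ener.edr
for b in $(seq 0 10 260); do
    echo Volume | g_energy -f ener.edr -b $b -e $((b + 10)) 2>/dev/null | grep '^Volume'
done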

#37 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #29)

(In reply to comment #28)

Please let me know how I can compile gromacs with the "-DEBUG_PBC" keyword.

from what I can recall it's in the CPPFLAGS section, like this:

export FFTW_LOCATION=/projects/pomes/cneale/exe/fftw-3.1.2/exec
export GROMACS_LOCATION=/projects/pomes/cneale/exe/gromacs-3.3.1/exec/fftw-3.1.2
export CPPFLAGS=-I$FFTW_LOCATION/include -DDEBUG_PBC
export LDFLAGS=-L$FFTW_LOCATION/lib

./configure --prefix=$GROMACS_LOCATION --program-suffix="_debugpbc"
make
make install

I went through the gmx code and discovered that the keyword DEBUG_PBC
acts only on the following two files:

./src/gmxlib/pbc.c

#ifdef DEBUG_PBC
  for(d=0; (d<DIM); d++) {
    if ((cg_cm[icg][d] < 0) || (cg_cm[icg][d] >= box[d][d]))
      gmx_fatal(FARGS,"cg_cm[%d] = %15f %15f %15f\n"
                "box = %15f %15f %15f\n",
                icg,cg_cm[icg][XX],cg_cm[icg][YY],cg_cm[icg][ZZ],
                box[XX][XX],box[YY][YY],box[ZZ][ZZ]);
  }
#endif

./src/mdlib/nsgrid.c

#ifdef DEBUG_PBC
#define myrc(ixyz,n) if ((ixyz<0) || (ixyz>=n)) gmx_fatal(FARGS,"%s=%d(max=%d), index=%d, i=%d, cgcm=(%f,%f,%f)",#ixyz,ixyz,n,index,i,cg_cm[index][XX],cg_cm[index][YY],cg_cm[index][ZZ])
  myrc(ix,nrx);
  myrc(iy,nry);
  myrc(iz,nrz);
#undef myrc
#endif

If these few lines are enough to fix the problem, the solution must be there :-)

#38 Updated by Chris Neale over 12 years ago

(In reply to comment #37)

I went through the gmx code and discovered that the keyword DEBUG_PBC
acts only on the following two files:

./src/gmxlib/pbc.c

#ifdef DEBUG_PBC
  for(d=0; (d<DIM); d++) {
    if ((cg_cm[icg][d] < 0) || (cg_cm[icg][d] >= box[d][d]))
      gmx_fatal(FARGS,"cg_cm[%d] = %15f %15f %15f\n"
                "box = %15f %15f %15f\n",
                icg,cg_cm[icg][XX],cg_cm[icg][YY],cg_cm[icg][ZZ],
                box[XX][XX],box[YY][YY],box[ZZ][ZZ]);
  }
#endif

./src/mdlib/nsgrid.c

#ifdef DEBUG_PBC
#define myrc(ixyz,n) if ((ixyz<0) || (ixyz>=n)) gmx_fatal(FARGS,"%s=%d(max=%d), index=%d, i=%d, cgcm=(%f,%f,%f)",#ixyz,ixyz,n,index,i,cg_cm[index][XX],cg_cm[index][YY],cg_cm[index][ZZ])
  myrc(ix,nrx);
  myrc(iy,nry);
  myrc(iz,nrz);
#undef myrc
#endif

If these few lines are enough to fix the problem, the solution must be there :-)

As I previously suggested, the "fix" may be in slowing down the program at the
appropriate time. I believe this because, for me and the system that I was
using, the problem was reliable with the non-DDEBUG_PBC version but the problem
never arose in the DDEBUG_PBC version. In fact I had expected -DDEBUG_PBC to
assist me in finding the actual error, but instead it abolished the error.
Take that as you will, but all -DDEBUG_PBC does is to check for an error so if
it abolishes an error that seems to me like a timing issue -
this is why I
believe that the error is from some non-blocking MPI call regarding updating the
grid size for PME. I do however leave lots of room for the possibility that
everything is working as it should and the three of us simply have systems with
our own errors in them.

I will attempt to build a fast-running small system containing lipids to
reproduce the error in case somehow the presence of long molecules spanning the
PBC boundary is important (although I don't see how it could be). However, this
is not simple since we still don't have a concrete hypothesis about what might
lead to this code-error, if it is a code-error at all.

#39 Updated by David van der Spoel over 12 years ago

It would indeed be great to have such a system where one can play with e.g.
pbc=xyz or pbc=full, and other settings. For instance having one or more groups
for comm removal. Obviously these suggestions are meant as hints to those having
any system that crashes at all... For me the 32 CPU system shows another
problem, namely that only 3 processes on my 4 cpu nodes do anything. Grrrr.

#40 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #38)

As I previously suggested, the "fix" may be in slowing down the program at the
appropriate time. I believe this because, for me and the system that I was
using, the problem was reliable with the non-DDEBUG_PBC version but the problem

You know what Chris, I guess you are right, and I'll tell you why.

As you know, I compiled gromacs in 4 different ways:

gmx gnu 3.2.3 (slower)
gmx gnu 3.4.6 (slower)
gmx intel 8.0 (faster)
gmx intel 9.1 (faster)

and it crashes at least 20% sooner with intel... I mean, it crashes
more often when it is intel-compiled.

In addition to that, when you told me to compile gmx with the DDEBUG_PBC
keyword (and so I did), the intel version is running almost fine, i.e. I
had only one crash at 0.15 ns out of 0.4 ns (never got so far before with this
system; linear, 32 CPUs - see attachment)!

#41 Updated by David van der Spoel over 12 years ago

Created an attachment (id=196)
pbc.c source code

Matteo's system with angular comm removal finished after 500 ps (that is, the
entire run) without problems on 4 cpus. The 32 cpu system (which is completely
different) is still running and is at 125 ps. Maybe you could try to recompile
with this version of src/gmxlib/pbc.c, in which I introduced a temp variable to
overcome problems with gcc bugs.

#42 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #41)

Created an attachment (id=196)
pbc.c source code

Matteo's system with angular comm removal finished after 500 ps (that is, the
entire run) without problems on 4 cpus. The 32 cpu system (which is completely
different) is still running and is at 125 ps. Maybe you could try to recompile
with this version of src/gmxlib/pbc.c, in which I introduced a temp variable to
overcome problems with gcc bugs.

Even with -DDEBUG_PBC I got a single exploding water molecule at ~0.7ns.

So what I did was:

- 1st run (0.5ns; -DDEBUG_PBC on)
had only one single crash around 0.15ns (gromacs crashed!)
After that no problem up to 0.5ns.

- 2nd run (tinit=500, 0.6ns more; -DDEBUG_PBC on)
here things get weird because gromacs did not crash
with the usual "ci varialble..." error message, but it
stopped writing the xtc trajectory file after 0.152ns!
The last frame of the xtc file shows, again, a single
exploding water molecule.

The two files (.trr, .edr) look fine... total, kinetic
and potential energies are stable (kinetic energy doesn't
show any problem) but the .gro file (last frame of the
trajectory produced by gromacs) doesn't show the exploded
water molecule anymore!

- 3rd run (tinit=1.1ns; 0.6ns more; -DDEBUG_PBC on)
grompp did not complain about anything i.e. .trr .edr and
.gro files from the previous run are fine.
(now it's still running)

Could you explain this to me???
What happened during the second run???
Is everything fine or not???

#43 Updated by Chris Neale over 12 years ago

(In reply to comment #42)

(In reply to comment #41)

Created an attachment (id=196)
pbc.c source code

Matteo's system with angular comm removal finished after 500 ps (that is, the
entire run) without problems on 4 cpus. The 32 cpu system (which is completely
different) is still running and at 125 ps. Maybe you could try to recompile
with this version of src/gmxlib/pbc.c in which I introduced a temp variable to
overcome problems with gcc bugs.

Even with -DDEBUG_PBC I got a single exploding water molecule at ~0.7ns.

So what I did was:

- 1st run (0.5ns; -DDEBUG_PBC on)
had only one single crash around 0.15ns (gromacs crashed!)
After that no problem up to 0.5ns.

- 2nd run (tinit=500, 0.6ns more; -DDEBUG_PBC on)
here things get weird because gromacs did not crash
with the usual "ci variable..." error message, but it
stopped writing the xtc trajectory file after 0.152ns!
The last frame of the xtc file shows, again, a single
exploding water molecule.

The two files (.trr, .edr) look fine... total, kinetic
and potential energies are stable (kinetic energy doesn't
show any problem) but the .gro file (last frame of the
trajectory produced by gromacs) doesn't show the exploded
water molecule anymore!

- 3rd run (tinit=1.1ns; 0.6ns more; -DDEBUG_PBC on)
grompp did not complain about anything i.e. .trr .edr and
.gro files from the previous run are fine.
(now it's still running)

Could you explain this to me???
What happened during the second run???
Is everything fine or not???

Look at the final coordinates from your second system. It could be that one of
the waters is now very distant from the rest of the system (I know... PBC, but
still this doesn't occur for healthy runs).

#44 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #43)

Look at the final coordinates from your second system. It could be that one of
the waters is now very distant from the rest of the system (I know... PBC, but
still this doesn't occur for healthy runs).

No... the water molecule is right in place now!

What the heck is going on here?
What kind of bug are we dealing with?

I'm simply lost... how come gromacs failed to write
the .xtc trajectory... did not crash after
the explosion of a single water molecule... and, at
the end of the run, the water molecule went back into
the box???

Before, gromacs was stopping upon the explosion of
the water molecule... this time it did not, but
the xtc trajectory clearly shows a single exploded
water molecule 30 angstroms away from the box!

Since I'm saving frames every 1000 steps (dt=1fs),
i.e. every 1 ps, that water molecule moved something
like 11 nm in 1 ps, which is simply absurd.

#45 Updated by Matteo Guglielmi over 12 years ago

Then I have another question... why did I always get
exploded water molecules???

Are they treated in a different way by the gmx code?
How does pbc work on them?

When I got two exploded water molecules (comment #26)
I also noticed that they were sharing almost the
same x and y coordinates... only the z coordinate was
different... I mean, there was clearly some correlation
between the two.

#46 Updated by Matteo Guglielmi over 12 years ago

Created an attachment (id=197)
Water molecule before its explosion.

It is more than 30A away from any edge of the box.

#47 Updated by Matteo Guglielmi over 12 years ago

Created an attachment (id=198)
Exploded water molecule

It is 30 A outside the box right after 1 ps

#48 Updated by Matteo Guglielmi over 12 years ago

Created an attachment (id=199)
Neighboring water molecules before the explosion

those are the neighboring water molecules within a 5A radius of the exploding
water molecule

#49 Updated by David van der Spoel over 12 years ago

I still haven't been able to reproduce the problem. Do you think you can make a
smaller system? Just cut out 75% of the water and the lipids.

There is a difference between water and other molecules for the computation of the
forces, but not for PBC in any way. How large is your box in the z direction?

#50 Updated by Matteo Guglielmi over 12 years ago

Created an attachment (id=200)
Neighboring water molecules after the explosion

Those are the same neighboring water molecules that were previously selected
within a 5 A radius around the exploding water molecule, shown here after the
explosion (1 ps)

#51 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #49)

I still haven't been able to reproduce the problem. Do you think you can make a
smaller system? Just cut out 75% of the water and the lipids.

There is a difference between water and other molecules for the computation of the
forces, but not for PBC in any way. How large is your box in the z direction?

My box is 105 x 115 x 86 A^3

And this is another good question... how is it possible that the same
problem has not occurred for you yet?

I'm having this problem with gromacs compiled in 4 different ways
and running on 3 different clusters.

May I send you my executables?

#52 Updated by David van der Spoel over 12 years ago

Executables would only work if the libraries (mpi, fft) are statically linked in.

#53 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #52)

Executables would only work if the libraries (mpi, fft) are statically linked in.

They are not but I can do it if you wanna go for it.

#54 Updated by David van der Spoel over 12 years ago

(In reply to comment #14)

If you used any special hardware and/or libraries, can you reproduce the error with gcc, LAM-MPI, and standard communication (either ethernet, or using shared memory by running all 8 processes on a single machine)?

- we reproduced the error now also on a gigabit dual-xeon cluster (lam-mpi)
cheers,
rainer

I redid your run for 200 ps on 8 processors. One lipid (51) dissolves into the
water phase, which is weird, but nothing goes out of the box. In the frame from
100 ps your lipid is inside the box as well (with just one atom!). I'm now
redoing the run on 4 and 16 processors.

#55 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #15)

Did you try with a version of mdrun that was compiled with -DEBUG_PBC ?

I have some runs that reliably (but stochastically) give errors about an atom
being found in a grid just one block outside of the expected boundary only in
parallel runs, and often other nodes have log files that indicate that they have
just updated the grid size (constant pressure simulation). This error disappears
when I run with a -DEBUG_PBC version. My assumption here is that there is some
non-blocking MPI communication that is not getting through in time. The
-DEBUG_PBC version spends a lot of time checking some things and although it
never reports having found some problem, I assume that a side-effect of these
extra calculations is to slow things down enough at the proper stage so that the
MPI message gets through. I have solved my problem by adjusting my simulation
cell so that it doesn't fall close to the grid boundaries. Perhaps you are
experiencing some analogous problem?

I can confirm: all my parallel runs, when converted to "serial", do not crash at all.

#56 Updated by David van der Spoel over 12 years ago

I'm closing in on the bug. It seems that the forces on some atoms are different
when using PME on 8 processors as compared to 1 or 4 processors (this is based
on Rainer's systems). Already at step 0, the force on the lipid that is later
extracted from the membrane is very different. When I turn off PME the problem
goes away, and when I use Coulombtype = Shift the problem is not there either,
throwing the suspicion on the long-range forces. More later.

#57 Updated by David van der Spoel over 12 years ago

Superficially it seems that setting constraints = all-bonds solves the problem.
Please try it. Both Matteo and Rainer apparently used constraints = h-bonds. How
about you, Chris?
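
In mdp terms the suggested test is a one-line change; a minimal fragment, with the constraint algorithm line taken from the mdp settings already posted in this thread and everything else left as before:

constraints          = all-bonds    ; instead of h-bonds
constraint_algorithm = lincs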

#58 Updated by Rainer Böckmann over 12 years ago

Hi All,

we currently avoid this bug by choosing the system size & cut-offs in such a
way that no change in grid size occurs. The bug seems to be tightly coupled to
updated grids.

Best,
Rainer

(In reply to comment #57)

Superficially it seems that setting constraints = all-bonds solves the problem.
Please try it. Both Matteo and Rainer apparently used constraints = h-bonds. How
about you, Chris?

#59 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #57)

Superficially it seems that setting constraints = all-bonds solves the problem.
Please try it. Both Matteo and Rainer apparently used constraints = h-bonds. How
about you, Chris?

I did use both without getting rid of the problem... actually I tried almost
every tuning option in the input file.

Serial runs do not crash, guys: all my systems that were crashing in parallel
(7 out of 7) do not crash anymore (same input file, just moved to one single cpu).

#60 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #57)

Superficially it seems that setting constraints = all-bonds solves the problem.
Please try it. Both Matteo and Rainer apparently used constraints = h-bonds. How
about you, Chris?

I did use both without getting rid of the problem... actually I tried almost
every tuning option in the input file.

Serial runs do not crash, guys: all my systems that were crashing in parallel
(7 out of 7) do not crash anymore (same input file, just moved to one single cpu).

#61 Updated by David van der Spoel over 12 years ago

My impression at the moment is that when you break molecules over processors
(i.e. one molecule on multiple processors) you trigger the bug. This is why the
constraints = all-bonds flag prevents the bug.

It could be that Matteo is experiencing another problem though.

#62 Updated by David van der Spoel over 12 years ago

Created an attachment (id=201)
Simple test case

Here is a simpler test case made from Rainer's system. Running this test on 1
or 2 processors demonstrates that the forces are different (use gmxcheck -f -f2
to compare trr files). The system just contains one POPC molecule and no water.
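
For anyone repeating that comparison, a hedged sketch of the commands involved (the input file names are placeholders, since the exact contents of simple.tgz are not listed here; nstfout must be non-zero in the mdp so that forces end up in the trr files):

# build one tpr per processor count, run both, then compare the trajectories
grompp -f md.mdp -c conf.gro -p topol.top -np 1 -o test_1cpu.tpr
grompp -f md.mdp -c conf.gro -p topol.top -np 2 -o test_2cpu.tpr
mpirun -np 1 mdrun_mpi -s test_1cpu.tpr -o test_1cpu.trr -g test_1cpu.log
mpirun -np 2 mdrun_mpi -s test_2cpu.tpr -o test_2cpu.trr -g test_2cpu.log
# gmxcheck reports per-frame differences between the two runs, including forces
gmxcheck -f test_1cpu.trr -f2 test_2cpu.trr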

#63 Updated by David van der Spoel over 12 years ago

I've put a fix in CVS for the problems that Rainer had. Matteo, could you please
test whether this helps your simulation as well?

The following files were modified:

include/force.h src/gmxlib/ewald_util.c src/gmxlib/shift_util.c
src/mdlib/rf_util.c src/mdlib/force.c

#64 Updated by Rainer Böckmann over 12 years ago

Thanks a lot!!! Could this problem also have affected proteins in protein-water
simulations, or in mixed protein-lipid systems? I fear that one wouldn't always
identify such artefacts as easily as the lipid pulled out in the reported case.
Best,
Rainer

(In reply to comment #63)

I've put a fix in CVS for the problems that Rainer had. Matteo, could you please
test whether this helps your simulation as well?

The following files were modified:

include/force.h src/gmxlib/ewald_util.c src/gmxlib/shift_util.c
src/mdlib/rf_util.c src/mdlib/force.c

#65 Updated by David van der Spoel over 12 years ago

Yes, anytime you are splitting molecules over processors this could have been a
problem. However most people don't do this because it means you have to reduce
the timestep as well.

#66 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #65)

Yes, anytime you are splitting molecules over processors this could have been a
problem. However most people don't do this because it means you have to reduce
the timestep as well.

I'm currently using:

dt = 0.001
constraints = hbonds
constraint_algorithm = lincs

and I have a big membrane in almost all my systems.

Would this be a problem?

Would it trigger "my exploding water molecule(s)" as well?

Please note that even using the following keywords:

dt = 0.002
constraints = all-bonds
constraint_algorithm = lincs

I got those explosions....

#67 Updated by Chris Neale over 12 years ago

(In reply to comment #57)

Superficially it seems that setting constraints = all-bonds solves the problem.
Please try it. Both Matteo and Rainer apparently used constraints = h-bonds. How
about you, Chris?

I was using all-bonds when I got the error. Here is the mdp file that I used:

title = seriousMD
cpp = /home/cneale/exe/cpp
integrator = md
nsteps = 350000
tinit = 0
dt = 0.002
comm_mode = linear
nstcomm = 1
comm_grps = System
nstxout = 50000
nstvout = 50000
nstfout = 50000
nstlog = 1000
nstlist = 10
nstenergy = 5000
nstxtcout = 5000
ns_type = grid
pbc = xyz
coulombtype = PME
rcoulomb = 0.9
fourierspacing = 0.12
pme_order = 4
vdwtype = cut-off
rvdw_switch = 0
rvdw = 1.4
rlist = 0.9
DispCorr = no
Pcoupl = Berendsen
pcoupltype = semiisotropic
compressibility = 4.5e-5 4.5e-5
ref_p = 1. 1.
tau_p = 4.0 4.0
tcoupl = Berendsen
tc_grps = Protein LDA_POPE_DMPE SOL_NA+
tau_t = 0.1 0.1 0.1
ref_t = 300. 300. 300.
annealing = no
gen_vel = yes
unconstrained-start = no
gen_temp = 300.
gen_seed = 9896
constraints = all-bonds
constraint_algorithm= lincs
lincs-iter = 1
lincs-order = 4

#68 Updated by Chris Neale over 12 years ago

I have been unable to reproduce the problem with a smaller system. In this
process, I have noticed a couple of things:

1. When I was having this problem back in Oct 2006, I was using a double
precision grompp_d and a single precision mdrun_mpi (Matteo, what about you?)

2. I now use openmpi v1.2.1. At that time I was using lam for my MPI. I find
openMPI is much faster and has fewer problems.

I will try to roll everything back and change things one at a time to see what
the problem was and what solved it.

Chris.

#69 Updated by Matteo Guglielmi over 12 years ago

(In reply to comment #68)

I have been unable to reproduce the problem with a smaller system. In this
process, I have noticed a couple of things:

1. When I was having this problem back in Oct 2006, I was using a double
precision grompp_d and a single precision mdrun_mpi (Matteo, what about you?)

Always double precision (never compiled the single precision version of gmx).

2. I now use openmpi v1.2.1. At that time I was using lam for my MPI. I find
openMPI is much faster and has less problems.

I tried openmpi (1.2.1 & 1.2.2) and mpich-mx (mpich tuned for myrinet network,
1.2.7..4), gromacs was crashing everywhere.

I will try to roll everything back and change things one at a time to see what
the problem was and what solved it.

In my case the solution "seems" to be the magic "-DDEBUG_PBC".
(I got only one more single crash after having recompiled gmx - see comment #40)

Chris.

Matteo.

#70 Updated by Chris Neale over 12 years ago

(In reply to comment #69)

In my case the solution "seems" to be the magic "-DDEBUG_PBC".
(I got only one more single crash after having recompiled gmx - see comment #40)

I suppose at this point your best chance for a solution is to reorganize your
system so that you aren't oscillating about a grid boundary for your constant
pressure. It worked perfectly for both Rainer and me. It's not the most elegant
solution, but then again neither is spending months trying to fix some problem
that can be easily avoided.

Also I suggest that this bug be closed and marked as "fixed in the CVS". Based
on comment #63 it appears that the problem that was originally posted has been
solved. It is true that there is much discussion here about something that still
appears to be a bug, but it also appears to be a different bug.

Chris.

#71 Updated by David van der Spoel over 12 years ago

Bug closed now, please open a new one if you have a reproducible case of the other kind of problems.
