Bug #247

Parallel run, 4 CPUs crashes, 1 CPU works

Added by Andreas Kring almost 11 years ago. Updated almost 11 years ago.

Status: Closed
Priority: Normal
Assignee: Erik Lindahl
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Problem submitted to the gmx-users mailing list;
bug report filed at the request of Berk Hess.

I have just installed GROMACS 4.0. My system consists of one hydroxide ion and 999 polarizable water molecules. When the system is minimized (using only one CPU) with a steepest descent minimizer, there are no problems. But when I try to minimize exactly the same system using 4 or 8 CPUs, I get the error below. I use the commands:

$ grompp_d -c system.gro -p topol.top -f steep.mdp

followed by

$ mpirun -np 4 mdrun_d -v
(tpr file attached to this bugreport)

NNODES=4, MYRANK=0, HOSTNAME=yoda
NNODES=4, MYRANK=1, HOSTNAME=yoda
NNODES=4, MYRANK=2, HOSTNAME=yoda
NODEID=0 argc=2
NODEID=2 argc=2
NNODES=4, MYRANK=3, HOSTNAME=yoda
NODEID=1 argc=2
NODEID=3 argc=2
:-) G R O M A C S (-:

Good ROcking Metal Altar for Chronical Sinners
:-)  VERSION 4.0  (-:
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2008, The GROMACS development team,
check out http://www.gromacs.org for more information.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
:-)  mdrun4_d (double precision)  (-:

Option   Filename      Type          Description
------------------------------------------------------------
-s       topol.tpr     Input         Run input file: tpr tpb tpa
-o       traj.trr      Output        Full precision trajectory: trr trj cpt
-x       traj.xtc      Output, Opt.  Compressed trajectory (portable xdr format)
-cpi     state.cpt     Input, Opt.   Checkpoint file
-cpo     state.cpt     Output, Opt.  Checkpoint file
-c       confout.gro   Output        Structure file: gro g96 pdb
-e       ener.edr      Output        Energy file: edr ene
-g       md.log        Output        Log file
-dgdl    dgdl.xvg      Output, Opt.  xvgr/xmgr file
-field   field.xvg     Output, Opt.  xvgr/xmgr file
-table   table.xvg     Input, Opt.   xvgr/xmgr file
-tablep  tablep.xvg    Input, Opt.   xvgr/xmgr file
-tableb  table.xvg     Input, Opt.   xvgr/xmgr file
-rerun   rerun.xtc     Input, Opt.   Trajectory: xtc trr trj gro g96 pdb cpt
-tpi     tpi.xvg       Output, Opt.  xvgr/xmgr file
-tpid    tpidist.xvg   Output, Opt.  xvgr/xmgr file
-ei      sam.edi       Input, Opt.   ED sampling input
-eo      sam.edo       Output, Opt.  ED sampling output
-j       wham.gct      Input, Opt.   General coupling stuff
-jo      bam.gct       Output, Opt.  General coupling stuff
-ffout   gct.xvg       Output, Opt.  xvgr/xmgr file
-devout  deviatie.xvg  Output, Opt.  xvgr/xmgr file
-runav   runaver.xvg   Output, Opt.  xvgr/xmgr file
-px      pullx.xvg     Output, Opt.  xvgr/xmgr file
-pf      pullf.xvg     Output, Opt.  xvgr/xmgr file
-mtx     nm.mtx        Output, Opt.  Hessian matrix
-dn      dipole.ndx    Output, Opt.  Index file

Option        Type    Value       Description
------------------------------------------------------
-[no]h        bool    no          Print help info and quit
-nice         int     0           Set the nicelevel
-deffnm       string              Set the default filename for all file options
-[no]xvgr     bool    yes         Add specific codes (legends etc.) in the output
                                  xvg files for the xmgrace program
-[no]pd       bool    no          Use particle decompostion
-dd           vector  0 0 0       Domain decomposition grid, 0 is optimize
-npme         int     -1          Number of separate nodes to be used for PME, -1
                                  is guess
-ddorder      enum    interleave  DD node order: interleave, pp_pme or cartesian
-[no]ddcheck  bool    yes         Check for all bonded interactions with DD
-rdd          real    0           The maximum distance for bonded interactions with
                                  DD (nm), 0 is determine from initial coordinates
-rcon         real    0           Maximum distance for P-LINCS (nm), 0 is estimate
-dlb          enum    auto        Dynamic load balancing (with DD): auto, no or yes
-dds          real    0.8         Minimum allowed dlb scaling of the DD cell size
-[no]sum      bool    yes         Sum the energies at every step
-[no]v        bool    yes         Be loud and noisy
-[no]compact  bool    yes         Write a compact log file
-[no]seppot   bool    no          Write separate V and dVdl terms for each
                                  interaction type and node to the log file(s)
-pforce       real    -1          Print all forces larger than this (kJ/mol nm)
-[no]reprod   bool    no          Try to avoid optimizations that affect binary
                                  reproducibility
-cpt          real    15          Checkpoint interval (minutes)
-[no]append   bool    no          Append to previous output files when restarting
                                  from checkpoint
-maxh         real    -1          Terminate after 0.99 times this time (hours)
-multi        int     0           Do multiple simulations in parallel
-replex       int     0           Attempt replica exchange every # steps
-reseed       int     -1          Seed for replica exchange, -1 is generate a seed
-[no]glas     bool    no          Do glass simulation with special long range
                                  corrections
-[no]ionize   bool    no          Do a simulation including the effect of an X-Ray
                                  bombardment on your system

Getting Loaded...
Reading file topol.tpr, VERSION 4.0 (double precision)
Loaded with Money

Making 1D domain decomposition 4 x 1 x 1
Steepest Descents:
Tolerance (Fmax) = 1.00000e+01
Number of steps = 100000

A list of missing interactions:
exclusions of 14987 missing -1

-------------------------------------------------------
Program mdrun4_d, VERSION 4.0
Source code file: domdec_top.c, line: 337

Software inconsistency error:
One or more interactions were multiple assigned in the domain decompostion
-------------------------------------------------------

"Catholic School Girls Rule" (Red Hot Chili Peppers)

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun4_d on CPU 0 out of 4

gcq#136: "Catholic School Girls Rule" (Red Hot Chili Peppers)

topol.tpr (285 KB) - tpr file - Andreas Kring, 11/05/2008 10:04 AM

History

#1 Updated by Berk Hess almost 11 years ago

There is no tpr file attached.

And at what step did you get this error?
(before) step 0?

Berk

#2 Updated by Andreas Kring almost 11 years ago

Created an attachment (id=320)
tpr file

#3 Updated by Andreas Kring almost 11 years ago

Sorry, the tpr file should be attached now.

The crash was probably before or at step 0 as you suggest. I did run mdrun with the -v flag, i.e.

$ mpirun -np 4 mdrun_d -v

/Andreas

#4 Updated by Berk Hess almost 11 years ago

I fixed the bug in CVS head and for 4.0.1.
I forgot to reset the exclusion list after checking
the exclusions for each atom in a moleculetype.
Therefore things would go wrong when the first atoms
exclude all others, but later ones do not.
In the standard force fields this situation probably
never occurs, but in your system it does.

If you want to fix it now in your source,
you need to move a loop into the loop in src/mdlib/force.c

546,548d545
< for(ai=a0; ai<a1; ai++) {
<     bExcl[ai-a0] = FALSE;
< }
551a549,552
> /* Clear the exclusion list for atom ai */
> for(ai=a0; ai<a1; ai++) {
>     bExcl[ai-a0] = FALSE;
> }

Berk

#5 Updated by Andreas Kring almost 11 years ago

I downloaded the file force.c from the CVS Head repository and replaced the old force.c (the one from gromacs-4.0.tar.gz) with the new one and compiled the code again.

I think there is still a bug, but I do not know if it is related to what you have just corrected.

When I start the simulation as described at the top of this bug report, everything seems to work as it should, but the simulation never finishes. In the mdp file I have set the number of steps to 50000. When the simulation reaches the last step, the output suddenly stops, although all 8 CPUs keep running at 100% according to 'top' until I stop them manually. The final output from

$ tail -f md.log

is this

...
DD step 49998 load imb.: force 20.4% pme mesh/force 0.159

           Step           Time         Lambda
          49999    49999.00000        0.00000
Energies (kJ/mol)
        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.   Polarization
   -1.09981e+03   -3.58248e+02   -2.14315e+04   -5.40975e+03    4.36763e+03
      Potential Pressure (bar)  Cons. rmsd ()
   -2.39316e+04    9.88109e+03    0.00000e+00

DD step 49999 load imb.: force 20.5% pme mesh/force 0.159

           Step           Time         Lambda
          50000    50000.00000        0.00000
Energies (kJ/mol)
        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.   Polarization
   -1.09981e+03   -3.58248e+02   -2.14315e+04   -5.40975e+03    4.36763e+03
      Potential Pressure (bar)  Cons. rmsd ()
   -2.39316e+04    9.89186e+03    0.00000e+00

and the final output from the terminal, where I launched the command

$ mpirun -np 8 mdrun_d

is this

...
-replex       int     0           Attempt replica exchange every # steps
-reseed       int     -1          Seed for replica exchange, -1 is generate a seed
-[no]glas     bool    no          Do glass simulation with special long range
                                  corrections
-[no]ionize   bool    no          Do a simulation including the effect of an X-Ray
                                  bombardment on your system

Reading file topol.tpr, VERSION 4.0 (double precision)
NNODES=8, MYRANK=5, HOSTNAME=yoda
NODEID=5 argc=3

Will use 6 particle-particle and 2 PME only nodes
This is a guess, check the performance at the end of the log file
Making 1D domain decomposition 6 x 1 x 1
Steepest Descents:
Tolerance (Fmax) = 1.00000e+01
Number of steps = 50000

writing lowest energy coordinates.

Steepest Descents did not converge to Fmax < 10 in 50001 steps.
Potential Energy = -2.39316369053479e+04
Maximum force = 2.51619503503133e+03 on atom 3
Norm of force = 7.90806760075745e+01

That's it - no funny quote or anything?
Is this related to Bug 247 or should I file a new bugzilla?

Regards
Andreas

#6 Updated by Berk Hess almost 11 years ago

Ah, that is a bug with minimization and separate PME nodes.
I have fixed it in CVS and for 4.0.

You can update (cvs checkout -j release-4-0-patches gmx
will give you the release branch),
or run minimization with mdrun -npme 0 until 4.0.1
comes out, which should be soon now that there are many small bug fixes.

Berk

#7 Updated by Andreas Kring almost 11 years ago

Great - thanks again for the fast bugfix.

I have a small problem with the CVS though:

I did the following, based on the instructions on the wiki and in your mail:

$ cvs -z3 -d :pserver::/home/gmx/cvs login
Password: (hit enter)
$ cvs -z3 -d :pserver::/home/gmx/cvs co gmx
$ cd gmx
$ cvs checkout -j release-4-0-patches gmx

This seems to work fine, i.e. a lot of files are downloaded, but after running the above commands the 'configure' file is missing.

There is a 'config' directory, but I'm not quite sure what to do with it.

Regards
Andreas

#8 Updated by Berk Hess almost 11 years ago

You have to run the bootstrap script first.

But you have now checked out the head tree (with the co command)
and the branch on top of that.
Things might be ok, but it is probably better to remove everything
and only check out the release branch.

Berk
