Bug #267

distance restraints and domain decomposition error

Added by Anonymous about 11 years ago. Updated about 11 years ago.

Status: Closed
Priority: Normal
Assignee: Erik Lindahl
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Created an attachment (id=329)
This is the tpr file for the full 1 ns md simulation

Several attempts have been made to run a simulation of a 13-residue peptide fragment with distance restraints, in parallel. I used this exact command line for mdrun:

nohup mpiexec -n 4 nice -19 mdrunmpi -s md.tpr -c md.gro -o md.trr -x md.xtc -e md.edr -g md.log < /dev/null &

The simulation stopped almost immediately, generating this nohup.out file:

NNODES=4, MYRANK=0, HOSTNAME=chong06.chem.pitt.edu
NNODES=4, MYRANK=1, HOSTNAME=chong06.chem.pitt.edu
NNODES=4, MYRANK=3, HOSTNAME=chong06.chem.pitt.edu
NNODES=4, MYRANK=2, HOSTNAME=chong06.chem.pitt.edu
NODEID=0 argc=13
NODEID=1 argc=13
NODEID=2 argc=13
NODEID=3 argc=13
:-) G R O M A C S (-:

Great Red Oystrich Makes All Chemists Sane
:-)  VERSION 4.0  (-:
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2008, The GROMACS development team,
check out http://www.gromacs.org for more information.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
:-)  mdrunmpi  (-:

Option Filename Type Description
------------------------------------------------------------
-s md.tpr Input Run input file: tpr tpb tpa
-o md.trr Output Full precision trajectory: trr trj cpt
-x md.xtc Output, Opt! Compressed trajectory (portable xdr format)
-cpi state.cpt Input, Opt. Checkpoint file
-cpo state.cpt Output, Opt. Checkpoint file
-c md.gro Output Structure file: gro g96 pdb
-e md.edr Output Energy file: edr ene
-g md.log Output Log file
-dgdl dgdl.xvg Output, Opt. xvgr/xmgr file
-field field.xvg Output, Opt. xvgr/xmgr file
-table table.xvg Input, Opt. xvgr/xmgr file
-tablep tablep.xvg Input, Opt. xvgr/xmgr file
-tableb table.xvg Input, Opt. xvgr/xmgr file
-rerun rerun.xtc Input, Opt. Trajectory: xtc trr trj gro g96 pdb cpt
-tpi tpi.xvg Output, Opt. xvgr/xmgr file
-tpid tpidist.xvg Output, Opt. xvgr/xmgr file
-ei sam.edi Input, Opt. ED sampling input
-eo sam.edo Output, Opt. ED sampling output
-j wham.gct Input, Opt. General coupling stuff
-jo bam.gct Output, Opt. General coupling stuff
-ffout gct.xvg Output, Opt. xvgr/xmgr file
-devout deviatie.xvg Output, Opt. xvgr/xmgr file
-runav runaver.xvg Output, Opt. xvgr/xmgr file
-px pullx.xvg Output, Opt. xvgr/xmgr file
-pf pullf.xvg Output, Opt. xvgr/xmgr file
-mtx nm.mtx Output, Opt. Hessian matrix
-dn dipole.ndx Output, Opt. Index file

Option Type Value Description
------------------------------------------------------
-[no]h bool no Print help info and quit
-nice int 0 Set the nicelevel
-deffnm string Set the default filename for all file options
-[no]xvgr bool yes Add specific codes (legends etc.) in the output
xvg files for the xmgrace program
-[no]pd bool no Use particle decompostion
-dd vector 0 0 0 Domain decomposition grid, 0 is optimize
-npme int -1 Number of separate nodes to be used for PME, -1
is guess
-ddorder enum interleave DD node order: interleave, pp_pme or cartesian
-[no]ddcheck bool yes Check for all bonded interactions with DD
-rdd real 0 The maximum distance for bonded interactions with
DD (nm), 0 is determine from initial coordinates
-rcon real 0 Maximum distance for P-LINCS (nm), 0 is estimate
-dlb enum auto Dynamic load balancing (with DD): auto, no or yes
-dds real 0.8 Minimum allowed dlb scaling of the DD cell size
-[no]sum bool yes Sum the energies at every step
-[no]v bool no Be loud and noisy
-[no]compact bool yes Write a compact log file
-[no]seppot bool no Write separate V and dVdl terms for each
interaction type and node to the log file(s)
-pforce real -1 Print all forces larger than this (kJ/mol nm)
-[no]reprod bool no Try to avoid optimizations that affect binary
reproducibility
-cpt real 15 Checkpoint interval (minutes)
-[no]append bool no Append to previous output files when restarting
from checkpoint
-maxh real -1 Terminate after 0.99 times this time (hours)
-multi int 0 Do multiple simulations in parallel
-replex int 0 Attempt replica exchange every # steps
-reseed int -1 Seed for replica exchange, -1 is generate a seed
-[no]glas bool no Do glass simulation with special long range
corrections
-[no]ionize bool no Do a simulation including the effect of an X-Ray
bombardment on your system

Reading file md.tpr, VERSION 4.0 (single precision)

NOTE: atoms involved in distance restraints should be within the longest cut-off distance, if this is not the case mdrun generates a fatal error, in that case use particle decomposition (mdrun option -pd)

WARNING: Can not write distance restraint data to energy file with domain decomposition
rank 0 in job 57 chong06.chem.pitt.edu_35438 caused collective abort of all ranks
exit status of rank 0: killed by signal 9

As you can see, only a warning is generated; no "fatal error" is written anywhere in the output file.

md.tpr (284 KB): This is the tpr file for the full 1 ns md simulation. Anonymous, 12/09/2008 11:47 PM
mshift.c (20.9 KB): fixed src/gmxlib/mshift.c. Berk Hess, 12/11/2008 03:26 PM

History

#1 Updated by Berk Hess about 11 years ago

There are several issues here.

The crash is because of a bug where mdrun wants to print
an error, but the log file is not present on all processors
and the code does not check for this.
Running mdrun on one processor will give this (cryptic)
error/warning in the log file:

There were 14 inconsistent shifts. Check your topology

The problem is that you have distance restraints
at 3 nm distance. You said on gmx-users that all restraint
distances are within 1 nm, but this does not seem to be true.

In principle this should work (although not with domain
decomposition, since 3 nm is more than half the box length).
But the current graph code cannot handle bonded interactions
that span more than half the box length.
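
To illustrate why half the box length is the limit, here is a minimal sketch of the minimum-image convention along one dimension. The box length of 4.0 nm is an assumed value for illustration only (the actual box size is not stated in this report), and `minimum_image` is a hypothetical helper, not a GROMACS function.

```python
def minimum_image(dx, box):
    """Return the shortest periodic image of displacement dx for box length box."""
    return dx - box * round(dx / box)

box = 4.0  # nm, assumed for illustration only

# A 1 nm separation is below box/2, so it is recovered unchanged.
print(minimum_image(1.0, box))

# A 3 nm separation is above box/2, so the nearest periodic image
# (1 nm in the opposite direction) is picked instead of the real partner.
print(minimum_image(3.0, box))
```

Once a restrained pair is further apart than half the box, code that relies on the nearest image ends up pairing an atom with the wrong periodic copy of its partner.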

We should come up with a solution for this.

David and Erik:
Currently the graph is built up in two steps,
first the chembond interactions, then all bonded
interactions. Currently this is useless, since the end
result is simply a graph with all interactions.
What we should do is implement an algorithm that walks
over the graph with only the chembonds, so we know
if it consists of multiple unlinked parts.
Then we should only add the non-chembond bonded interactions
if they connect two different parts.
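
The idea above could be sketched roughly as follows. This is only an illustration using a union-find structure, not the code in mshift.c; the function names and the toy atom indices are hypothetical. One assumption goes beyond the description: once a non-chembond interaction links two parts, they are treated as a single part afterwards.

```python
def find(parent, i):
    """Return the representative of the part containing atom i."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i

def build_graph_edges(n_atoms, chembonds, other_bonded):
    """Keep all chembonds; keep other bonded pairs only if they link two parts."""
    parent = list(range(n_atoms))
    # Step 1: walk the chembonds to find the unlinked parts.
    for a, b in chembonds:
        parent[find(parent, a)] = find(parent, b)
    edges = list(chembonds)
    # Step 2: add a non-chembond interaction only if it connects two parts.
    for a, b in other_bonded:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            edges.append((a, b))
            parent[ra] = rb  # assumption: the two parts now count as one
    return edges

# Toy system (hypothetical): atoms 0-2 and 3-4 are two molecules, atom 5 is alone.
chembonds = [(0, 1), (1, 2), (3, 4)]
other_bonded = [(0, 2), (2, 3), (0, 4), (4, 5)]
print(build_graph_edges(6, chembonds, other_bonded))
```

In this toy case the pairs (0, 2) and (0, 4) are skipped because their atoms are already in the same part when they are examined, so a long-range restraint inside one molecule never enters the graph.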

Berk

#2 Updated by Berk Hess about 11 years ago

Created an attachment (id=330)
fixed src/gmxlib/mshift.c

#3 Updated by Berk Hess about 11 years ago

Hi,

I fixed the error print bug,
as well as the actual problem that previously
caused the inconsistent shift warnings.

I have attached the fixed mshift.c file.

Berk

#4 Updated by Anonymous about 11 years ago

Thanks for all your attention to this bug! I'm just not sure why the distance restraints come out as being 3 nm apart. I'm fairly certain that they are all under 1 nm apart. This is an excerpt of the distance restraints I added to the topology file:

[distance_restraints]
;ai aj type index type' low up1 up2 fac
1 72 1 0 1 0.55 0.60 0.65 1.0
5 67 1 1 1 0.52 0.57 0.62 1.0
5 70 1 2 1 0.54 0.59 0.64 1.0
5 72 1 3 1 0.46 0.51 0.56 1.0
16 67 1 4 1 0.51 0.56 0.61 1.0
16 70 1 5 1 0.52 0.57 0.62 1.0
16 72 1 6 1 0.45 0.50 0.55 1.0
5 82 1 7 1 0.54 0.59 0.64 1.0
5 84 1 8 1 0.47 0.52 0.57 1.0
5 88 1 9 1 0.48 0.53 0.58 1.0
16 82 1 10 1 0.51 0.56 0.61 1.0

I keep the force constant at 1.0 throughout, and I have set up1 and up2 above the low value by 0.05 nm and 0.10 nm, respectively. My low value is the distance to which I want the atoms to be restrained, and the cut-off distance I'm using for restraining atoms is 0.55 nm. Also, the atom pairs are at least 3 residues apart. Maybe I am doing something wrong when I add the restraints to the topology file? But I'm still not sure why it says the distance restraints are 3 nm apart.
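
The offsets in the excerpt above can be checked with a few lines of Python. This is just a sketch: `bound_offsets` is a hypothetical helper, and only three of the rows are copied in.

```python
rows = """\
; ai aj type index type' low up1 up2 fac
1 72 1 0 1 0.55 0.60 0.65 1.0
5 67 1 1 1 0.52 0.57 0.62 1.0
16 72 1 6 1 0.45 0.50 0.55 1.0
"""

def bound_offsets(text):
    """Return (up1 - low, up2 - low) in nm for every restraint row."""
    offsets = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith(";"):
            continue  # skip blank lines and comments
        low, up1, up2 = (float(x) for x in fields[5:8])
        offsets.append((round(up1 - low, 3), round(up2 - low, 3)))
    return offsets

print(bound_offsets(rows))
```

For these rows every pair has up1 at 0.05 nm and up2 at 0.10 nm above low.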

Maria

#5 Updated by Berk Hess about 11 years ago

The upcoming version of mdrun prints more information about bonded distances.
For your system:

Initial maximum inter charge-group distances:
two-body bonded interactions: 2.231 nm, Dis. Rest., atoms 48 114

So your bounds might be small, but the distance restraint still
has to work beyond the bound (at least up to 2.23 nm in this case).
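
To make this concrete, here is a sketch of the force from the piecewise distance restraint potential described in the GROMACS manual (harmonic below low, zero between low and up1, harmonic between up1 and up2, linear beyond up2). The force constant of 1000 kJ/mol/nm^2 is an assumed value, and the bounds are borrowed from the first row of the excerpt above purely for illustration, since the bounds for atoms 48 and 114 are not shown here.

```python
def disres_force(r, r0, r1, r2, k):
    """Return the restraint force -dV/dr (kJ/mol/nm) at distance r (nm)."""
    if r < r0:
        return -k * (r - r0)   # pushes the atoms apart
    if r < r1:
        return 0.0             # inside the flat region
    if r < r2:
        return -k * (r - r1)   # harmonic pull back
    return -k * (r2 - r1)      # constant pull beyond up2

k = 1000.0                     # assumed force constant
r0, r1, r2 = 0.55, 0.60, 0.65  # low, up1, up2 from the first row (illustrative)

print(disres_force(0.57, r0, r1, r2, k))   # within the bounds: no force
print(disres_force(2.231, r0, r1, r2, k))  # far beyond up2: still a nonzero pull
```

The force does not vanish beyond up2, so the pair still has to be found and evaluated at its actual separation, however small the bounds are.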

Berk
