Bug #777

Implicit solvent crashes with particle decomposition

Added by Justin Lemkul over 8 years ago. Updated about 8 years ago.

Status: Closed
Priority: High
Assignee:
Category: mdrun
Target version:
Affected version - extra info: 4.5.4
Affected version:
Difficulty: uncategorized

Description

I'm testing some methodology with a robust system (everyone's favorite, 1AKI lysozyme). Running simulations with finite cutoffs works fine (in serial, or in parallel using DD), but if I try to call the all-vs-all kernels (i.e. infinite cutoffs), the simulations instantly crash at step 0 with LINCS warnings. Running all-vs-all in serial, however, produces perfectly stable trajectories, although the simulations are very slow. I should add that finite-cutoff simulations also fail under PD.

The effects are independent of integrator (tested sd and md) and hardware/compilers. The problem is reproducible on two very different systems:

1. x86_64 via OpenMPI 1.4.2 and compiled with cmake with gcc-4.3.4
2. Mac OSX with threads and compiled via autoconf with gcc-4.4.4

All force fields that I've tested (OPLS-AA, CHARMM27, and AMBER03) give the same result. The attached .tpr file uses the AMBER03 force field.
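
For readers trying to reproduce this, here is a minimal sketch of the setup being described. It is my reconstruction, not the contents of the attached .tpr: as far as I know, the 4.5 all-vs-all GB kernels are selected by enabling implicit solvent and setting all cutoffs to zero, and particle decomposition can be requested explicitly with mdrun -pd. File names and the specific option values are placeholders.

; md.mdp fragment (hypothetical values)
integrator           = md
implicit_solvent     = GBSA
gb_algorithm         = OBC
pbc                  = no
nstlist              = 0
rlist                = 0        ; zero cutoffs = "infinite" cutoffs,
rcoulomb             = 0        ; which select the all-vs-all kernels
rvdw                 = 0
rgbradii             = 0
constraints          = h-bonds
constraint_algorithm = lincs

grompp -f md.mdp -c conf.gro -p topol.top -o md_test.tpr
mdrun -nt 1 -deffnm md_test                   # serial: stable but slow
mpirun -np 2 mdrun_mpi -pd -deffnm md_test    # PD in parallel: LINCS warnings / crash at step 0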

md_test.tpr (664 KB) - Input file for implicit solvent run (Justin Lemkul, 07/12/2011 06:40 AM)

Associated revisions

Revision 905c022c (diff)
Added by Berk Hess about 8 years ago

made allocation for LINCS with particle decomposition dynamic, fixes #777

Change-Id: I070b5f8917d12ab64896f62ecc80ed38b011f066

Revision 2fab5f65
Added by Erik Lindahl about 8 years ago

Merge "made allocation for LINCS with particle decomposition dynamic, fixes #777" into release-4-5-patches

History

#1 Updated by Justin Lemkul over 8 years ago

  • Assignee deleted (Berk Hess)

#2 Updated by Mark Abraham over 8 years ago

Single processor also works fine for me.

With

mpirun -np 2 mdrun_mpi -deffnm temp -s md_test

I get

Reading file md_test.tpr, VERSION 4.5.4 (single precision)

Back Off! I just backed up temp.xtc to ./#temp.xtc.7#

Back Off! I just backed up temp.edr to ./#temp.edr.7#

Step 0, time 0 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 0.584110, max 12.815955 (between atoms 466 and 469)
bonds that rotated more than 30 degrees:
 atom 1 atom 2  angle  previous, current, constraint length

Step 0, time 0 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 4.099146, max 40.933979 (between atoms 541 and 543)
bonds that rotated more than 30 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
[vayu1:10085] *** Process received signal ***
[vayu1:10086] *** Process received signal ***
[vayu1:10085] Signal: Segmentation fault (11)
[vayu1:10085] Signal code:  (128)
[vayu1:10085] Failing at address: (nil)
[vayu1:10086] Signal: Segmentation fault (11)
[vayu1:10086] Signal code:  (128)
[vayu1:10086] Failing at address: (nil)
[vayu1:10086] [ 0] /lib64/libpthread.so.0 [0x7ffff5120b10]
[vayu1:10086] [ 1] /apps/openmpi/1.4.3/lib/libopen-pal.so.0(opal_memory_ptmalloc2_int_malloc+0x74a) [0x7ffff74bceba]
[vayu1:10086] [ 2] /apps/openmpi/1.4.3/lib/libopen-pal.so.0 [0x7ffff74be813]
[vayu1:10086] [ 3] /lib64/libc.so.6(__libc_calloc+0x330) [0x7ffff4675d10]
[vayu1:10086] [ 4] mdrun_mpi(save_calloc+0x32) [0x57b362]
[vayu1:10086] [ 5] mdrun_mpi(gmx_fio_fopen+0x129) [0x54afb9]
[vayu1:10086] [ 6] mdrun_mpi(constrain+0xdd4) [0x43e494]
[vayu1:10086] [ 7] mdrun_mpi(do_constrain_first+0x18f) [0x4e573f]
[vayu1:10086] [ 8] mdrun_mpi(do_md+0xe42) [0x428712]
[vayu1:10086] [ 9] mdrun_mpi(mdrunner+0x115a) [0x41ebba]
[vayu1:10086] [10] mdrun_mpi(main+0xa8a) [0x42dd1a]
[vayu1:10086] [11] /lib64/libc.so.6(__libc_start_main+0xf4) [0x7ffff461e994]
[vayu1:10086] [12] mdrun_mpi [0x418629]
[vayu1:10086] *** End of error message ***
[vayu1:10085] [ 0] /lib64/libpthread.so.0 [0x7ffff5120b10]
[vayu1:10085] [ 1] /apps/openmpi/1.4.3/lib/libopen-pal.so.0(opal_memory_ptmalloc2_int_malloc+0x74a) [0x7ffff74bceba]
[vayu1:10085] [ 2] /apps/openmpi/1.4.3/lib/libopen-pal.so.0 [0x7ffff74be813]
[vayu1:10085] [ 3] /lib64/libc.so.6(__libc_calloc+0x330) [0x7ffff4675d10]
[vayu1:10085] [ 4] mdrun_mpi(save_calloc+0x32) [0x57b362]
[vayu1:10085] [ 5] mdrun_mpi(gmx_fio_fopen+0x129) [0x54afb9]
[vayu1:10085] [ 6] mdrun_mpi(constrain+0xdd4) [0x43e494]
[vayu1:10085] [ 7] mdrun_mpi(do_constrain_first+0x18f) [0x4e573f]
[vayu1:10085] [ 8] mdrun_mpi(do_md+0xe42) [0x428712]
[vayu1:10085] [ 9] mdrun_mpi(mdrunner+0x115a) [0x41ebba]
[vayu1:10085] [10] mdrun_mpi(main+0xa8a) [0x42dd1a]
[vayu1:10085] [11] /lib64/libc.so.6(__libc_start_main+0xf4) [0x7ffff461e994]
[vayu1:10085] [12] mdrun_mpi [0x418629]
[vayu1:10085] *** End of error message ***

But with

mpirun -np 3 mdrun_mpi -deffnm temp -s md_test

I get

Reading file md_test.tpr, VERSION 4.5.4 (single precision)

Back Off! I just backed up temp.xtc to ./#temp.xtc.8#

Back Off! I just backed up temp.edr to ./#temp.edr.8#

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.5.4-dev
Source code file: /home/224/mxa224/builds/gromacs_builds/git/release-4-5-patches/src/gmxlib/splitter.c, line: 160

Fatal error:
Constraint dependencies further away than next-neighbor
in particle decomposition. Constraint between atoms 1916--1918 evaluated
on node 2 and 2, but atom 1916 has connections within 4 bonds (lincs_order)
of node 0, and atom 1918 has connections within 4 bonds of node 2.
Reduce the # nodes, lincs_order, or
try domain decomposition.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

"If I Were You I Would Give Me a Break" (F. Black)

Error on node 2, will try to stop all the nodes
Halting parallel program mdrun_mpi on CPU 2 out of 3

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.5.4-dev
Source code file: /home/224/mxa224/builds/gromacs_builds/git/release-4-5-patches/src/gmxlib/splitter.c, line: 160

Fatal error:
Constraint dependencies further away than next-neighbor
in particle decomposition. Constraint between atoms 1916--1918 evaluated
on node 2 and 2, but atom 1916 has connections within 4 bonds (lincs_order)
of node 0, and atom 1918 has connections within 4 bonds of node 2.
Reduce the # nodes, lincs_order, or
try domain decomposition.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

The former seems to have issues with opening a file, and the latter is struggling with constraints. I don't know how to reconcile either of these with the symptoms Justin reported on the mailing list.

#3 Updated by Mark Abraham over 8 years ago

Also, this was with icc 11.1 and OpenMPI 1.4.3.

#4 Updated by Berk Hess about 8 years ago

  • Assignee set to Erik Lindahl

#5 Updated by Rossen Apostolov about 8 years ago

  • Priority changed from Normal to High

#6 Updated by Berk Hess about 8 years ago

  • Status changed from New to 3

I fixed the segmentation fault: there was a fixed buffer allocation of 1000 (YUCK!!!), with the comment:
/* This should really be calculated, but 1000 is a lot for overlapping constraints... */
I don't really understand this code, so I don't know whether the restriction causing the fatal error is easy to relieve. For now you'll have to run it on 2 cores.
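
To illustrate what "made allocation for LINCS with particle decomposition dynamic" amounts to in practice, here is a rough, self-contained C sketch of replacing a fixed 1000-entry buffer with one that grows on demand. This is not the actual GROMACS code, and all names below are hypothetical.

#include <stdio.h>
#include <stdlib.h>

/* Old scheme (the bug): a fixed-size array, e.g.
 *     int index[1000];
 * which overflows once a node has more than 1000 overlapping constraints.
 * New scheme: grow the buffer as entries are added.
 * Illustrative only; types and names are hypothetical. */
typedef struct {
    int *index;   /* stored constraint indices  */
    int  n;       /* number of entries in use   */
    int  nalloc;  /* current allocation size    */
} con_buf_t;

static void con_buf_push(con_buf_t *b, int value)
{
    if (b->n >= b->nalloc) {
        /* over-allocate (~20% extra) so repeated pushes stay cheap */
        b->nalloc = b->nalloc + b->nalloc/5 + 100;
        b->index  = realloc(b->index, b->nalloc * sizeof(*b->index));
        if (b->index == NULL) {
            fprintf(stderr, "out of memory\n");
            exit(1);
        }
    }
    b->index[b->n++] = value;
}

int main(void)
{
    con_buf_t b = { NULL, 0, 0 };
    int i;
    for (i = 0; i < 5000; i++) {  /* well past the old hard limit of 1000 */
        con_buf_push(&b, i);
    }
    printf("stored %d entries, capacity %d\n", b.n, b.nalloc);
    free(b.index);
    return 0;
}

In GROMACS itself the equivalent pattern would be expressed with its own allocation macros (srenew plus an over-allocation factor) rather than bare realloc.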

#7 Updated by Rossen Apostolov about 8 years ago

  • Status changed from 3 to Closed
