Bug #777
Implicit solvent crashes with particle decomposition
Description
I'm testing some methodology with a robust system (everyone's favorite, 1AKI lysozyme). Simulations with finite cutoffs work fine, in serial or in parallel using DD, but if I invoke the all-vs-all kernels (i.e., infinite cutoffs), the runs crash instantly at step 0 with LINCS warnings. Running all-vs-all in serial, however, produces perfectly stable trajectories, though the simulations are very slow. I should add that finite-cutoff simulations also fail under PD.
The effects are independent of integrator (tested sd and md) and hardware/compilers. The problem is reproducible on two very different systems:
1. x86_64 with OpenMPI 1.4.2, compiled with CMake using gcc-4.3.4
2. Mac OS X with threads, compiled via autoconf using gcc-4.4.4
All force fields that I've tested (OPLS-AA, CHARMM27, and AMBER03) give the same result. The attached .tpr file uses the AMBER03 force field.
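For reference, by "all-vs-all" I mean the implicit-solvent setup with zero (i.e. infinite) cutoffs. The .mdp fragment below shows the kind of setup I'm using; the exact values are illustrative rather than copied from the attached .tpr:

    integrator       = md      ; sd gives the same result
    pbc              = no
    implicit_solvent = GBSA
    gb_algorithm     = OBC
    rlist            = 0       ; zero cutoffs select the all-vs-all kernels
    rcoulomb         = 0
    rvdw             = 0
    rgbradii         = 0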
Associated revisions
Merge "made allocation for LINCS with particle decomposition dynamic, fixes #777" into release-4-5-patches
History
#1 Updated by Justin Lemkul over 9 years ago
- Assignee deleted (Berk Hess)
#2 Updated by Mark Abraham over 9 years ago
Single processor also works fine for me.
With
mpirun -np 2 mdrun_mpi -deffnm temp -s md_test
I get
Reading file md_test.tpr, VERSION 4.5.4 (single precision)

Back Off! I just backed up temp.xtc to ./#temp.xtc.7#

Back Off! I just backed up temp.edr to ./#temp.edr.7#

Step 0, time 0 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 0.584110, max 12.815955 (between atoms 466 and 469)
bonds that rotated more than 30 degrees:
 atom 1 atom 2  angle  previous, current, constraint length

Step 0, time 0 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 4.099146, max 40.933979 (between atoms 541 and 543)
bonds that rotated more than 30 degrees:
 atom 1 atom 2  angle  previous, current, constraint length

[vayu1:10085] *** Process received signal ***
[vayu1:10086] *** Process received signal ***
[vayu1:10085] Signal: Segmentation fault (11)
[vayu1:10085] Signal code:  (128)
[vayu1:10085] Failing at address: (nil)
[vayu1:10086] Signal: Segmentation fault (11)
[vayu1:10086] Signal code:  (128)
[vayu1:10086] Failing at address: (nil)
[vayu1:10086] [ 0] /lib64/libpthread.so.0 [0x7ffff5120b10]
[vayu1:10086] [ 1] /apps/openmpi/1.4.3/lib/libopen-pal.so.0(opal_memory_ptmalloc2_int_malloc+0x74a) [0x7ffff74bceba]
[vayu1:10086] [ 2] /apps/openmpi/1.4.3/lib/libopen-pal.so.0 [0x7ffff74be813]
[vayu1:10086] [ 3] /lib64/libc.so.6(__libc_calloc+0x330) [0x7ffff4675d10]
[vayu1:10086] [ 4] mdrun_mpi(save_calloc+0x32) [0x57b362]
[vayu1:10086] [ 5] mdrun_mpi(gmx_fio_fopen+0x129) [0x54afb9]
[vayu1:10086] [ 6] mdrun_mpi(constrain+0xdd4) [0x43e494]
[vayu1:10086] [ 7] mdrun_mpi(do_constrain_first+0x18f) [0x4e573f]
[vayu1:10086] [ 8] mdrun_mpi(do_md+0xe42) [0x428712]
[vayu1:10086] [ 9] mdrun_mpi(mdrunner+0x115a) [0x41ebba]
[vayu1:10086] [10] mdrun_mpi(main+0xa8a) [0x42dd1a]
[vayu1:10086] [11] /lib64/libc.so.6(__libc_start_main+0xf4) [0x7ffff461e994]
[vayu1:10086] [12] mdrun_mpi [0x418629]
[vayu1:10086] *** End of error message ***
[vayu1:10085] [ 0] /lib64/libpthread.so.0 [0x7ffff5120b10]
[vayu1:10085] [ 1] /apps/openmpi/1.4.3/lib/libopen-pal.so.0(opal_memory_ptmalloc2_int_malloc+0x74a) [0x7ffff74bceba]
[vayu1:10085] [ 2] /apps/openmpi/1.4.3/lib/libopen-pal.so.0 [0x7ffff74be813]
[vayu1:10085] [ 3] /lib64/libc.so.6(__libc_calloc+0x330) [0x7ffff4675d10]
[vayu1:10085] [ 4] mdrun_mpi(save_calloc+0x32) [0x57b362]
[vayu1:10085] [ 5] mdrun_mpi(gmx_fio_fopen+0x129) [0x54afb9]
[vayu1:10085] [ 6] mdrun_mpi(constrain+0xdd4) [0x43e494]
[vayu1:10085] [ 7] mdrun_mpi(do_constrain_first+0x18f) [0x4e573f]
[vayu1:10085] [ 8] mdrun_mpi(do_md+0xe42) [0x428712]
[vayu1:10085] [ 9] mdrun_mpi(mdrunner+0x115a) [0x41ebba]
[vayu1:10085] [10] mdrun_mpi(main+0xa8a) [0x42dd1a]
[vayu1:10085] [11] /lib64/libc.so.6(__libc_start_main+0xf4) [0x7ffff461e994]
[vayu1:10085] [12] mdrun_mpi [0x418629]
[vayu1:10085] *** End of error message ***
But with
mpirun -np 3 mdrun_mpi -deffnm temp -s md_test
I get
Reading file md_test.tpr, VERSION 4.5.4 (single precision)

Back Off! I just backed up temp.xtc to ./#temp.xtc.8#

Back Off! I just backed up temp.edr to ./#temp.edr.8#

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.5.4-dev
Source code file: /home/224/mxa224/builds/gromacs_builds/git/release-4-5-patches/src/gmxlib/splitter.c, line: 160

Fatal error:
Constraint dependencies further away than next-neighbor in particle decomposition.
Constraint between atoms 1916--1918 evaluated on node 2 and 2,
but atom 1916 has connections within 4 bonds (lincs_order) of node 0,
and atom 1918 has connections within 4 bonds of node 2.
Reduce the # nodes, lincs_order, or try domain decomposition.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

"If I Were You I Would Give Me a Break" (F. Black)

Error on node 2, will try to stop all the nodes
Halting parallel program mdrun_mpi on CPU 2 out of 3

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.5.4-dev
Source code file: /home/224/mxa224/builds/gromacs_builds/git/release-4-5-patches/src/gmxlib/splitter.c, line: 160

Fatal error:
Constraint dependencies further away than next-neighbor in particle decomposition.
Constraint between atoms 1916--1918 evaluated on node 2 and 2,
but atom 1916 has connections within 4 bonds (lincs_order) of node 0,
and atom 1918 has connections within 4 bonds of node 2.
Reduce the # nodes, lincs_order, or try domain decomposition.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
The former seems to have issues with opening a file, and the latter is struggling with constraints. I don't know how to reconcile either of these with the symptoms Justin reported on the mailing list.
#3 Updated by Mark Abraham over 9 years ago
Also, this was with icc 11.1 and OpenMPI 1.4.3
#4 Updated by Berk Hess over 9 years ago
- Assignee set to Erik Lindahl
#5 Updated by Rossen Apostolov over 9 years ago
- Priority changed from Normal to 6
#6 Updated by Berk Hess over 9 years ago
- Status changed from New to 3
I fixed the segmentation fault: there was a fixed buffer allocation of 1000 (YUCK!!!), with the comment:
/* This should really be calculated, but 1000 is a lot for overlapping constraints... */
I don't really understand this code, so I don't know whether the restriction causing the fatal error is easy to relax. For now you'll have to run it on 2 cores.
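In plain C terms, the change just sizes that buffer from the data instead of hard-coding it. A rough sketch of the pattern (the names here are illustrative, not the actual patch, which goes through our own allocation helpers such as snew/srenew):

    #include <stdlib.h>

    /* Illustrative sketch: grow 'buf' (currently *nalloc entries) so it can
     * hold at least 'needed' entries, instead of assuming a fixed maximum
     * of 1000 overlapping constraints. */
    static int *ensure_capacity(int *buf, int *nalloc, int needed)
    {
        if (needed > *nalloc)
        {
            int *tmp;

            *nalloc = needed + needed/2;    /* headroom to limit reallocations */
            tmp     = realloc(buf, (*nalloc)*sizeof(*buf));
            if (tmp == NULL)
            {
                free(buf);
                abort();                    /* out of memory */
            }
            buf = tmp;
        }
        return buf;
    }

Overflowing the old fixed-size array would corrupt the heap, which would explain why the -np 2 crash showed up later inside the allocator (calloc) rather than in the LINCS code itself.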
#7 Updated by Rossen Apostolov over 9 years ago
- Status changed from 3 to Closed
made allocation for LINCS with particle decomposition dynamic, fixes #777
Change-Id: I070b5f8917d12ab64896f62ecc80ed38b011f066