Bug #2095

Seg Fault when running flat-bottom position restraints with MPI

Added by Yunlong Liu almost 3 years ago. Updated over 1 year ago.

Status: Closed
Priority: Normal
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

I compiled GROMACS (git master branch & 2016.1 release) with the following settings (a rough CMake invocation along these lines is sketched after the list):

+ GCC 5.2.0 / GCC 4.9.2
+ Open MPI 2.0.1 / MPICH 3.2
+ OpenMP enabled
+ FFTW 3.3.5
+ AVX2_256
+ CUDA 7.5
+ CUDA_HOST_COMPILER 4.9.2
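These settings map onto a CMake configuration roughly like the sketch below; the install prefix and compiler path are illustrative, not copied from the actual build:

cmake .. \
    -DCMAKE_INSTALL_PREFIX=$HOME/opt2 \
    -DGMX_MPI=ON \
    -DGMX_OPENMP=ON \
    -DGMX_GPU=ON \
    -DGMX_SIMD=AVX2_256 \
    -DGMX_FFT_LIBRARY=fftw3 \
    -DCUDA_HOST_COMPILER=/usr/bin/gcc-4.9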

In my position restraint topology files, I applied flat-bottom position restraints to three atoms. But when I started my GROMACS job with

mpirun -np 4 gmx_mpi mdrun ...

OpenMPI reported a segmentation fault:
[gpu072:50339] *** Process received signal ***
[gpu072:50339] Signal: Segmentation fault (11)
[gpu072:50339] Signal code: Address not mapped (1)
[gpu072:50339] Failing at address: (nil)
[gpu072:50338] *** Process received signal ***
[gpu072:50338] Signal: Segmentation fault (11)
[gpu072:50338] Signal code: Address not mapped (1)
[gpu072:50338] Failing at address: (nil)
[gpu072:50339] [ 0] /lib64/libpthread.so.0(+0xf790)[0x2aaaaf001790]
[gpu072:50339] [ 1] [gpu072:50338] [ 0] /lib64/libpthread.so.0(+0xf790)[0x2aaaaf001790]
[gpu072:50338] [ 1] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(+0x49662b)[0x2aaaab16362b]
[gpu072:50339] [ 2] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(+0x49662b)[0x2aaaab16362b]
[gpu072:50338] [ 2] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(+0x497fe2)[0x2aaaab164fe2]
[gpu072:50339] [ 3] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(+0x497fe2)[0x2aaaab164fe2]
[gpu072:50338] [ 3] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(_Z17dd_make_local_topP12gmx_domdec_tP18gmx_domdec_zones_tiPA3_fPfPiP10t_forcerecS4_P11gmx_vsite_tPK10gmx_mtop_tP14gmx_localtop_t+0x354)[0x2aaaab1654bd]
[gpu072:50339] [ 4] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(_Z17dd_make_local_topP12gmx_domdec_tP18gmx_domdec_zones_tiPA3_fPfPiP10t_forcerecS4_P11gmx_vsite_tPK10gmx_mtop_tP14gmx_localtop_t+0x354)[0x2aaaab1654bd]
[gpu072:50338] [ 4] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(_Z19dd_partition_systemP8_IO_FILElP9t_commreciiP7t_statePK10gmx_mtop_tPK10t_inputrecS4_PSt6vectorIN3gmx11BasicVectorIfEESaISE_EEP9t_mdatomsP14gmx_localtop_tP10t_forcerecP11gmx_vsite_tP10gmx_constrP6t_nrnbP13gmx_wallcyclei+0x1464)[0x2aaaab15c890]
[gpu072:50339] [ 5] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(_Z19dd_partition_systemP8_IO_FILElP9t_commreciiP7t_statePK10gmx_mtop_tPK10t_inputrecS4_PSt6vectorIN3gmx11BasicVectorIfEESaISE_EEP9t_mdatomsP14gmx_localtop_tP10t_forcerecP11gmx_vsite_tP10gmx_constrP6t_nrnbP13gmx_wallcyclei+0x1464)[0x2aaaab15c890]
[gpu072:50338] [ 5] gmx_mpi[0x429f6e]
[gpu072:50339] [ 6] gmx_mpi[0x423b91]
[gpu072:50339] [ 7] gmx_mpi[0x429f6e]
[gpu072:50338] [ 6] gmx_mpi[0x423b91]
[gpu072:50338] [ 7] gmx_mpi[0x428150]
[gpu072:50339] [ 8] gmx_mpi[0x428150]
[gpu072:50338] [ 8] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(+0x452977)[0x2aaaab11f977]
[gpu072:50339] [ 9] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(+0x452977)[0x2aaaab11f977]
[gpu072:50338] [ 9] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(_ZN3gmx24CommandLineModuleManager3runEiPPc+0x38d)[0x2aaaab12142d]
[gpu072:50339] [10] /home-4/yliu120@jhu.edu/opt2/lib64/libgromacs_mpi.so.3(_ZN3gmx24CommandLineModuleManager3runEiPPc+0x38d)[0x2aaaab12142d]
[gpu072:50338] [10] gmx_mpi[0x41941c]
[gpu072:50338] [11] gmx_mpi[0x41941c]
[gpu072:50339] [11] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2aaaaf22dd5d]
[gpu072:50338] [12] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2aaaaf22dd5d]
[gpu072:50339] [12] gmx_mpi[0x419299]
[gpu072:50338] *** End of error message ***
gmx_mpi[0x419299]
[gpu072:50339] *** End of error message ***

The OpenMPI stack trace shows that the segfault occurs in the dd_make_local_top() function declared in domdec.h.

However, when I dropped mpirun and ran the same tpr with a single process and multiple threads, I did not get any seg fault.
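For context, flat-bottomed position restraints in a GROMACS topology use function type 2 in the [ position_restraints ] section. A minimal sketch for three restrained atoms looks like the following; the atom indices and parameter values are illustrative, not taken from the attached system (g selects the restraint geometry, r is the radius in nm, k the force constant):

[ position_restraints ]
;  ai  funct  g      r       k
    1      2  1    0.5    1000
    2      2  1    0.5    1000
    3      2  1    0.5    1000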

I attached the tpr file that can trigger this seg fault.

step6.5_equilibration.tpr (7.42 MB) - Yunlong Liu, 12/29/2016 05:58 AM

Related issues

Has duplicate GROMACS - Bug #2236: FEP calculation with flat bottom restraints (Closed)

Associated revisions

Revision 9a45db56 (diff)
Added by Berk Hess almost 3 years ago

Fix flat-bottom position restraints + DD + OpenMP

When using flat-bottom position restraints with DD and OpenMP
a (re)allocation was missing, causing a segv.

Fixes #2095.

Change-Id: I03af546a0b8d03a3d384d86a2582a67584e72d46
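As a rough illustration of the bug class described in this commit message (a sketch with hypothetical names, not the actual GROMACS code): per-thread work buffers filled while rebuilding the local topology under domain decomposition have to be (re)grown whenever repartitioning changes how many restrained atoms a rank owns, otherwise a write lands past the end of a buffer sized for an earlier partitioning.

// Sketch only: hypothetical names, not the actual GROMACS implementation.
#include <vector>

struct ThreadWork
{
    // Local atoms carrying flat-bottomed position restraints, collected per thread.
    std::vector<int> fbposresAtoms;
};

// Collect restrained atoms into per-thread buffers, as would happen when
// (re)building the local topology after a domain-decomposition repartitioning.
void collectFbPosres(int numThreads, int numLocalAtoms,
                     const std::vector<bool>& atomHasFbPosres,
                     std::vector<ThreadWork>* work)
{
    // The essence of the fix: make sure the per-thread storage matches the
    // current local atom count instead of a size left over from an earlier
    // partitioning.
    work->resize(numThreads);

#pragma omp parallel for schedule(static)
    for (int t = 0; t < numThreads; t++)
    {
        ThreadWork& tw = (*work)[t];
        tw.fbposresAtoms.clear();
        const int atomStart = (numLocalAtoms * t) / numThreads;
        const int atomEnd   = (numLocalAtoms * (t + 1)) / numThreads;
        for (int a = atomStart; a < atomEnd; a++)
        {
            if (atomHasFbPosres[a])
            {
                // std::vector grows as needed here; a fixed-size C buffer would
                // need an explicit reallocation check before this write.
                tw.fbposresAtoms.push_back(a);
            }
        }
    }
}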

History

#1 Updated by Yunlong Liu almost 3 years ago

From some quick debugging of this problem: an MPI process does not go through the OpenMP for loop here:

https://github.com/gromacs/gromacs/blob/7dffe13ebf80f29197e83d554493e8036c819a61/src/gromacs/domdec/domdec_topology.cpp#L2109

#2 Updated by Gerrit Code Review Bot almost 3 years ago

Gerrit received a related patchset '1' for Issue #2095.
Uploader: Berk Hess
Change-Id: gromacs~release-2016~I03af546a0b8d03a3d384d86a2582a67584e72d46
Gerrit URL: https://gerrit.gromacs.org/6397

#3 Updated by Berk Hess almost 3 years ago

  • Category set to mdrun
  • Status changed from New to Fix uploaded
  • Assignee set to Berk Hess

#4 Updated by Mark Abraham almost 3 years ago

  • Description updated (diff)

#5 Updated by Berk Hess almost 3 years ago

  • Status changed from Fix uploaded to Resolved

#6 Updated by Mark Abraham almost 3 years ago

  • Status changed from Resolved to Closed

#7 Updated by Mark Abraham about 2 years ago

  • Has duplicate Bug #2236: FEP calculation with flat bottom restraints added

#8 Updated by j diaz over 1 year ago

Has this already been fixed?

#9 Updated by Mark Abraham over 1 year ago

j diaz wrote:

Has this already been fixed?

Yes, in 2016.2 (check its release notes to be sure)
