Bug #256

MD simulation crashes on 8 CPUs but not on 1 CPU

Added by Andreas Kring almost 11 years ago. Updated almost 11 years ago.

Status: Closed
Priority: Normal
Assignee: Erik Lindahl
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Created an attachment (id=324)
tpr file

Gromacs version: 4.0.2 (double precision)

MD simulation of a system consisting of 999 polarizable water molecules and 1 hydroxide ion (tpr file attached). The simulation runs fine on 1 CPU, but it crashes when using 8 CPUs. The following commands were used:

$ grompp_d -c system.gro -p topology.top -f NVT.mdp

followed by

$ mpirun -np 8 mdrun_d -nice 11

This generates the following error:

Reading file topol.tpr, VERSION 4.0.2 (double precision)
NNODES=8, MYRANK=2, HOSTNAME=yoda
NODEID=2 argc=3

Will use 6 particle-particle and 2 PME only nodes
This is a guess, check the performance at the end of the log file
Making 1D domain decomposition 6 x 1 x 1
starting mdrun 'Hydroxide in POL3 adapted to GROMACS'
10000 steps, 10.0 ps.
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x200852d80
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x200852d80
[0] func:/usr/lib/libopal.so.0 [0x2ada7a4e2f74]
[1] func:/lib/libpthread.so.0 [0x2ada7ab2d7d0]
[2] func:/usr/local/gromacs-4.0.2/lib/libmd_mpi_d.so.5(gmx_pme_do+0x5ef) [0x2ada7907bbaf]
[3] func:/usr/local/gromacs-4.0.2/lib/libmd_mpi_d.so.5(gmx_pmeonly+0x25c) [0x2ada7907ed9c]
[4] func:mdrun4_d(mdrunner+0xe49) [0x41d7c9]
[5] func:mdrun4_d(main+0x3a7) [0x423eb7]
[6] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2ada7ad591c4]
[7] func:mdrun4_d(do_cg+0x1a1) [0x405319]
*** End of error message ***
[0] func:/usr/lib/libopal.so.0 [0x2b6e57e6df74]
[1] func:/lib/libpthread.so.0 [0x2b6e584b87d0]
[2] func:/usr/local/gromacs-4.0.2/lib/libmd_mpi_d.so.5(gmx_pme_do+0x5ef) [0x2b6e56a06baf]
[3] func:/usr/local/gromacs-4.0.2/lib/libmd_mpi_d.so.5(gmx_pmeonly+0x25c) [0x2b6e56a09d9c]
[4] func:mdrun4_d(mdrunner+0xe49) [0x41d7c9]
[5] func:mdrun4_d(main+0x3a7) [0x423eb7]
[6] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b6e586e41c4]
[7] func:mdrun4_d(do_cg+0x1a1) [0x405319]
*** End of error message ***
1 additional process aborted (not shown)

Regards
Andreas

topol.tpr (284 KB) - tpr file - Andreas Kring, 11/11/2008 11:16 AM

History

#1 Updated by Berk Hess almost 11 years ago

The problem has to do with the shell prediction. I have not found out exactly what the problem is, but for the moment you can turn off the prediction with the environment variable GMX_NOPREDICT. In my case:
mpirun -np 8 -x GMX_NOPREDICT=1 mdrun

Berk
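
If your mpirun does not support forwarding environment variables with -x, exporting the variable in the shell before launching should have the same effect, assuming the launcher passes the local environment on to the ranks. A sketch of the equivalent invocation (not taken from the thread above, adjust to your setup):

$ export GMX_NOPREDICT=1
$ mpirun -np 8 mdrun_d -nice 11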

#2 Updated by Berk Hess almost 11 years ago

I fixed this bug for the 4.0.3 release.

If you want the fix now (turning off shell prediction makes your simulation 50% slower, so I guess you want it), move line 478 of src/mdlib/shellfc.c,
shell[nshell].shell = i;
to just after the if (...) {...} block, right before nshell++.

Berk
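
For anyone applying the change by hand, here is a minimal, self-contained sketch of the reordering being described. Every name, the loop, and the condition below are placeholders invented for illustration and do not come from shellfc.c; only the relative position of the assignment and the nshell++ increment reflects the described fix.

/* Illustrative sketch only: hypothetical names and a placeholder
 * condition, not the actual contents of src/mdlib/shellfc.c. */
#include <stdio.h>

typedef struct {
    int shell;                        /* particle index of the shell */
} t_shell_entry;

/* Pre-4.0.3 ordering: the slot is written before the conditional, so
 * entries that the if (...) {...} block ends up skipping still have a
 * value written into shell[nshell]. */
static int collect_shells_buggy(t_shell_entry *shell, const int *is_shell, int n)
{
    int i, nshell = 0;

    for (i = 0; i < n; i++) {
        shell[nshell].shell = i;      /* original placement (line 478) */
        if (!is_shell[i]) {           /* placeholder for the real check */
            continue;                 /* slot written but never counted */
        }
        nshell++;
    }
    return nshell;
}

/* Post-fix ordering: assign the index only for entries that are kept,
 * immediately before nshell is incremented. */
static int collect_shells_fixed(t_shell_entry *shell, const int *is_shell, int n)
{
    int i, nshell = 0;

    for (i = 0; i < n; i++) {
        if (!is_shell[i]) {
            continue;
        }
        shell[nshell].shell = i;      /* moved to just before nshell++ */
        nshell++;
    }
    return nshell;
}

int main(void)
{
    int is_shell[6] = { 1, 0, 1, 1, 0, 1 };
    t_shell_entry buf[6];
    int kept;

    kept = collect_shells_fixed(buf, is_shell, 6);
    printf("kept %d shells, first index %d\n", kept, buf[0].shell);

    (void) collect_shells_buggy;      /* reference only, not exercised */
    return 0;
}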

#3 Updated by Andreas Kring almost 11 years ago

Great! Once again thank you for fixing the bug so fast!
I'll move the line as you described.

Thanks
Andreas
