Bug #1034

Gromacs 4.6 segmentation fault with mdrun

Added by Sebastian Waltz over 6 years ago. Updated over 6 years ago.

Status: Closed
Priority: High
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

OS: Debian Linux 6.0.6
Compiler: gcc 4.4
Results of runs with the following configurations:
- "mdrun -nb cpu" (to run CPU-only with Verlet scheme)
segmentation fault when starting the actual simulation
(verlet_cpu_only.log verlet_cpu_only.debug)

- "GMX_EMULATE_GPU=1 mdrun -nb gpu" (to run GPU emulation using plain C kernels);
segmentation fault when starting the actual simulation (gpu_emulation.*)

- "mdrun" without any arguments (which will use 2x(n/2 cores + 1 GPU))
segmentation fault when starting the actual simulation (no_arguments.*)

- "mdrun -ntmpi 1" without any other arguments (which will use n cores + the first GPU)
segmentation fault when starting the actual simulation (ntmpi.*)

The run that works is the one without the Verlet cut-off scheme (without_verlet_cut_off.*)
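
The configurations listed above could be compared with a small helper like the one below. This is only a sketch: the input file name `topol.tpr` and the `.out` output files are assumptions (the report does not name them), and `run_config` is a hypothetical helper, not part of GROMACS. On Linux a process killed by a segmentation fault is normally reported by the shell with exit status 139 (128 + signal 11).

```shell
# Run one configuration, keep its output in <label>.out, and report the exit status.
# A status of 139 usually means the process was killed by SIGSEGV.
run_config() {
    label="$1"; shift
    "$@" >"$label.out" 2>&1
    status=$?
    echo "$label: exit status $status"
}

# Example calls (topol.tpr is an assumed file name):
# run_config verlet_cpu_only mdrun -s topol.tpr -nb cpu
# run_config gpu_emulation   env GMX_EMULATE_GPU=1 mdrun -s topol.tpr -nb gpu
# run_config no_arguments    mdrun -s topol.tpr
# run_config ntmpi           mdrun -s topol.tpr -ntmpi 1
```

Each call prints one line, so the failing and working configurations can be seen side by side.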

log_files.tar.gz (273 KB): log and debug files from the corresponding mdrun runs. Sebastian Waltz, 11/13/2012 05:37 PM
run_files.tar.gz (477 KB): files used to generate the run input files. Sebastian Waltz, 11/13/2012 05:37 PM

History

#1 Updated by Szilárd Páll over 6 years ago

  • Description updated (diff)
  • Priority changed from Normal to High

Based on the mdrun.debug outputs, it seems that the crash happens in the NxN pair-search.

Related discussion on the gmx-users list:
http://www.mail-archive.com/gmx-users@gromacs.org/index.html#55454

#2 Updated by Berk Hess over 6 years ago

I can't reproduce any of these crashes on my Sandy Bridge + GPU machine.
But we just found a strange issue with the current git code, which has not been fully resolved yet.
Could you try replacing gmx_erfd in src/mdlib/tables.c by erfd?
This might help.

#3 Updated by Sebastian Waltz over 6 years ago

Berk Hess wrote:

I can't reproduce any of these crashes on my Sandy Bridge + GPU machine.
But we just found a strange issue with the current git code, which has not been fully resolved yet.
Could you try replacing gmx_erfd in src/mdlib/tables.c by erfd?
This might help.

Then it no longer compiles (undefined reference to `erfd'):

../mdlib/libmd.so.6: undefined reference to `erfd'
collect2: ld returned 1 exit status
../mdlib/libmd.so.6: undefined reference to `erfd'
collect2: ld returned 1 exit status
make[2]: *** [src/kernel/g_luck] Error 1
make[1]: *** [src/kernel/CMakeFiles/g_luck.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
make[2]: *** [src/kernel/g_x2top] Error 1
make[1]: *** [src/kernel/CMakeFiles/g_x2top.dir/all] Error 2
../mdlib/libmd.so.6: undefined reference to `erfd'
collect2: ld returned 1 exit status
make[2]: *** [src/kernel/g_protonate] Error 1
make[1]: *** [src/kernel/CMakeFiles/g_protonate.dir/all] Error 2
[ 84%] Built target gmxana
make: *** [all] Error 2

#4 Updated by Berk Hess over 6 years ago

Sorry, that should be erf, not erfd.
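
For reference, `erf` is the standard C99 error function declared in `math.h`, whereas the C library has no `erfd` symbol, which is why the link step failed. The substitution could be scripted roughly as follows. The helper name `use_libm_erf` is hypothetical, and the sketch assumes that every occurrence of `gmx_erfd` in the file should be replaced.

```shell
# Replace every occurrence of gmx_erfd with the C library's erf in one source file.
# The original file is kept next to it with a .bak suffix.
use_libm_erf() {
    sed -i.bak 's/gmx_erfd/erf/g' "$1"
}

# Example call, from the top of the source tree:
# use_libm_erf src/mdlib/tables.c
```

Restoring the `.bak` copy undoes the change once the test is finished.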

#5 Updated by Sebastian Waltz over 6 years ago

Sebastian Waltz wrote:
Berk Hess wrote:

I can't reproduce any of these crashes on my Sandy Bridge + GPU machine.
But we just found a strange issue with the current git code, which has not been fully resolved yet.
Could you try replacing gmx_erfd in src/mdlib/tables.c by erfd?
This might help.

With erf instead, it compiles, but I still get the same segfault.

#6 Updated by Berk Hess over 6 years ago

I have finally managed to run a memory checker and it gave one, unrelated error. So I have no clue what the issue is you are experiencing. We have had OpenMP issues with old gcc versions.
Could you try reconfiguring and recompiling with -DGMX_OPENMP=off to check if that might be the cause?
Installing a newer version of gcc will anyhow improve performance. We would recommend gcc 4.7.

#7 Updated by Sebastian Waltz over 6 years ago

Berk Hess wrote:

I have finally managed to run a memory checker and it gave one, unrelated error. So I have no clue what the issue is you are experiencing. We have had OpenMP issues with old gcc versions.
Could you try reconfiguring and recompiling with -DGMX_OPENMP=off to check if that might be the cause?
Installing a newer version of gcc will anyhow improve performance. We would recommend gcc 4.7.

I compiled GROMACS with the -DGMX_OPENMP=off option and still get the same segfault. Since I am running Debian 6 on my system, an update to gcc 4.7 is hard to do and ends up in all sorts of dependency problems. Over the weekend I will try to get gcc 4.7 running and will compile GROMACS with it.
It seems that I am not the only one with this problem ( http://www.mail-archive.com/gmx-users@gromacs.org/msg55541.html )

#8 Updated by Berk Hess over 6 years ago

Something that would help a lot is to get a full backtrace.
Could you switch to debug mode: use ccmake, change the buildtype from Release to Debug
and then do:
gdb mdrun
type: run
wait until it crashes, then
type: where
and send me the result?
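
The same gdb session can also be run non-interactively, which makes the backtrace easier to capture in a file. The wrapper below is a sketch; `backtrace_of` is a hypothetical helper name and the mdrun arguments in the example are assumptions. It relies on gdb's `-batch`, `-ex` and `--args` options.

```shell
# gdb executable to use; can be overridden through the GDB environment variable.
GDB="${GDB:-gdb}"

# Run a program under gdb until it stops, then print the stack ("where").
backtrace_of() {
    "$GDB" -batch -ex run -ex where --args "$@"
}

# Example call (arguments are assumed):
# backtrace_of mdrun -s topol.tpr -nb cpu > backtrace.txt 2>&1
```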

#9 Updated by Sebastian Waltz over 6 years ago

Sebastian Waltz wrote:

Berk Hess wrote:

I have finally managed to run a memory checker and it gave one, unrelated error. So I have no clue what the issue is you are experiencing. We have had OpenMP issues with old gcc versions.
Could you try reconfiguring and recompiling with -DGMX_OPENMP=off to check if that might be the cause?
Installing a newer version of gcc will anyhow improve performance. We would recommend gcc 4.7.

I compiled GROMACS with the -DGMX_OPENMP=off option and still get the same segfault. Since I am running Debian 6 on my system, an update to gcc 4.7 is hard to do and ends up in all sorts of dependency problems. Over the weekend I will try to get gcc 4.7 running and will compile GROMACS with it.
It seems that I am not the only one with this problem ( http://www.mail-archive.com/gmx-users@gromacs.org/msg55541.html )

Please do not ask me why, but the cmake -DCMAKE_BUILD_TYPE:STRING=Debug flag solved the problem. I tried it again with a GROMACS build compiled without the flag and got the same segfault as before. Including the flag again solved the issue again.

Thanks a lot

#10 Updated by Roland Schulz over 6 years ago

Could you attach your CMakeCache.txt so we can see what acceleration level is chosen by cmake? Also, you could compile with Release but then add a "-g" to CMAKE_C_FLAGS_RELEASE. That way you should be able to get a stack trace with gdb.
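
One way to add "-g" without losing the existing Release flags is to read the current value from CMakeCache.txt and append to it. The helper below is a sketch under that assumption; `release_flags_with_g` is a hypothetical name, and it expects the usual `NAME:TYPE=value` cache line format.

```shell
# Print the cached CMAKE_C_FLAGS_RELEASE value with -g appended.
# If the cache has no such entry, only -g is printed.
release_flags_with_g() {
    flags="$(sed -n 's/^CMAKE_C_FLAGS_RELEASE:[A-Z]*=//p' "$1")"
    printf '%s\n' "${flags:+$flags }-g"
}

# Example call, from inside the build directory:
# cmake -DCMAKE_BUILD_TYPE=Release \
#       -DCMAKE_C_FLAGS_RELEASE="$(release_flags_with_g CMakeCache.txt)" .
```

This keeps the optimisation level of the Release build, so the crash should still occur, while gdb gets the symbols it needs.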

#11 Updated by Szilárd Páll over 6 years ago

Everything is in the log files he attached:

Detecting CPU-specific acceleration.
Present hardware specification:
Vendor: GenuineIntel
Brand:  Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
Family:  6  Model: 45  Stepping:  7
Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Acceleration most likely to fit this hardware: AVX_256
Acceleration selected at GROMACS compile time: AVX_256

2 GPUs detected:
  #0: NVIDIA GeForce GTX 670, compute cap.: 3.0, ECC:  no, stat: compatible
  #1: NVIDIA GeForce GTX 670, compute cap.: 3.0, ECC:  no, stat: compatible

#12 Updated by Roland Schulz over 6 years ago

I can reproduce it with the version you used (from branch nbnxn_hybrid_acc). Please recompile using branch release-4-6 and test whether that fixes it.
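
Switching an existing git checkout to that branch could look like the sketch below. The helper name `switch_to_release` and the example source path are assumptions, and the build still has to be reconfigured and recompiled afterwards.

```shell
# git executable to use; can be overridden through the GIT environment variable.
GIT="${GIT:-git}"

# Check out the release-4-6 branch in the given source tree and print the
# name of the branch that is now checked out.
switch_to_release() {
    "$GIT" -C "$1" checkout release-4-6 &&
    "$GIT" -C "$1" symbolic-ref --short HEAD
}

# Example call (the path is assumed):
# switch_to_release "$HOME/gromacs"
```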

#13 Updated by Szilárd Páll over 6 years ago

Roland Schulz wrote:

I can reproduce it with the version you used (from branch nbnxn_hybrid_acc). Please recompile using branch release-4-6 and test whether that fixes it.

Good catch, I should have checked the version. I wish we had the (over and over discussed) compulsory version field which would have forced the reporter to explicitly state the version.

#14 Updated by Roland Schulz over 6 years ago

What exactly do you mean by "version field"? Does it have a redmine issue? If not, could you create one and summarize the discussion?

#15 Updated by Sebastian Waltz over 6 years ago

Roland Schulz wrote:

I can reproduce it with the version you used (from branch nbnxn_hybrid_acc). Please recompile using branch release-4-6 and test whether that fixes it.

The release version works perfectly fine.
Sorry

#16 Updated by Szilárd Páll over 6 years ago

Roland Schulz wrote:

What exactly do you mean by "version field"? Does it have a redmine issue? If not, could you create one and summarize the discussion?

http://redmine.gromacs.org/issues/689

The "Detected in" drop-down field was removed without much discussion 1.5-2 years ago and has never been re-added. With that, not only was information on some (then) existing bugs trashed, but we also crippled the database: since then almost nobody adds version information to the bugs.

#17 Updated by Roland Schulz over 6 years ago

Thanks. I misunderstood you and I thought you were talking about some better reporting of the version number in the mdrun log/output (which I think is OK - so I was confused).

#18 Updated by Roland Schulz over 6 years ago

  • Status changed from New to Closed
