Bug #1034
Gromacs 4.6 segmentation fault with mdrun
Description
OS: Debian Linux version 6.0.6
Compiler: gcc version 4.4
Results of runs with the following configurations:
- "mdrun -nb cpu" (to run CPU-only with the Verlet scheme):
segmentation fault when starting the actual simulation
(verlet_cpu_only.log, verlet_cpu_only.debug)
- "GMX_EMULATE_GPU=1 mdrun -nb gpu" (to run GPU emulation using plain C kernels):
segmentation fault when starting the actual simulation (gpu_emulation.*)
- "mdrun" without any arguments (which will use 2x(n/2 cores + 1 GPU)):
segmentation fault when starting the actual simulation (no_arguments.*)
- "mdrun -ntmpi 1" without any other arguments (which will use n cores + the first GPU):
segmentation fault when starting the actual simulation (ntmpi.*)
The run that works is the one without the Verlet cut-off scheme (without_verlet_cut_off.*)
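The four failing invocations above can be replayed with a small script such as the following. This is only a sketch: it assumes mdrun is on the PATH and that its default input (topol.tpr) is in the current directory. The run_case helper and the DRY_RUN switch are hypothetical names introduced here, not part of GROMACS.

```shell
#!/bin/sh
# Sketch: replay the four failing mdrun invocations from the report.
# DRY_RUN is a hypothetical switch: 1 (default) only prints the command
# lines, 0 actually runs them and records each exit status.
DRY_RUN=${DRY_RUN:-1}

run_case() {
    # $1 = label used for the output file, remaining args = command line
    label=$1
    shift
    if [ "$DRY_RUN" = 1 ]; then
        echo "$label: $*"
    else
        "$@" > "$label.out" 2>&1
        echo "$label: exit status $?"
    fi
}

run_case verlet_cpu_only mdrun -nb cpu
run_case gpu_emulation   env GMX_EMULATE_GPU=1 mdrun -nb gpu
run_case no_arguments    mdrun
run_case ntmpi           mdrun -ntmpi 1
```

With DRY_RUN=0, a run killed by a segmentation fault is usually reported by the shell as exit status 139 (128 + signal 11) on Linux.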
History
#1 Updated by Szilárd Páll about 8 years ago
- Description updated (diff)
- Priority changed from Normal to High
Based on the mdrun.debug outputs, it seems that the crash happens in the NxN pair-search.
Related discussion on the gmx-users list:
http://www.mail-archive.com/gmx-users@gromacs.org/index.html#55454
#2 Updated by Berk Hess about 8 years ago
I can't reproduce any of these crashes on my Sandy Bridge + GPU machine.
But we just found a strange issue with the current git code, which has not been fully resolved yet.
Could you try replacing gmx_erfd in src/mdlib/tables.c by erfd?
This might help.
#3 Updated by Sebastian Waltz about 8 years ago
Berk Hess wrote:
I can't reproduce any of these crashes on my Sandy Bridge + GPU machine.
But we just found a strange issue with the current git code, which has not been fully resolved yet.
Could you try replacing gmx_erfd in src/mdlib/tables.c by erfd?
This might help.
Then it no longer compiles (undefined reference to `erfd'):
../mdlib/libmd.so.6: undefined reference to `erfd'
collect2: ld returned 1 exit status
../mdlib/libmd.so.6: undefined reference to `erfd'
collect2: ld returned 1 exit status
make[2]: *** [src/kernel/g_luck] Error 1
make[1]: *** [src/kernel/CMakeFiles/g_luck.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
make[2]: *** [src/kernel/g_x2top] Error 1
make[1]: *** [src/kernel/CMakeFiles/g_x2top.dir/all] Error 2
../mdlib/libmd.so.6: undefined reference to `erfd'
collect2: ld returned 1 exit status
make[2]: *** [src/kernel/g_protonate] Error 1
make[1]: *** [src/kernel/CMakeFiles/g_protonate.dir/all] Error 2
[ 84%] Built target gmxana
make: *** [all] Error 2
#4 Updated by Berk Hess about 8 years ago
Sorry, that should be erf, not erfd.
#5 Updated by Sebastian Waltz about 8 years ago
Sebastian Waltz wrote:
Berk Hess wrote:
I can't reproduce any of these crashes on my Sandy Bridge + GPU machine.
But we just found a strange issue with the current git code, which has not been fully resolved yet.
Could you try replacing gmx_erfd in src/mdlib/tables.c by erfd?
This might help.
After replacing it with erf, it compiles, but I still get the same segfault.
#6 Updated by Berk Hess about 8 years ago
I have finally managed to run a memory checker and it gave one, unrelated error. So I have no clue what the issue is you are experiencing. We have had OpenMP issues with old gcc versions.
Could you try reconfiguring and recompiling with -DGMX_OPENMP=off to check if that might be the cause?
Installing a newer version of gcc will anyhow improve performance. We would recommend gcc 4.7.
#7 Updated by Sebastian Waltz about 8 years ago
Berk Hess wrote:
I have finally managed to run a memory checker and it gave one, unrelated error. So I have no clue what the issue is you are experiencing. We have had OpenMP issues with old gcc versions.
Could you try reconfiguring and recompiling with -DGMX_OPENMP=off to check if that might be the cause?
Installing a newer version of gcc will anyhow improve performance. We would recommend gcc 4.7.
I compiled GROMACS with the -DGMX_OPENMP=off option and still get the same segfault. Since I am running Debian 6 on my system, an update to gcc 4.7 is hard to do and ends up in all sorts of dependency problems. Over the weekend I will try to get gcc 4.7 running and will compile GROMACS with it.
It seems that I am not the only one with this problem ( http://www.mail-archive.com/gmx-users@gromacs.org/msg55541.html )
#8 Updated by Berk Hess about 8 years ago
Something that would help a lot is to get a full backtrace.
Could you switch to debug mode: use ccmake, change the build type from Release to Debug, recompile
and then do:
gdb mdrun
type: run
wait until it crashes, then
type: where
and send me the result?
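The interactive gdb steps above can also be run non-interactively, which makes the backtrace easy to attach to the issue. The following is a sketch under stated assumptions: the file names backtrace.gdb and backtrace.txt are chosen here for illustration, and mdrun is assumed to be a Debug build on the PATH with its default input in the current directory.

```shell
#!/bin/sh
# Sketch: capture a backtrace without typing into gdb by hand.
# A Debug build can be configured with: cmake -DCMAKE_BUILD_TYPE=Debug <source dir>

# gdb command file: start the program, then print the stack once it stops.
cat > backtrace.gdb <<'EOF'
run
where
EOF

# Only attempt the run when both tools are available.
if command -v gdb >/dev/null 2>&1 && command -v mdrun >/dev/null 2>&1; then
    # -batch exits when the commands finish; --args passes the rest to mdrun.
    gdb -batch -x backtrace.gdb --args mdrun > backtrace.txt 2>&1
    echo "backtrace written to backtrace.txt"
else
    echo "gdb or mdrun not found; only backtrace.gdb was written"
fi
```

If the program exits normally instead of crashing, gdb reports that there is no stack, so an empty backtrace is itself informative.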
#9 Updated by Sebastian Waltz about 8 years ago
Sebastian Waltz wrote:
Berk Hess wrote:
I have finally managed to run a memory checker and it gave one, unrelated error. So I have no clue what the issue is you are experiencing. We have had OpenMP issues with old gcc versions.
Could you try reconfiguring and recompiling with -DGMX_OPENMP=off to check if that might be the cause?
Installing a newer version of gcc will anyhow improve performance. We would recommend gcc 4.7.
I compiled GROMACS with the -DGMX_OPENMP=off option and still get the same segfault. Since I am running Debian 6 on my system, an update to gcc 4.7 is hard to do and ends up in all sorts of dependency problems. Over the weekend I will try to get gcc 4.7 running and will compile GROMACS with it.
It seems that I am not the only one with this problem ( http://www.mail-archive.com/gmx-users@gromacs.org/msg55541.html )
Please do not ask me why, but the cmake -DCMAKE_BUILD_TYPE:STRING=Debug flag solved the problem. I tried it again with a GROMACS version compiled without the flag and got the same segfault as before. Including the flag again made the segfault disappear again.
Thanks a lot
#10 Updated by Roland Schulz about 8 years ago
Could you attach your CMakeCache.txt so we can see what acceleration level is chosen by cmake? You could also compile with Release but add "-g" to CMAKE_C_FLAGS_RELEASE. That way you should be able to get a stack trace with gdb.
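One way to follow this suggestion from the command line is sketched below. It assumes a build directory inside the source tree (hence the trailing "..") and that "-O3 -DNDEBUG" are the Release flags worth keeping. Both are assumptions, not taken from the report, and the GROMACS build system may add further flags of its own.

```shell
#!/bin/sh
# Sketch: keep a Release build but add -g so gdb can show a usable stack trace.
# Assumptions (not from the report): run from a build directory inside the
# source tree, and "-O3 -DNDEBUG" are the optimisation flags to keep.
RELEASE_FLAGS="-O3 -DNDEBUG -g"

# Print the configure command instead of running it, so it can be reviewed first.
echo cmake -DCMAKE_BUILD_TYPE=Release "-DCMAKE_C_FLAGS_RELEASE=\"$RELEASE_FLAGS\"" ..

# The acceleration level chosen at configure time is stored in the cache;
# a case-insensitive search avoids guessing the exact variable name.
if [ -f CMakeCache.txt ]; then
    grep -i accel CMakeCache.txt || true
fi
```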
#11 Updated by Szilárd Páll about 8 years ago
Everything is in the log files he attached:
Detecting CPU-specific acceleration.
Present hardware specification:
Vendor: GenuineIntel
Brand: Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
Family: 6 Model: 45 Stepping: 7
Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
Acceleration most likely to fit this hardware: AVX_256
Acceleration selected at GROMACS compile time: AVX_256
2 GPUs detected:
#0: NVIDIA GeForce GTX 670, compute cap.: 3.0, ECC: no, stat: compatible
#1: NVIDIA GeForce GTX 670, compute cap.: 3.0, ECC: no, stat: compatible
#12 Updated by Roland Schulz about 8 years ago
I can reproduce it with the version you used (from branch nbnxn_hybrid_acc). Please recompile using branch release-4-6 and test whether that fixes it.
#13 Updated by Szilárd Páll about 8 years ago
Roland Schulz wrote:
I can reproduce it with the version you used (from branch nbnxn_hybrid_acc). Please recompile using branch release-4-6 and test whether that fixes it.
Good catch, I should have checked the version. I wish we had the (over and over discussed) compulsory version field which would have forced the reporter to explicitly state the version.
#14 Updated by Roland Schulz about 8 years ago
What exactly do you mean by "version field"? Does it have a redmine issue? If not, could you create one and summarize the discussion?
#15 Updated by Sebastian Waltz about 8 years ago
Roland Schulz wrote:
I can reproduce it with the version you used (from branch nbnxn_hybrid_acc). Please recompile using branch release-4-6 and test whether that fixes it.
The release version works perfectly fine.
Sorry
#16 Updated by Szilárd Páll about 8 years ago
Roland Schulz wrote:
What exactly do you mean by "version field"? Does it have a redmine issue? If not, could you create one and summarize the discussion?
http://redmine.gromacs.org/issues/689
The "Detected in" drop-down field was removed without much discussion 1.5-2 years ago and has never been re-added. With that, not only was information on some (then) existing bugs trashed, but we also crippled the database: since then almost nobody adds version information to the bugs.
#17 Updated by Roland Schulz about 8 years ago
Thanks. I misunderstood you and I thought you were talking about some better reporting of the version number in the mdrun log/output (which I think is OK - so I was confused).
#18 Updated by Roland Schulz about 8 years ago
- Status changed from New to Closed