Bug #1062

System crashing with Acceleration=None and GPU

Added by Sebastian Waltz over 4 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized
Close

Description

I run GROMACS on my local workstation (CPU=i7, GPU=2*GTX670, Debian 6.0) and on the cluster (2*X5650, M2090, Scientific Linux 5.5). On both systems I installed the 4.6-beta1 release using gcc 4.6, fftw3 version 3.3.3 and cuda 4.2. On my local workstation everything works perfectly fine: I can run simulations for several 100 ns without any problems, and comparing the trajectories with runs made with version 4.5.5 (CPU only) I cannot find any significant discrepancies. When I try to run the same setup (attached tpr file) on the cluster, I get LINCS warnings after a few steps (attached log file) and the run breaks with a segfault. I also tried the newest git version (4.6-beta1-dev-20121205-054d40f) and still get the same warnings. I compiled gromacs again using icc and the mkl (10.3.5) libs.

Dears

Sebastian

pdz_cis_NPT_equi.log (12.5 KB) Sebastian Waltz, 12/06/2012 11:05 AM

pdz_cis_NPT_equi.tpr (865 KB) Sebastian Waltz, 12/06/2012 11:06 AM

Associated revisions

Revision 928ff177 (diff)
Added by Berk Hess over 4 years ago

fixed GPU pair-search with GMX_CPU_ACCELERATION=None

Fixes #1042 and fixes #1062

Change-Id: I95ed242823aa1c108fc6c26bedc88062a0cd81d7

History

#1 Updated by Sebastian Waltz over 4 years ago

#2 Updated by Roland Schulz over 4 years ago

The log file is with GCC 4.1.2. Do you have the problem on the cluster also with gcc 4.6?
BTW: The log file also says "Acceleration selected at GROMACS compile time: None". Did you manually select that or was that automatically selected? Were you told to "disable SSE4.1" and selected None? We meant to imply that one should use SSE2 but that message is obviously confusing. Could you also test whether you still have the problem when selecting SSE2?

#3 Updated by Sebastian Waltz over 4 years ago

Roland Schulz wrote:

The log file is with GCC 4.1.2. Do you have the problem on the cluster also with gcc 4.6?
BTW: The log file also says "Acceleration selected at GROMACS compile time: None". Did you manually select that or was that automatically selected? Were you told to "disable SSE4.1" and selected None? We meant to imply that one should use SSE2 but that message is obviously confusing. Could you also test whether you still have the problem when selecting SSE2?

It's the missing SSE2 which makes it crash. Including SSE2 support fixes the problem.
The log file is somewhat strange since it still shows gcc 4.1.2, but with 4.1.2 I was never able to get SSE2 working, and I am sure that I use gcc 4.6 for compiling GROMACS.

#4 Updated by Roland Schulz over 4 years ago

  • Subject changed from Strange lincs warnings to System crashing with Acceleration=None

It is important that the log file is correct and there are no known problems. Could you look for the CMAKE_C_COMPILER line in your CMakeCache.txt, and then check which version that compiler is by executing it with --version?
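
The check requested above can be scripted. The following is a minimal sketch, assuming the usual CMakeCache.txt entry layout of NAME:TYPE=VALUE; the helper names find_c_compiler and compiler_version are made up for this illustration and are not part of GROMACS or CMake.

```python
import re
import subprocess

# Matches a cache entry such as "CMAKE_C_COMPILER:FILEPATH=/usr/bin/gcc".
# Related entries like CMAKE_C_COMPILER-ADVANCED are deliberately not matched.
_ENTRY = re.compile(r"^CMAKE_C_COMPILER:[A-Z]+=(.*)$")


def find_c_compiler(cache_text):
    """Return the CMAKE_C_COMPILER value from CMakeCache.txt contents, or None."""
    for line in cache_text.splitlines():
        match = _ENTRY.match(line.strip())
        if match:
            return match.group(1)
    return None


def compiler_version(compiler):
    """Run the compiler with --version and return the first line it prints."""
    result = subprocess.run(
        [compiler, "--version"], capture_output=True, text=True, check=True
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""
```

To use it, read CMakeCache.txt from the build directory, pass its contents to find_c_compiler, and pass the returned path to compiler_version. The reported version can then be compared with the compiler named in the mdrun log file.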

#5 Updated by Roland Schulz over 4 years ago

  • Subject changed from System crashing with Acceleration=None to System crashing with Acceleration=None and GPU
  • Assignee changed from Roland Schulz to Szilárd Páll

I can reproduce this. It seems to be a problem only with the combination of GPU (it is fine with -nb cpu) and Acceleration=None.

#6 Updated by Roland Schulz over 4 years ago

PS: It doesn't seem to depend on the compiler. It is an issue with both GCC 4.1.2 and 4.6.1.

#7 Updated by Berk Hess over 4 years ago

  • Assignee changed from Szilárd Páll to Berk Hess

I already knew about this issue. It must simply be a bug in my pair search code, but I haven't found it yet. It's not a memory error; valgrind is happy. Probably the bounding box distance is not calculated correctly. PS: This is not critical at all, since with a GPU you always have an SSE-capable CPU.
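
For readers unfamiliar with the term: a pair search can skip a pair of atom groups when their bounding boxes are farther apart than the cut-off, so an error in that distance leads to missed interactions. The following is a generic sketch of the squared distance between two axis-aligned bounding boxes. It is not the GROMACS code; the function name bb_dist2 and the representation of a box as lower and upper corner sequences are assumptions made for this illustration.

```python
def bb_dist2(lower_i, upper_i, lower_j, upper_j):
    """Squared distance between two axis-aligned bounding boxes.

    Each box is given by its lower and upper corner (x, y, z).
    Overlapping boxes have distance zero.
    """
    d2 = 0.0
    for lo_i, up_i, lo_j, up_j in zip(lower_i, upper_i, lower_j, upper_j):
        # Gap along this axis: positive only when the boxes do not overlap here.
        gap = max(lo_i - up_j, lo_j - up_i, 0.0)
        d2 += gap * gap
    return d2
```

A pair of groups would only need to be examined further when bb_dist2 is smaller than the squared cut-off. If the function returns a value that is too large, interacting pairs are dropped, which would fit the kind of failure suspected here.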

#8 Updated by Roland Schulz over 4 years ago

I agree, but before 3e0842aad57dceae7fa4 the SSE4.1 message was confusing, and Sebastian probably wasn't the only one who disabled CPU acceleration altogether. If we can't find it before the next beta, we could put an error in cmake for the cpu-acceleration=none + gpu combination.

#9 Updated by Berk Hess over 4 years ago

  • Status changed from New to Feedback wanted

#10 Updated by Berk Hess over 4 years ago

  • Status changed from Feedback wanted to Closed
