Bug #2726
FP exception in cufft 7.0
Description
With release-2019 HEAD on bs-gpu01 (gcc 4.8 and CUDA 7.0 from modules, debug build), I run various regressiontests (here complex/dd121) with
gmx mdrun -ntmpi 1 -bonded cpu -pme gpu -pmefft gpu -notunepme
...
Reading file topol.tpr, VERSION 2019-beta2-dev-20181101-f31fe12 (single precision)
Can not increase nstlist because verlet-buffer-tolerance is not set or used
Using 1 MPI thread
Using 6 OpenMP threads
1 GPU auto-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU
and get a segfault. Backtrace from gdb:
#0 0x00007ffff00dccb2 in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#1 0x00007ffff00dde57 in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#2 0x00007ffff00dc19a in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#3 0x00007ffff00db693 in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#4 0x00007ffff00da12f in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#5 0x00007ffff008cec7 in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#6 0x00007ffff008d087 in cufftLockPlan () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#7 0x00007ffff00e1aa3 in cufftMakePlanMany () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#8 0x00007ffff00e2fa2 in cufftPlanMany () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#9 0x00007ffff5984113 in GpuParallel3dFft::GpuParallel3dFft (this=0x1a316e0, pmeGpu=0x191d790) at /nethome/mabraham/git/master/src/gromacs/ewald/pme-gpu-3dfft.cu:95
#10 0x00007ffff598ba25 in gmx::compat::make_unique<GpuParallel3dFft<PmeGpu const*&> > () at /nethome/mabraham/git/master/src/gromacs/compat/make_unique.h:107
#11 0x00007ffff5987361 in pme_gpu_reinit_3dfft (pmeGpu=0x191d790) at /nethome/mabraham/git/master/src/gromacs/ewald/pme-gpu-internal.cpp:582
#12 0x00007ffff5988027 in pme_gpu_reinit_grids (pmeGpu=0x191d790) at /nethome/mabraham/git/master/src/gromacs/ewald/pme-gpu-internal.cpp:720
#13 0x00007ffff5988e6e in pme_gpu_reinit (pme=0xe0c2e0, gpuInfo=0x6e09b0, pmeGpuProgram=0x979fa0) at /nethome/mabraham/git/master/src/gromacs/ewald/pme-gpu-internal.cpp:898
#14 0x00007ffff72510f5 in gmx_pme_init (cr=0x6829a0, numPmeDomains=..., ir=0x7fffffffbd60, homenr=1789, bFreeEnergy_q=false, bFreeEnergy_lj=false, bReproducible=false, ewaldcoeff_q=3.67460346, ewaldcoeff_lj=0, nthread=6, runMode=PmeRunMode::GPU, pmeGpu=0x0, gpuInfo=0x6e09b0, pmeGpuProgram=0x979fa0) at ../src/gromacs/ewald/pme.cpp:966
#15 0x00007ffff72bbe08 in gmx::Mdrunner::mdrunner (this=0x7fffffffc7f0) at ../src/gromacs/mdrun/runner.cpp:1331
#16 0x000000000040b757 in gmx::gmx_mdrun (argc=10, argv=0x7fffffffd460) at ../src/programs/mdrun/mdrun.cpp:292
#17 0x00007ffff644c103 in gmx::(anonymous namespace)::CMainCommandLineModule::run (this=0x678340, argc=10, argv=0x7fffffffd460) at ../src/gromacs/commandline/cmdlinemodulemanager.cpp:133
#18 0x00007ffff644dbf8 in gmx::CommandLineModuleManager::run (this=0x7fffffffd330, argc=10, argv=0x7fffffffd460) at ../src/gromacs/commandline/cmdlinemodulemanager.cpp:589
#19 0x000000000040f9ae in main (argc=11, argv=0x7fffffffd458) at ../src/programs/gmx.cpp:60
More diagnostics:
:-) GROMACS - gmx, 2019-beta2-dev-20181101-56b66d5 (-:
Executable: /nethome/mabraham/git/master/build-cmake-gcc-4.8-cuda-7.0-debug/install/bin/gmx
Data prefix: /nethome/mabraham/git/master/build-cmake-gcc-4.8-cuda-7.0-debug/install
Working dir: /nethome/mabraham/nfs-git/regressiontests
Command line: gmx -version -quiet
GROMACS version: 2019-beta2-dev-20181101-56b66d5
GIT SHA1 hash: 56b66d5a2c5e037c1140df9546506acfb95f7b04
Branched from: 5abf87a1bf3a140701a7ffb0f82e21238dc6d232 (26 newer local commits)
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX_256
FFT library: fftw-3.3.6-pl1-sse2
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-1.11.2
Tracing support: disabled
C compiler: /opt/tcbsys/gcc/4.8.5/bin/gcc GNU 4.8.5
C compiler flags: -mavx -Wundef -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value -Wunused-parameter -g -fno-inline
C++ compiler: /opt/tcbsys/gcc/4.8.5/bin/g++ GNU 4.8.5
C++ compiler flags: -mavx -std=c++11 -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wmissing-declarations -Wall -g -fno-inline
CUDA compiler: /opt/tcbsys/cuda/7.0/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2015 NVIDIA Corporation;Built on Mon_Feb_16_22:59:02_CST_2015;Cuda compilation tools, release 7.0, V7.0.27
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_30,code=compute_30;-use_fast_math;-D_FORCE_INLINES;; ;-mavx;-std=c++11;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wmissing-declarations;-Wall;-g;-fno-inline;
CUDA driver: 10.0
CUDA runtime: 7.0
Running with -pmefft cpu or -pme cpu is ok.
Running with
gmx mdrun -ntmpi 2 -bonded cpu -pme gpu -pmefft gpu -notunepme -npme 1
fails in the same cuFFT call.
Associated revisions
History
#1 Updated by Szilárd Páll about 2 years ago
(Misread the original issue, not realizing this is a cuFFT crash, not a clFFT one.)
The issue might have something to do with cuFFT 7.0 + CUDA 10 drivers; we should try to reproduce with the same cuFFT + an older driver and with a newer cuFFT + the same driver.
#2 Updated by Szilárd Páll about 2 years ago
- Status changed from New to Blocked, need info
Made multiple attempts, but cannot reproduce it.
#3 Updated by Szilárd Páll about 2 years ago
- Subject changed from segfault in cufft pme to FP exception in cufft 7.0
OK, was on the wrong track. This is not a SIGSEGV, but a SIGFPE.
#4 Updated by Szilárd Páll about 2 years ago
Suggested solution: don't use CUDA 7.0 / document it as a known issue. Results are correct, so this only affects devs who build in debug mode and, for whatever reason, use such an ancient CUDA version.
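For context on why correct results and an aborting debug build can coexist: a floating-point exception normally only sets a status flag and the computation carries on, whereas a build that turns on trapping (for example via glibc's feenableexcept, a GNU extension) gets a SIGFPE from the very same operation. The sketch below is illustrative only. It uses the standard <cfenv> flag interface, the helper name raisesDivByZero is hypothetical, and the division by zero is a stand-in, since this report does not identify which operation inside cuFFT raises the exception.

```cpp
#include <cfenv>
#include <cmath>

// Hypothetical helper for illustration (not GROMACS or cuFFT code).
// Performs numerator / denominator, stores the quotient in *result, and
// reports whether the division raised the FE_DIVBYZERO status flag.
// With trapping disabled (the default), the flag is set silently and the
// program continues; with FE_DIVBYZERO trapping enabled, the same division
// would deliver SIGFPE instead of returning.
bool raisesDivByZero(double numerator, double denominator, double* result)
{
    // Start from a clean slate so that only this division is observed.
    std::feclearexcept(FE_ALL_EXCEPT);

    // volatile keeps the compiler from folding the division at compile time.
    volatile double n = numerator;
    volatile double d = denominator;
    volatile double q = n / d;

    *result = q;
    return std::fetestexcept(FE_DIVBYZERO) != 0;
}
```

Calling raisesDivByZero(1.0, 0.0, &r) in a default environment returns true and leaves an infinity in r without stopping the program, which matches the observation that non-debug builds run to completion.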
#5 Updated by Gerrit Code Review Bot about 2 years ago
Gerrit received a related patchset '1' for Issue #2726.
Uploader: Szilárd Páll (pall.szilard@gmail.com)
Change-Id: gromacs~release-2019~I60c7b92da2c703e0910644012a01a009b3df0a7a
Gerrit URL: https://gerrit.gromacs.org/8806
#6 Updated by Szilárd Páll about 2 years ago
- Status changed from Blocked, need info to Resolved
Applied in changeset c47fc09c3a782b0934408b792e963b57b676e744.
#7 Updated by Paul Bauer about 2 years ago
- Status changed from Resolved to Closed
Document the FP exception in cuFFT 7.0
To document known issues relevant for developers, a new section is added
to the dev guide.
As a first entry, this commit documents the FP exception that aborts
debug builds of mdrun that offload FFTs to an NVIDIA GPU with CUDA 7.0.
Fixes #2726
Change-Id: I60c7b92da2c703e0910644012a01a009b3df0a7a