Bug #2726

FP exception in cufft 7.0

Added by Mark Abraham about 1 year ago. Updated 11 months ago.

Status: Closed
Priority: Normal
Assignee: -
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

With release-2019 HEAD, on bs-gpu01 (gcc 4.8 and cuda 7.0 from modules, debug build), with various regressiontests (here complex/dd121) I run with

gmx mdrun -ntmpi 1 -bonded cpu -pme gpu -pmefft gpu -notunepme

...

Reading file topol.tpr, VERSION 2019-beta2-dev-20181101-f31fe12 (single precision)
Can not increase nstlist because verlet-buffer-tolerance is not set or used
Using 1 MPI thread
Using 6 OpenMP threads 

1 GPU auto-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU

and get a segfault. Backtrace from gdb:

#0  0x00007ffff00dccb2 in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#1  0x00007ffff00dde57 in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#2  0x00007ffff00dc19a in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#3  0x00007ffff00db693 in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#4  0x00007ffff00da12f in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#5  0x00007ffff008cec7 in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#6  0x00007ffff008d087 in cufftLockPlan () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#7  0x00007ffff00e1aa3 in cufftMakePlanMany () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#8  0x00007ffff00e2fa2 in cufftPlanMany () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#9  0x00007ffff5984113 in GpuParallel3dFft::GpuParallel3dFft (this=0x1a316e0, pmeGpu=0x191d790) at /nethome/mabraham/git/master/src/gromacs/ewald/pme-gpu-3dfft.cu:95
#10 0x00007ffff598ba25 in gmx::compat::make_unique<GpuParallel3dFft<PmeGpu const*&> > () at /nethome/mabraham/git/master/src/gromacs/compat/make_unique.h:107
#11 0x00007ffff5987361 in pme_gpu_reinit_3dfft (pmeGpu=0x191d790) at /nethome/mabraham/git/master/src/gromacs/ewald/pme-gpu-internal.cpp:582
#12 0x00007ffff5988027 in pme_gpu_reinit_grids (pmeGpu=0x191d790) at /nethome/mabraham/git/master/src/gromacs/ewald/pme-gpu-internal.cpp:720
#13 0x00007ffff5988e6e in pme_gpu_reinit (pme=0xe0c2e0, gpuInfo=0x6e09b0, pmeGpuProgram=0x979fa0) at /nethome/mabraham/git/master/src/gromacs/ewald/pme-gpu-internal.cpp:898
#14 0x00007ffff72510f5 in gmx_pme_init (cr=0x6829a0, numPmeDomains=..., ir=0x7fffffffbd60, homenr=1789, bFreeEnergy_q=false, bFreeEnergy_lj=false, bReproducible=false, ewaldcoeff_q=3.67460346, 
    ewaldcoeff_lj=0, nthread=6, runMode=PmeRunMode::GPU, pmeGpu=0x0, gpuInfo=0x6e09b0, pmeGpuProgram=0x979fa0) at ../src/gromacs/ewald/pme.cpp:966
#15 0x00007ffff72bbe08 in gmx::Mdrunner::mdrunner (this=0x7fffffffc7f0) at ../src/gromacs/mdrun/runner.cpp:1331
#16 0x000000000040b757 in gmx::gmx_mdrun (argc=10, argv=0x7fffffffd460) at ../src/programs/mdrun/mdrun.cpp:292
#17 0x00007ffff644c103 in gmx::(anonymous namespace)::CMainCommandLineModule::run (this=0x678340, argc=10, argv=0x7fffffffd460) at ../src/gromacs/commandline/cmdlinemodulemanager.cpp:133
#18 0x00007ffff644dbf8 in gmx::CommandLineModuleManager::run (this=0x7fffffffd330, argc=10, argv=0x7fffffffd460) at ../src/gromacs/commandline/cmdlinemodulemanager.cpp:589
#19 0x000000000040f9ae in main (argc=11, argv=0x7fffffffd458) at ../src/programs/gmx.cpp:60

More diagnostics:

             :-) GROMACS - gmx, 2019-beta2-dev-20181101-56b66d5 (-:

Executable:   /nethome/mabraham/git/master/build-cmake-gcc-4.8-cuda-7.0-debug/install/bin/gmx
Data prefix:  /nethome/mabraham/git/master/build-cmake-gcc-4.8-cuda-7.0-debug/install
Working dir:  /nethome/mabraham/nfs-git/regressiontests
Command line:
  gmx -version -quiet

GROMACS version:    2019-beta2-dev-20181101-56b66d5
GIT SHA1 hash:      56b66d5a2c5e037c1140df9546506acfb95f7b04
Branched from:      5abf87a1bf3a140701a7ffb0f82e21238dc6d232 (26 newer local commits)
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX_256
FFT library:        fftw-3.3.6-pl1-sse2
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-1.11.2
Tracing support:    disabled
C compiler:         /opt/tcbsys/gcc/4.8.5/bin/gcc GNU 4.8.5
C compiler flags:    -mavx    -Wundef -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value -Wunused-parameter  -g -fno-inline 
C++ compiler:       /opt/tcbsys/gcc/4.8.5/bin/g++ GNU 4.8.5
C++ compiler flags:  -mavx    -std=c++11  -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wmissing-declarations -Wall  -g -fno-inline 
CUDA compiler:      /opt/tcbsys/cuda/7.0/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2015 NVIDIA Corporation;Built on Mon_Feb_16_22:59:02_CST_2015;Cuda compilation tools, release 7.0, V7.0.27
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_30,code=compute_30;-use_fast_math;-D_FORCE_INLINES;; ;-mavx;-std=c++11;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wmissing-declarations;-Wall;-g;-fno-inline;
CUDA driver:        10.0
CUDA runtime:       7.0

Running with -pmefft cpu or -pme cpu is ok.

Running with gmx mdrun -ntmpi 2 -bonded cpu -pme gpu -pmefft gpu -notunepme -npme 1 fails in the same cufft call.

Associated revisions

Revision c47fc09c
Added by Szilárd Páll 11 months ago

Document the FP exception in cuFFT 7.0

To document known issues relevant for developers, a new section is added
to the dev guide.
As a first entry, this commit documents the FP exception that aborts
debug builds of mdrun that offload FFTs to an NVIDIA GPU with CUDA 7.0.

Fixes #2726

Change-Id: I60c7b92da2c703e0910644012a01a009b3df0a7a

History

#1 Updated by Szilárd Páll about 1 year ago

(Misread the original issue, not realizing this is a cuFFT crash rather than a clFFT one.)

The issue might have something to do with cuFFT 7.0 + CUDA 10 drivers; we should try to reproduce with the same cuFFT + an older driver, and with a newer cuFFT + the same driver.

#2 Updated by Szilárd Páll 11 months ago

  • Status changed from New to Blocked, need info

Made multiple attempts, but cannot reproduce it.

#3 Updated by Szilárd Páll 11 months ago

  • Subject changed from segfault in cufft pme to FP exception in cufft 7.0

OK, was on the wrong track. This is not a SIGSEGV, but a SIGFPE.

#4 Updated by Szilárd Páll 11 months ago

Suggested solution: don't use CUDA 7.0; document it as a known issue. Results are correct, so this only affects developers who build in debug mode and, for whatever reason, use such an ancient CUDA version.
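
To illustrate why correct results and a crash can coexist: a floating-point exception is normally just a status flag that the program never looks at, and it only turns into a fatal SIGFPE when trapping has been switched on. Presumably that is what debug builds do (an assumption here; the issue only states that debug builds abort). The sketch below is not GROMACS or cuFFT code. divisionRaisesFlag is a hypothetical helper that uses only standard <cfenv> calls to show the untrapped case, where the flag is raised and execution simply continues.

```cpp
#include <cfenv>

// Hypothetical helper, for illustration only (not part of GROMACS or cuFFT).
// Performs a floating-point division by zero and reports whether the
// FE_DIVBYZERO status flag was raised. No trap is enabled in this function,
// so under the default floating-point environment the division returns
// normally instead of raising a signal.
bool divisionRaisesFlag()
{
    // Start from a clean set of status flags.
    std::feclearexcept(FE_ALL_EXCEPT);

    // volatile keeps the compiler from folding the division at compile time.
    volatile double numerator   = 1.0;
    volatile double denominator = 0.0;
    volatile double quotient    = numerator / denominator;
    (void)quotient;

    // Nonzero if the division set the divide-by-zero flag.
    return std::fetestexcept(FE_DIVBYZERO) != 0;
}
```

Calling divisionRaisesFlag() from main returns true without interrupting the program. If trapping were enabled first (for example with the glibc extension feenableexcept(FE_DIVBYZERO), which is not standard C++), the same division would deliver SIGFPE instead of returning.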

#5 Updated by Gerrit Code Review Bot 11 months ago

Gerrit received a related patchset '1' for Issue #2726.
Uploader: Szilárd Páll
Change-Id: gromacs~release-2019~I60c7b92da2c703e0910644012a01a009b3df0a7a
Gerrit URL: https://gerrit.gromacs.org/8806

#6 Updated by Szilárd Páll 11 months ago

  • Status changed from Blocked, need info to Resolved

#7 Updated by Paul Bauer 11 months ago

  • Status changed from Resolved to Closed
