Bug #2726

segfault in cufft pme

Added by Mark Abraham 19 days ago. Updated 19 days ago.

Status: New
Priority: Normal
Assignee: -
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

With release-2019 HEAD on bs-gpu01 (gcc 4.8 and CUDA 7.0 from modules, debug build), for various regression tests (here complex/dd121) I run

gmx mdrun -ntmpi 1 -bonded cpu -pme gpu -pmefft gpu -notunepme

...

Reading file topol.tpr, VERSION 2019-beta2-dev-20181101-f31fe12 (single precision)
Can not increase nstlist because verlet-buffer-tolerance is not set or used
Using 1 MPI thread
Using 6 OpenMP threads 

1 GPU auto-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU

and get a segfault. Backtrace from gdb:

#0  0x00007ffff00dccb2 in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#1  0x00007ffff00dde57 in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#2  0x00007ffff00dc19a in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#3  0x00007ffff00db693 in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#4  0x00007ffff00da12f in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#5  0x00007ffff008cec7 in ?? () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#6  0x00007ffff008d087 in cufftLockPlan () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#7  0x00007ffff00e1aa3 in cufftMakePlanMany () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#8  0x00007ffff00e2fa2 in cufftPlanMany () from /opt/tcbsys/cuda/7.0/lib64/libcufft.so.7.0
#9  0x00007ffff5984113 in GpuParallel3dFft::GpuParallel3dFft (this=0x1a316e0, pmeGpu=0x191d790) at /nethome/mabraham/git/master/src/gromacs/ewald/pme-gpu-3dfft.cu:95
#10 0x00007ffff598ba25 in gmx::compat::make_unique<GpuParallel3dFft<PmeGpu const*&> > () at /nethome/mabraham/git/master/src/gromacs/compat/make_unique.h:107
#11 0x00007ffff5987361 in pme_gpu_reinit_3dfft (pmeGpu=0x191d790) at /nethome/mabraham/git/master/src/gromacs/ewald/pme-gpu-internal.cpp:582
#12 0x00007ffff5988027 in pme_gpu_reinit_grids (pmeGpu=0x191d790) at /nethome/mabraham/git/master/src/gromacs/ewald/pme-gpu-internal.cpp:720
#13 0x00007ffff5988e6e in pme_gpu_reinit (pme=0xe0c2e0, gpuInfo=0x6e09b0, pmeGpuProgram=0x979fa0) at /nethome/mabraham/git/master/src/gromacs/ewald/pme-gpu-internal.cpp:898
#14 0x00007ffff72510f5 in gmx_pme_init (cr=0x6829a0, numPmeDomains=..., ir=0x7fffffffbd60, homenr=1789, bFreeEnergy_q=false, bFreeEnergy_lj=false, bReproducible=false, ewaldcoeff_q=3.67460346, 
    ewaldcoeff_lj=0, nthread=6, runMode=PmeRunMode::GPU, pmeGpu=0x0, gpuInfo=0x6e09b0, pmeGpuProgram=0x979fa0) at ../src/gromacs/ewald/pme.cpp:966
#15 0x00007ffff72bbe08 in gmx::Mdrunner::mdrunner (this=0x7fffffffc7f0) at ../src/gromacs/mdrun/runner.cpp:1331
#16 0x000000000040b757 in gmx::gmx_mdrun (argc=10, argv=0x7fffffffd460) at ../src/programs/mdrun/mdrun.cpp:292
#17 0x00007ffff644c103 in gmx::(anonymous namespace)::CMainCommandLineModule::run (this=0x678340, argc=10, argv=0x7fffffffd460) at ../src/gromacs/commandline/cmdlinemodulemanager.cpp:133
#18 0x00007ffff644dbf8 in gmx::CommandLineModuleManager::run (this=0x7fffffffd330, argc=10, argv=0x7fffffffd460) at ../src/gromacs/commandline/cmdlinemodulemanager.cpp:589
#19 0x000000000040f9ae in main (argc=11, argv=0x7fffffffd458) at ../src/programs/gmx.cpp:60
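
The backtrace points at the cufftPlanMany() call made in the GpuParallel3dFft constructor. For reference, a minimal standalone sketch (not GROMACS code) that exercises the same plan-creation path; the grid size is hypothetical, since the dd121 case uses whatever PME grid mdrun chooses:

// reproducer.cu -- minimal sketch, not GROMACS code; build with
//   nvcc reproducer.cu -lcufft -o reproducer
#include <cstdio>
#include <cufft.h>

int main()
{
    const int rank               = 3;
    int       realGridSize[3]    = { 32, 32, 32 };          // hypothetical nx, ny, nz
    int       complexGridSize[3] = { 32, 32, 32 / 2 + 1 };  // R2C minor dimension is nz/2 + 1

    const int realTotal    = realGridSize[0] * realGridSize[1] * realGridSize[2];
    const int complexTotal = complexGridSize[0] * complexGridSize[1] * complexGridSize[2];

    cufftHandle plan;
    cufftResult result = cufftPlanMany(&plan, rank, realGridSize,
                                       realGridSize, 1, realTotal,        // inembed, istride, idist
                                       complexGridSize, 1, complexTotal,  // onembed, ostride, odist
                                       CUFFT_R2C, 1);                     // single batch
    std::printf("cufftPlanMany returned %d\n", static_cast<int>(result));
    if (result == CUFFT_SUCCESS)
    {
        cufftDestroy(plan);
    }
    return 0;
}

If this also crashes with the CUDA 7.0 toolkit under the 10.0 driver on bs-gpu01, the problem would be outside GROMACS.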

More diagnostics:

             :-) GROMACS - gmx, 2019-beta2-dev-20181101-56b66d5 (-:

Executable:   /nethome/mabraham/git/master/build-cmake-gcc-4.8-cuda-7.0-debug/install/bin/gmx
Data prefix:  /nethome/mabraham/git/master/build-cmake-gcc-4.8-cuda-7.0-debug/install
Working dir:  /nethome/mabraham/nfs-git/regressiontests
Command line:
  gmx -version -quiet

GROMACS version:    2019-beta2-dev-20181101-56b66d5
GIT SHA1 hash:      56b66d5a2c5e037c1140df9546506acfb95f7b04
Branched from:      5abf87a1bf3a140701a7ffb0f82e21238dc6d232 (26 newer local commits)
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX_256
FFT library:        fftw-3.3.6-pl1-sse2
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-1.11.2
Tracing support:    disabled
C compiler:         /opt/tcbsys/gcc/4.8.5/bin/gcc GNU 4.8.5
C compiler flags:    -mavx    -Wundef -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value -Wunused-parameter  -g -fno-inline 
C++ compiler:       /opt/tcbsys/gcc/4.8.5/bin/g++ GNU 4.8.5
C++ compiler flags:  -mavx    -std=c++11  -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wmissing-declarations -Wall  -g -fno-inline 
CUDA compiler:      /opt/tcbsys/cuda/7.0/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2015 NVIDIA Corporation;Built on Mon_Feb_16_22:59:02_CST_2015;Cuda compilation tools, release 7.0, V7.0.27
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_30,code=compute_30;-use_fast_math;-D_FORCE_INLINES;; ;-mavx;-std=c++11;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wmissing-declarations;-Wall;-g;-fno-inline;
CUDA driver:        10.0
CUDA runtime:       7.0

Running with -pmefft cpu or -pme cpu is ok.

Running with gmx mdrun -ntmpi 2 -bonded cpu -pme gpu -pmefft gpu -notunepme -npme 1 fails in the same cuFFT call.

History

#1 Updated by Szilárd Páll 19 days ago

(I misread the original issue, not realizing this is a cuFFT crash rather than a clFFT crash.)

The issue might have something to do with the combination of cuFFT 7.0 and a CUDA 10 driver; we should try to reproduce with the same cuFFT and an older driver, and with a newer cuFFT and the same driver.
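
A quick way to double-check which driver/runtime pair the affected node actually reports (a standalone sketch, not part of GROMACS; build with nvcc):

// versioncheck.cu -- standalone sketch; build with: nvcc versioncheck.cu -o versioncheck
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driverVersion  = 0;
    int runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);    // e.g. 10000 for a 10.0 driver
    cudaRuntimeGetVersion(&runtimeVersion);  // e.g. 7000 for the 7.0 runtime
    std::printf("CUDA driver %d, runtime %d\n", driverVersion, runtimeVersion);
    return 0;
}

The gmx -version output above already reports driver 10.0 with runtime 7.0, which is exactly the combination being blamed here.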
