Bug #3239

GPU DD direct communication with GPU update error with RF

Added by Szilárd Páll 10 months ago. Updated 10 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Affected version - extra info:
2020-beta3-dev-20191212-2c760d2 + GPU update limitations removed
Affected version:
Difficulty:
uncategorized

Description

The ion_channel system set up with h-bond constraints and reaction-field electrostatics, when run with direct GPU DD communication enabled on a 2xGV100 machine, throws the following error:

$ GMX_GPU_DD_COMMS=1 GMX_USE_GPU_BUFFER_OPS=1 GMX_GPU_PME_PP_COMMS=1  $gmx mdrun -ntmpi 8 $opts -nb gpu -bonded gpu -nsteps 20000 -notunepme -update gpu -s rf -nstlist 100

[...]

Command line:
  gmx mdrun -ntmpi 8 -v -resethway -noconfout -pin on -nb gpu -bonded gpu -nsteps 20000 -notunepme -update gpu -s rf -nstlist 100

Back Off! I just backed up md.log to ./#md.log.39#
Compiled SIMD: AVX2_256, but for this host/run AVX_512 might be better (see
log).
Reading file rf.tpr, VERSION 2020-beta3-dev-20191212-2c760d2-dirty (single precision)
NOTE: This run uses the 'GPU buffer ops' feature, enabled by the GMX_USE_GPU_BUFFER_OPS environment variable.

NOTE: This run uses the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.

NOTE: GMX_GPU_PME_PP_COMMS environment variable detected, but the 'GPU PME-PP communications' feature was not enabled as PME is not offloaded to the GPU.

Overriding nsteps with value passed on the command line: 20000 steps, 50 ps
Changing nstlist from 10 to 100, rlist from 1.127 to 1.621

On host skylake-x-gpu01 2 GPUs selected for this run.
Mapping of GPU IDs to the 8 GPU tasks in the 8 ranks on this node:
  PP:0,PP:0,PP:0,PP:0,PP:1,PP:1,PP:1,PP:1
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
Using 8 MPI threads
Using 3 OpenMP threads per tMPI thread

WARNING: There are no atom pairs for dispersion correction

Back Off! I just backed up ener.edr to ./#ener.edr.59#
starting mdrun 'Protein'
20000 steps,     50.0 ps.
step 0
-------------------------------------------------------
Program:     gmx mdrun, version 2020-beta3-dev-20191212-2c760d2-dirty
Source file: src/gromacs/gpu_utils/cudautils.cuh (line 238)
MPI rank:    0 (out of 8)

Fatal error:
cudaStreamSynchronize failed: an illegal memory access was encountered

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
rf.tpr (18.5 MB), Szilárd Páll, 12/13/2019 06:30 PM
md.log (21.1 KB), Szilárd Páll, 12/13/2019 06:31 PM
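
For context: an "illegal memory access" returned by cudaStreamSynchronize is usually raised by an earlier, asynchronously launched kernel and only surfaces at the next synchronization point, so the message alone does not identify the faulting kernel. A minimal sketch of how the fault could be localized, reusing the mdrun options from the command above with a shortened run; CUDA_LAUNCH_BLOCKING and cuda-memcheck are generic CUDA tooling, not GROMACS features:

# Make kernel launches synchronous so the error is reported at the
# offending launch rather than at the next cudaStreamSynchronize.
$ CUDA_LAUNCH_BLOCKING=1 GMX_GPU_DD_COMMS=1 GMX_USE_GPU_BUFFER_OPS=1 \
    gmx mdrun -ntmpi 8 -nb gpu -bonded gpu -update gpu -s rf -nstlist 100 -nsteps 2000

# Or run under cuda-memcheck to get the kernel name and the faulting address.
$ cuda-memcheck --tool memcheck \
    gmx mdrun -ntmpi 8 -nb gpu -bonded gpu -update gpu -s rf -nstlist 100 -nsteps 2000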

History

#1 Updated by Szilárd Páll 10 months ago

Side note: -ntmpi 4 and 6 do work.
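
A quick sweep over thread-MPI rank counts makes that pattern easy to confirm; a minimal sketch, assuming the same rf.tpr and environment variables as in the description (the per-run log names are only for bookkeeping):

# Per the note above, 4 and 6 ranks pass while 8 fails on this 2-GPU node.
$ for n in 2 4 6 8; do
    GMX_GPU_DD_COMMS=1 GMX_USE_GPU_BUFFER_OPS=1 \
      gmx mdrun -ntmpi $n -nb gpu -bonded gpu -update gpu -s rf \
                -nstlist 100 -nsteps 20000 -notunepme -noconfout -g ntmpi_$n.log
  done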

#2 Updated by Alan Gray 10 months ago

I'm not able to reproduce this: the test case works fine for me with the same options on 2xV100-PCIe, both with update on the CPU and on the GPU, for the latest commit in the release-2020 branch (commit c981aa8). @Szilard, I'm not sure which exact version you are testing with; can you try with this same commit? Note that, for update to run on the GPU, this also needs export GMX_FORCE_UPDATE_DEFAULT_GPU=true (until https://gerrit.gromacs.org/c/gromacs/+/14742 is merged).
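
For reference, a reproduction attempt against the tip of release-2020 might look like the following; this is a sketch only, with the mdrun options copied from the original report and GMX_FORCE_UPDATE_DEFAULT_GPU being the variable mentioned above:

# Assumes a build of the release-2020 branch at commit c981aa8.
$ export GMX_FORCE_UPDATE_DEFAULT_GPU=true   # needed for GPU update until the Gerrit change above is merged
$ GMX_GPU_DD_COMMS=1 GMX_USE_GPU_BUFFER_OPS=1 GMX_GPU_PME_PP_COMMS=1 \
    gmx mdrun -ntmpi 8 -nb gpu -bonded gpu -update gpu -s rf \
              -nstlist 100 -nsteps 20000 -notunepme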
