Bug #3398

Intermittent failure of non-bonded kernels when run using nvprof

Added by Jonathan Vincent about 1 month ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

We have been seeing intermittent failures of GROMACS when running under nvprof and its replacement, Nsight Systems.

-------------------------------------------------------
Program: gmx mdrun, version 2019.3
Source file: src/gromacs/gpu_utils/cudautils.cuh (line 251)

Fatal error:
Unexpected cudaStreamQuery failure: an illegal memory access was encountered

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

What we have seen so far

  • Not seen when running without nvprof/Nsight Systems
  • Intermittent, with a failure rate of around 10%-20%
  • Seen in multiple GROMACS versions (2020, 2019.4, 2019.3)
  • Not seen on all GPUs; reported on RTX 2080 and Titan RTX
  • Seen on multiple CUDA versions
  • Not seen when Debug mode is enabled
  • Does not seem to be affected by the GCC version
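
Since the failure is intermittent, a small harness that repeats the run and counts failures may help when comparing configurations. The following is only a sketch: the function name `count_failures` is made up for illustration, and it assumes the fatal error text shown above appears on stdout or stderr of the profiled command.

```python
import subprocess

# Error text taken from the fatal error message quoted in this report.
ERROR_MARKER = "an illegal memory access was encountered"


def count_failures(command, runs, marker=ERROR_MARKER):
    """Run `command` `runs` times and count the runs whose output contains `marker`.

    `command` is an argument list, as accepted by subprocess.run.
    """
    failures = 0
    for _ in range(runs):
        # Capture both streams; the marker is searched for in either of them.
        result = subprocess.run(command, capture_output=True, text=True)
        if marker in result.stdout + result.stderr:
            failures += 1
    return failures
```

For example, `count_failures(["nvprof", "gmx", "mdrun", "-deffnm", "md"], 50)` would repeat a profiled run 50 times; the input name `md` is a placeholder, and the actual mdrun arguments depend on the case being tested.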

The failure is in the non-bonded kernel stream. With the non-bonded kernels (and the neighbour list purge) turned off, I could not reproduce the problem. I am currently trying to narrow down which array is the issue.
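
With a failure rate this low, "could not reproduce" only means something after enough clean runs. As a rough guide, assuming runs are independent and fail with a fixed probability p, the chance of seeing n clean runs while the bug is still present is (1 - p)^n. The sketch below (the function name and the 5% threshold are illustrative choices, not part of any GROMACS tooling) computes how many consecutive clean runs are needed to push that chance below a threshold.

```python
import math


def clean_runs_needed(failure_rate, max_false_pass=0.05):
    """Smallest n such that (1 - failure_rate) ** n <= max_false_pass.

    Assumes independent runs with a constant per-run failure probability.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < max_false_pass < 1.0:
        raise ValueError("max_false_pass must be strictly between 0 and 1")
    return math.ceil(math.log(max_false_pass) / math.log(1.0 - failure_rate))


print(clean_runs_needed(0.10))  # 29 clean runs at a 10% failure rate
print(clean_runs_needed(0.20))  # 14 clean runs at a 20% failure rate
```

So at the lower end of the observed failure rate, roughly 29 consecutive clean runs are needed before a configuration can reasonably be called unaffected.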

0001.5_md_use_default_gcc.log (16 KB), Jonathan Vincent, 02/25/2020 11:42 AM
0012_md_use_gcc8.2.log (15.4 KB), Jonathan Vincent, 02/25/2020 11:42 AM
0096_md_use_gcc8.2.log (15.7 KB), Jonathan Vincent, 02/25/2020 11:42 AM
