Bug #2321

mdrun exits with buffer registering error on non-GPU host

Added by Szilárd Páll over 1 year ago. Updated over 1 year ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized

Description

On a host where GPU device detection already fails, mdrun would previously default to CPU mode. The 2018-beta1, however, tries to pin some buffers, which can fail, e.g., due to no driver being installed:

$ gmx mdrun -quiet -ntmpi 1 -nsteps 0

Back Off! I just backed up md.log to ./#md.log.8#
NOTE: Error occurred during GPU detection:
      CUDA driver version is insufficient for CUDA runtime version
      Can not use GPU acceleration, will fall back to CPU kernels.

Reading file topol.tpr, VERSION 5.0.2-dev-20140905-f878c88 (single precision)
Note: file tpx version 100, software tpx version 112

Overriding nsteps with value passed on the command line: 0 steps, 0 ps
Changing nstlist from 20 to 80, rlist from 1.031 to 1.147

Using 1 MPI thread
Using 16 OpenMP threads 

WARNING: Could not register the host memory for page locking for GPU transfers. An unhandled error from a previous CUDA operation was detected. cudaErrorInsufficientDriver: CUDA driver version is insufficient for CUDA runtime version

-------------------------------------------------------
Program:     gmx mdrun, version 2018-beta1-dev-20171202-ef012f1
Source file: src/gromacs/gpu_utils/pinning.cu (line 95)
Function:    void gmx::pinBuffer(void*, std::size_t)

Assertion failed:
Condition: stat == cudaSuccess
Could not register the host memory for page locking for GPU transfers.
cudaErrorInsufficientDriver: CUDA driver version is insufficient for CUDA
runtime version

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

This means that a CUDA-enabled mdrun build will often (always?) fail on a GPU-less host.

test_1x32_nb-cpu_pme-cpu_auto-launch.log (14.6 KB) Szilárd Páll, 12/11/2017 09:41 PM
md.log (14.5 KB) Szilárd Páll, 12/11/2017 09:41 PM
test_16x2_nb-cpu_pme-cpu_auto-launch.log (26.9 KB) Szilárd Páll, 12/11/2017 09:41 PM

Related issues

Related to GROMACS - Bug #2315: Separate PME ranks are not assigned since e87a53 (Closed)

Associated revisions

Revision 29c059ad (diff)
Added by Mark Abraham over 1 year ago

Consume any error produced during GPU detection

Having reported it, we should clear the CUDA error status so that
future calls do not continue to return it.

Fixes #2321

Change-Id: Id5c6445074b6b835296fcb544b7fc94168edc974

Revision 6803c658 (diff)
Added by Mark Abraham over 1 year ago

Separate canDetectGpus and findGpus further, and fix tests

Renamed detect_gpus to findGpus so that no code can silently call
detect_gpus while forgetting to call the required canDetectGpus first.
Some test code is updated accordingly, which should have happened
earlier. The function with the new name now needs no return value, so
the formerly confusing return value of zero for success is no longer
present.

Shifted some more responsibilities from findGpus to canDetectGpus, so
that the latter now has responsibility for ensuring that when it
returns true, the former will always succeed.

Fixed tests that compile with CUDA, but cannot run unless there
are visible compatible devices and a valid context.

Refs #2347, #2322, #2321

Change-Id: I34acf8be4c0f0dcc29e931d83c970ba945865ca7

History

#1 Updated by Szilárd Páll over 1 year ago

  • Affected version changed from 2018 to 2018-beta1

#2 Updated by Szilárd Páll over 1 year ago

  • Subject changed from mdrun exits with pinning error on non-GPU host to mdrun exits with buffer registering error on non-GPU host

#3 Updated by Mark Abraham over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Mark Abraham
  • Target version changed from 2018 to 2018-beta2

#4 Updated by Mark Abraham over 1 year ago

  • Target version changed from 2018-beta2 to 2018-beta3

I ran out of time to look into this, so bumping

#5 Updated by Mark Abraham over 1 year ago

There are two issues here: the cudaErrorInsufficientDriver ought to be cleared by the GPU-detection code, e.g. cudaGetLastError should be called in the error path of detect_gpus(). Probably this has been silently not causing problems for a while.

The way this run erroneously took a PME-on-GPU code path was fixed by a0d89bf8e412554a58e803715847b0f9dd920886 while fixing #2315.

#6 Updated by Mark Abraham over 1 year ago

  • Related to Bug #2315: Separate PME ranks are not assigned since e87a53 added

#7 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related patchset '1' for Issue #2321.
Uploader: Mark Abraham ()
Change-Id: gromacs~release-2018~Id5c6445074b6b835296fcb544b7fc94168edc974
Gerrit URL: https://gerrit.gromacs.org/7316

#8 Updated by Mark Abraham over 1 year ago

  • Status changed from In Progress to Fix uploaded

#9 Updated by Mark Abraham over 1 year ago

  • Target version changed from 2018-beta3 to 2018-beta2

#10 Updated by Mark Abraham over 1 year ago

  • Status changed from Fix uploaded to Resolved

#11 Updated by Erik Lindahl over 1 year ago

  • Status changed from Resolved to Closed

#12 Updated by Szilárd Páll over 1 year ago

This does not seem to be resolved, and somehow the issue is linked to DD vs. no-DD runs: in the former I see no errors, but the latter still seems to want to use a GPU when none was detected -- the OpenMP thread heuristics also kick in; see the second log attached.

#13 Updated by Mark Abraham over 1 year ago

Szilárd Páll wrote:

This does not seem to be resolved, and somehow the issue is linked to DD vs. no-DD runs: in the former I see no errors, but the latter still seems to want to use a GPU when none was detected -- the OpenMP thread heuristics also kick in; see the second log attached.

Those log files were generated with gmx from 700d6f3, which unfortunately is missing more than 30 commits, including those where I addressed issues identified here. So I hope these are actually fixed :-)

#14 Updated by Szilárd Páll over 1 year ago

Mark Abraham wrote:

Szilárd Páll wrote:

This does not seem to be resolved, and somehow the issue is linked to DD vs. no-DD runs: in the former I see no errors, but the latter still seems to want to use a GPU when none was detected -- the OpenMP thread heuristics also kick in; see the second log attached.

Those log files were generated with gmx from 700d6f3, which unfortunately is missing more than 30 commits, including those where I addressed issues identified here. So I hope these are actually fixed :-)

Sorry about that, looks like I mixed up my binaries/directories. Will retest.

#15 Updated by Szilárd Páll over 1 year ago

  • Status changed from Blocked, need info to Closed

I've been running tests on hosts with "insufficient CUDA driver" and everything seems to be in order. Sorry for the noise.

#16 Updated by Mark Abraham over 1 year ago

Szilárd Páll wrote:

I've been running tests on hosts with "insufficient CUDA driver" and everything seems to be in order. Sorry for the noise.

No worries, thanks for the vigilance

#17 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related patchset '1' for Issue #2321.
Uploader: Mark Abraham ()
Change-Id: gromacs~release-2018~I34acf8be4c0f0dcc29e931d83c970ba945865ca7
Gerrit URL: https://gerrit.gromacs.org/7349
