Bug #2067

mdrun ignores GPUs being requested if detection fails or is skipped

Added by Szilárd Páll almost 3 years ago. Updated over 1 year ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
mdrun
Target version:
Affected version - extra info:
2016
Affected version:
Difficulty:
uncategorized

Description

If GPU detection either fails (driver not loaded, no devices present, etc.) or is disabled (by an environment variable or -nb cpu), all GPU requests on the command line are ignored by mdrun and the CPU fallback is used.

Reproducers:

CUDA_VISIBLE_DEVICES= $gmx mdrun -gpu_id 1 # GPU ID 1 requested, none detected, but CPU kernels used

CUDA_VISIBLE_DEVICES= $gmx mdrun -nb gpu # GPU non-bondeds requested, no GPU detected, but CPU kernels used

The above two cases should definitely result in errors as the user obviously requested GPUs to be used.

Additionally, if we consider passing -gpu_id to be a non-optional requirement for GPUs to be used, then the following should possibly also issue an error, but this is more a matter of interpretation:

gmx mdrun -gpu_id 0 -nb cpu # CPU run requested; GPU ID is ignored
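
As a rough model of the behaviour reported in this description (not the actual mdrun code), the sketch below shows why all three command lines end up on CPU kernels. The function name, its parameters and the "cpu"/"gpu" return strings are hypothetical, and the case where devices are detected is deliberately oversimplified.

```python
from typing import Optional, Sequence


def kernels_before_fix(nb: str, gpu_id: Optional[str],
                       detected_gpus: Sequence[int]) -> str:
    """Hypothetical model of the reported behaviour; returns "cpu" or "gpu".

    nb            -- value passed to -nb ("auto", "cpu" or "gpu")
    gpu_id        -- string passed to -gpu_id, or None if not given
    detected_gpus -- device IDs found by detection (empty if it failed)
    """
    # Per the description, -nb cpu disables detection, so no device is seen.
    visible = [] if nb == "cpu" else list(detected_gpus)
    if not visible:
        # The reported bug: gpu_id and -nb gpu are never consulted here,
        # so the request is dropped silently instead of raising an error.
        return "cpu"
    # Simplification (assumption): any visible device means GPU kernels.
    return "gpu"
```

Under this model, kernels_before_fix("auto", "1", []), kernels_before_fix("gpu", None, []) and kernels_before_fix("cpu", "0", [0]) all select the CPU path, matching the three reproducers.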


Related issues

Related to GROMACS - Bug #2004: parallelism selection code needs work (Closed)

Associated revisions

Revision b9a8c49d (diff)
Added by Mark Abraham about 2 years ago

Consolidate and fix logic for mdrun -nb and -gpu_id

Several aspects of task assignment did not work as well as they should.

If gmx mdrun -gpu_id 01 is intended to specify that work run on those
GPUs, then if e.g. the tpr uses the Group scheme, mdrun should
refuse to run, just like it does for mdrun -nb gpu. Now it does
refuse.

gmx mdrun -nb cpu -gpu_id 01 should always give a fatal error, and now
does.

CUDA_VISIBLE_DEVICES="" gmx mdrun -nb gpu and the same with -gpu_id
should give a fatal error, and now does.

After this change, if the user has required short-ranged work on a GPU
with -nb gpu, or made an explicit GPU task assignment with -gpu_id
without using -nb cpu, exit quickly unless GPU support is compiled,
the Verlet scheme is active, and GPUs were found.
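
The rules in this commit message can be sketched as a small validation routine. This is only an illustration of the stated logic, not the code from the revision: the function name, the FatalError class, the parameter names and the error texts are all assumptions.

```python
from typing import Optional


class FatalError(Exception):
    """Stand-in for an mdrun fatal error (hypothetical)."""


def check_gpu_request(nb: str, gpu_id: Optional[str],
                      gpu_support_compiled: bool,
                      verlet_scheme: bool,
                      compatible_gpus_found: bool) -> bool:
    """Return True if the run must use GPUs, False if nothing required them.

    Raises FatalError when the user's request cannot be satisfied.
    """
    # -nb cpu together with -gpu_id is contradictory and always fatal.
    if nb == "cpu" and gpu_id:
        raise FatalError("-nb cpu conflicts with an explicit -gpu_id")

    # Either -nb gpu or an explicit -gpu_id counts as requiring GPUs.
    gpu_required = nb == "gpu" or bool(gpu_id)
    if not gpu_required:
        return False

    # Exit quickly unless all three preconditions hold.
    if not gpu_support_compiled:
        raise FatalError("GPU support was not compiled in")
    if not verlet_scheme:
        raise FatalError("GPUs require the Verlet scheme")
    if not compatible_gpus_found:
        raise FatalError("no compatible GPUs were found")
    return True
```

For example, check_gpu_request("gpu", None, True, True, False) raises FatalError, which corresponds to the CUDA_VISIBLE_DEVICES="" case above, while check_gpu_request("auto", None, True, True, False) returns False and leaves the CPU fallback available.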

Introduced a helper function for whether compatible GPUs have been
found, to help improve encapsulation and readability.

Removed hack from mdrun integration tests that coped with early
implementations of -gpu_id, which is no longer needed.

Fixes #2067

Change-Id: Ic5091edc892b0fcb0371720a5000b80019b5b3d2

Revision f9d71af1 (diff)
Added by Mark Abraham about 2 years ago

Annotate test cases that cannot run on GPUs

Jenkins runs gmxtest.pl in post-submit testing by specifying values
for -gpu_id for it to use. mdrun now interprets -gpu_id as requiring a
GPU run, and gives a fatal error if a group-scheme tpr is supplied.

The group-scheme test cases are now updated with an annotation (like
max-mpi-ranks) in the cases where GPUs are not supported (i.e. the group
scheme), so the runner can know to omit specifying an erroneous
-gpu_id. Once everything runs with the Verlet scheme, then these
annotations will no longer be required, and will disappear naturally.

The feature where test cases that did run on a GPU are then run again
with -nb cpu is modified so that those repeat runs do not specify an
erroneous -gpu_id.
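
The runner-side rule described above can be illustrated with a short sketch. gmxtest.pl itself is a Perl script and this is not its code; the function name, the case_supports_gpu flag (standing in for the new annotation) and the force_cpu_nonbonded flag (standing in for the -nb cpu repeat run) are hypothetical.

```python
from typing import List, Optional


def mdrun_args(gpu_id: Optional[str], case_supports_gpu: bool,
               force_cpu_nonbonded: bool = False) -> List[str]:
    """Build mdrun arguments, passing -gpu_id only where it is valid."""
    args = ["mdrun"]
    if force_cpu_nonbonded:
        # Repeat run of a GPU-capable case on the CPU.
        args += ["-nb", "cpu"]
    # Omit -gpu_id for annotated (e.g. group-scheme) cases and for the
    # -nb cpu repeat runs, since mdrun now treats it as requiring GPUs.
    if gpu_id is not None and case_supports_gpu and not force_cpu_nonbonded:
        args += ["-gpu_id", gpu_id]
    return args
```

With this rule, a GPU-capable case run as mdrun_args("0", True) keeps -gpu_id 0, whereas an annotated case or a repeat run with force_cpu_nonbonded=True never receives it.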

Refs #2067

Change-Id: Ia16a59f57be0770be921b072b8f51639a9cc6dc0

History

#1 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '1' for Issue #2067.
Uploader: Szilárd Páll
Change-Id: gromacs~release-5-1~I8b74de07151a0dd2e029a50a9d7e32184c8cf1d8
Gerrit URL: https://gerrit.gromacs.org/6440

#2 Updated by Mark Abraham over 2 years ago

I think the behaviour should be an error in all three cases.

However, since the misbehaviour can't affect simulation correctness, I think the appropriate place to attempt to fix this is the release-2016 branch.

#3 Updated by Szilárd Páll over 2 years ago

  • Assignee set to Szilárd Páll
  • Target version set to 2016.2

Mark Abraham wrote:

I think the behaviour should be an error in all three cases.

OK. The proposed implementation does that. We'll need to remove '-gpu_id N' from the CPU runs before this can pass regressiontests, though (third case).

However, since the misbehaviour can't affect simulation correctness, I think the appropriate place to attempt to fix this is the release-2016 branch.

Sure, I can move it to the current release.

#4 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '1' for Issue #2067.
Uploader: Szilárd Páll
Change-Id: gromacs~release-2016~I8b74de07151a0dd2e029a50a9d7e32184c8cf1d8
Gerrit URL: https://gerrit.gromacs.org/6451

#5 Updated by Mark Abraham over 2 years ago

Szilard's fix on release-2016 does produce a suitable error for the first two cases.

With that (and perhaps before?), the third case does produce an error, but obviously not an optimal one:

  gmx mdrun -nb cpu -gpu_id 0 -deffnm nonanol_vacuo

Running on 1 node with total 4 cores, 8 logical cores (GPU detection deactivated)
Hardware detected:
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
    SIMD instructions most likely to fit this hardware: AVX_256
    SIMD instructions selected at GROMACS compile time: AVX_256

  Hardware topology: Full, with devices

Reading file nonanol_vacuo.tpr, VERSION 5.1.3 (single precision)
Note: file tpx version 103, software tpx version 110
Can not increase nstlist because an NVE ensemble is used
Using 1 MPI thread
Using 8 OpenMP threads 

-------------------------------------------------------
Program:     gmx mdrun, version 2016.2-dev-20170203-ba59db75b
Source file: src/gromacs/hardware/detecthardware.cpp (line 375)

Fatal error:
Can't use the requested GPU device(s) (passed IDs: 0), no suitable devices
were detected!

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

#6 Updated by Mark Abraham over 2 years ago

  • Target version changed from 2016.2 to 2016.3

Since our test harnesses depend on at least some of this behaviour that we now consider buggy, we'll have to take some more time over considering how to fix anything.

#7 Updated by Szilárd Páll over 2 years ago

Mark Abraham wrote:

Since our test harnesses depend on at least some of this behaviour that we now consider buggy, we'll have to take some more time over considering how to fix anything.

I agree. It is perhaps even worth considering whether to fix anything in the release branch at all, instead of just documenting the peculiar mdrun behavior with respect to GPU options (i.e. that when GPU detection fails or HW compatibility/sanity checks are not passed, -nb and -gpu_id are simply ignored).

#8 Updated by Mark Abraham over 2 years ago

  • Target version changed from 2016.3 to 2016.4

It will be impractical to consider fixing this for 2016.3.

#9 Updated by Mark Abraham over 2 years ago

  • Related to Bug #2004: parallelism selection code needs work added

#10 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '1' for Issue #2067.
Uploader: Mark Abraham
Change-Id: gromacs~master~Ic5091edc892b0fcb0371720a5000b80019b5b3d2
Gerrit URL: https://gerrit.gromacs.org/6721

#11 Updated by Mark Abraham about 2 years ago

  • Status changed from New to Resolved

#12 Updated by Mark Abraham about 2 years ago

  • Assignee changed from Szilárd Páll to Mark Abraham
  • Target version changed from 2016.4 to 2018

I don't think it is practical / safe to try to fix this in release 2016.

#13 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '1' for Issue #2067.
Uploader: Mark Abraham
Change-Id: regressiontests~master~Ia16a59f57be0770be921b072b8f51639a9cc6dc0
Gerrit URL: https://gerrit.gromacs.org/6735

#14 Updated by Mark Abraham about 2 years ago

I think this is fixed for 2017, but probably worth testing specifically during the beta phase.

#15 Updated by Erik Lindahl over 1 year ago

  • Status changed from Resolved to Closed
