mdrun ignores GPUs being requested if detection fails or is skipped
If the detection either fails (driver not loaded, no devices present, etc.) or is disabled (by environment variable or
-nb cpu), mdrun ignores all GPU requests on the command line and falls back to the CPU.
CUDA_VISIBLE_DEVICES= $gmx mdrun -gpu_id 1 # GPU ID 1 requested, none detected, but CPU kernels used
CUDA_VISIBLE_DEVICES= $gmx mdrun -nb gpu # GPU non-bondeds requested, no GPU detected, but CPU kernels used
The above two cases should definitely result in errors, as the user obviously requested that GPUs be used.
Additionally, if we consider passing -gpu_id to be a non-optional requirement that a GPU be used, then the following should possibly also issue an error, but this is more a matter of interpretation:
gmx mdrun -gpu_id 0 -nb cpu # CPU run requested, GPU ID is ignored
Consolidate and fix logic for mdrun -nb and -gpuid
Several aspects of task assignment did not work as well as they should.
If gmx mdrun -gpu_id 01 is intended to specify that work runs on those
GPUs, then, e.g., if the tpr uses the group scheme, mdrun should
refuse to run, just as it does for mdrun -nb gpu. Now it does.
gmx mdrun -nb cpu -gpu_id 01 should always give a fatal error, and now
does.
CUDA_VISIBLE_DEVICES="" gmx mdrun -nb gpu and the same with -gpu_id
should give a fatal error, and now does.
After this change, if the user has required short-ranged work on a GPU
with -nb gpu, or made an explicit GPU task assignment with -gpu_id
without using -nb cpu, exit quickly unless GPU support is compiled,
the Verlet scheme is active, and GPUs were found.
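The rule above can be modelled as a small decision function. This is an illustrative Python sketch under stated assumptions, not the mdrun implementation: the name check_gpu_request, its parameters, and the error strings are all hypothetical, and nb is assumed to take "cpu", "gpu", or some other value meaning no explicit request.

```python
from typing import Optional


def check_gpu_request(nb: str,
                      gpu_id: Optional[str],
                      gpu_support_compiled: bool,
                      verlet_scheme: bool,
                      gpus_found: bool) -> Optional[str]:
    """Return an error message if the run should stop, else None.

    Hypothetical model of the rule in the commit message: `nb` mirrors
    the -nb option and `gpu_id` mirrors -gpu_id (None when not given).
    """
    # -nb cpu combined with -gpu_id is treated as contradictory.
    if nb == "cpu" and gpu_id is not None:
        return "-gpu_id was given together with -nb cpu"

    # A GPU is required if -nb gpu was given or -gpu_id was given.
    gpu_required = nb == "gpu" or gpu_id is not None
    if not gpu_required:
        return None  # nothing demanded a GPU, so a CPU run is fine

    # All three conditions must hold, otherwise exit early.
    if not gpu_support_compiled:
        return "GPU requested, but GPU support is not compiled in"
    if not verlet_scheme:
        return "GPU requested, but the tpr does not use the Verlet scheme"
    if not gpus_found:
        return "GPU requested, but no compatible GPUs were found"
    return None
```

With this model, the three cases from the issue description (-gpu_id with no GPU detected, -nb gpu with no GPU detected, and -nb cpu with -gpu_id) all return an error message, while a plain CPU run returns None.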
Introduced a helper function for whether compatible GPUs have been
found, to help improve encapsulation and readability.
Removed hack from mdrun integration tests that coped with early
implementations of -gpu_id, which is no longer needed.
Annotate test cases that cannot run on GPUs
Jenkins runs gmxtest.pl in post-submit testing by specifying values
for -gpu_id for it to use. mdrun now interprets -gpu_id as requiring a
GPU run, and gives a fatal error if a group-scheme tpr is supplied.
The group-scheme test cases are now updated with an annotation (like
max-mpi-ranks) in the cases where GPUs are not supported (i.e. group
scheme), so the runner can know to omit specifying an erroneous
-gpu_id. Once everything runs with the Verlet scheme, then these
annotations will no longer be required, and will disappear naturally.
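How a runner might act on such an annotation can be sketched as follows. This is only an illustration in Python: gmxtest.pl is a Perl script whose actual logic and annotation name are not shown here, so build_mdrun_args and the case_supports_gpu flag are hypothetical stand-ins.

```python
from typing import List, Optional


def build_mdrun_args(requested_gpu_id: Optional[str],
                     case_supports_gpu: bool) -> List[str]:
    """Assemble mdrun arguments for one test case.

    `requested_gpu_id` is the value the CI job asked the runner to use
    (None if it gave none). `case_supports_gpu` is False for cases that
    carry the hypothetical "cannot run on GPUs" annotation.
    """
    args = ["mdrun"]
    # Only pass -gpu_id when the case can actually run on a GPU, so
    # that group-scheme cases do not trigger the new fatal error.
    if requested_gpu_id is not None and case_supports_gpu:
        args += ["-gpu_id", requested_gpu_id]
    return args
```

For an annotated group-scheme case, build_mdrun_args("0", False) leaves -gpu_id out, so the run proceeds on the CPU instead of stopping with a fatal error.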
The feature where test cases that did run on a GPU are then run again
with -nb cpu is modified so that those repeat runs do not specify an
#3 Updated by Szilárd Páll over 3 years ago
- Assignee set to Szilárd Páll
- Target version set to 2016.2
Mark Abraham wrote:
I think the behaviour should be an error in all three cases.
OK. The proposed implementation does that. We'll need to remove '-gpu_id N' from the CPU runs before this can pass regressiontests, though (third case).
However, since the misbehaviour can't affect simulation correctness, I think the appropriate place to attempt to fix this is the release-2016 branch.
Sure, I can move it to the current release.
#5 Updated by Mark Abraham over 3 years ago
Szilard's fix on release-2016 does produce a suitable error for the first two cases.
With that (and perhaps before?), the third case does produce an error, but obviously not optimal:
gmx mdrun -nb cpu -gpu_id 0 -deffnm nonanol_vacuo
Running on 1 node with total 4 cores, 8 logical cores (GPU detection deactivated)
Hardware detected:
  CPU info:
    Vendor: Intel
    Brand: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
    SIMD instructions most likely to fit this hardware: AVX_256
    SIMD instructions selected at GROMACS compile time: AVX_256
  Hardware topology: Full, with devices
Reading file nonanol_vacuo.tpr, VERSION 5.1.3 (single precision)
Note: file tpx version 103, software tpx version 110
Can not increase nstlist because an NVE ensemble is used
Using 1 MPI thread
Using 8 OpenMP threads
-------------------------------------------------------
Program: gmx mdrun, version 2016.2-dev-20170203-ba59db75b
Source file: src/gromacs/hardware/detecthardware.cpp (line 375)
Fatal error:
Can't use the requested GPU device(s) (passed IDs: 0), no suitable devices were detected!
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
#7 Updated by Szilárd Páll over 3 years ago
Mark Abraham wrote:
Since our test harnesses depend on at least some of this behaviour that we now consider buggy, we'll have to take some more time over considering how to fix anything.
I agree. Perhaps even worth considering whether to fix anything in the release branch instead of just documenting the peculiar mdrun behavior wrt GPU options (i.e. that when the GPU detection fails or HW compatibility/sanity checks are not passed, -nb and -gpu_id are simply ignored).