
Bug #3178

Fatal Error when launching mdrun on host with busy/unavailable GPU(s)

Added by Artem Shekhovtsov 10 months ago. Updated 5 months ago.

Status:
Closed
Priority:
Normal
Category:
mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized

Description

Launching mdrun in a configuration that does not require GPUs exits with a fatal error if at least one GPU on the host is busy at the time.

gmx grompp -f test.mdp -c spc216.gro -p topol.top -o test.tpr
gmx mdrun -deffnm test -ntmpi 1 -ntomp 1 -nb cpu -bonded cpu

Result:
-------------------------------------------------------
Program: gmx mdrun, version 2019.2
Source file: src/gromacs/gpu_utils/gpu_utils.cu (line 100)

Fatal error:
cudaFuncGetAttributes failed: all CUDA-capable devices are busy or
unavailable

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

I see this error in 2019.2, 2019.3, and 2020-beta.
Version 2018.6 is not affected.
All versions were built with the same flags:
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DREGRESSIONTEST_DOWNLOAD=ON -DGMX_GPU=on -DCMAKE_INSTALL_PREFIX=/data/user/shehovtsov/SOFTWARE/GROMACS/2019.2_test

#report #request

test.mdp (2.37 KB) test.mdp Artem Shekhovtsov, 10/25/2019 04:48 PM
spc216.gro (28.6 KB) spc216.gro Artem Shekhovtsov, 10/25/2019 04:48 PM
topol.top (368 Bytes) topol.top Artem Shekhovtsov, 10/25/2019 04:48 PM
test.log (3.65 KB) test.log Artem Shekhovtsov, 10/25/2019 04:49 PM
check_log (333 KB) check_log Artem Shekhovtsov, 10/25/2019 04:52 PM
cmake_log (12.2 KB) cmake_log Artem Shekhovtsov, 10/25/2019 04:52 PM
make_log (873 KB) make_log Artem Shekhovtsov, 10/25/2019 04:52 PM
printenv (5.34 KB) printenv Artem Shekhovtsov, 10/25/2019 04:55 PM

Related issues

Related to GROMACS - Bug #3399: out of memory errors cause abort during GPU detection (Closed)

Associated revisions

Revision 0a9b0ba7 (diff)
Added by Szilárd Páll 6 months ago

Avoid mdrun terminating due to GPU sanity check errors

When a GPU is in exclusive or prohibited compute mode, early detection
calls can fail, and as a result an mdrun run aborts with an error, even
if all GPU offload is explicitly disabled by the user.
This change adds a status code to handle the case of devices being
unavailable.

Additionally, other errors may be encountered during the dummy-kernel
sanity check (e.g. out of memory), but since the change that switched
to using the launchGpuKernel() wrapper did not handle the exception in
the sanity checking, this could also abort a run even if the GPU in
question was not selected to be used.
This change adds code to catch this exception, report the error, and
avoid aborting the run.

Fixes #3178 #3399

Change-Id: I0cdedbc02769084c172e4a42fe5c1af192007cec

History

#1 Updated by Szilárd Páll 10 months ago

  • Subject changed from Fatal Error when launching gromacs 2019.2 on host with GPU. to Fatal Error when launching mdrun on host with busy/unavailable GPU(s)

This is caused by the sanity checking expecting cudaSuccess to be returned by all API calls whereas some of the sanity checking steps may be prevented by devices being in exclusive mode.

#2 Updated by Szilárd Páll 10 months ago

  • Category set to mdrun
  • Status changed from New to Accepted
  • Assignee set to Szilárd Páll
  • Target version set to 2019.5

#3 Updated by Paul Bauer 8 months ago

  • Target version changed from 2019.5 to 2020

unlikely to be fixed for 2019.5, more likely to be in 2020 or 2020.1

#4 Updated by Artem Zhmurov 8 months ago

I uploaded a fix: https://gerrit.gromacs.org/#/c/gromacs/+/14921/ Any ideas how we can test it?

#5 Updated by Mark Abraham 8 months ago

Artem Zhmurov wrote:

I uploaded a fix: https://gerrit.gromacs.org/#/c/gromacs/+/14921/ Any ideas how we can test it?

Put a device in exclusive mode and run another mdrun on it?

#6 Updated by Paul Bauer 8 months ago

  • Target version changed from 2020 to 2020.1

I don't think this will make 2020

#7 Updated by Artem Zhmurov 8 months ago

My fix (https://gerrit.gromacs.org/#/c/gromacs/+/14921/) does not work. If I set the GPU to EXCLUSIVE_PROCESS mode (using nvidia-smi --id=0 --compute-mode=EXCLUSIVE_PROCESS) and try to run two instances of mdrun, the second one fails with std::bad_alloc, even if all the flags are set to CPU.

#8 Updated by Szilárd Páll 7 months ago

Artem Zhmurov wrote:

My fix (https://gerrit.gromacs.org/#/c/gromacs/+/14921/) does not work. If I set the GPU to EXCLUSIVE_PROCESS mode (using nvidia-smi --id=0 --compute-mode=EXCLUSIVE_PROCESS) and try to run two instances of mdrun, the second one fails with std::bad_alloc, even if all the flags are set to CPU.

Where is the std::bad_alloc originating from?

#9 Updated by Szilárd Páll 6 months ago

  • Related to Bug #3399: out of memory errors cause abort during GPU detection added

#10 Updated by Szilárd Páll 6 months ago

  • Description updated (diff)
  • Status changed from Accepted to In Progress

#11 Updated by Szilárd Páll 6 months ago

  • Status changed from In Progress to Resolved

#12 Updated by Paul Bauer 6 months ago

  • Status changed from Resolved to Closed

#13 Updated by Szilárd Páll 5 months ago

  • Description updated (diff)
