Bug #3178

Fatal Error when launching mdrun on host with busy/unavailable GPU(s)

Added by Artem Shekhovtsov 3 months ago. Updated 18 days ago.

Status: Accepted
Priority: Normal
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

A launch of mdrun that does not require any GPUs exits with a fatal error if at least one GPU on the host is busy at that time.

gmx grompp -f test.mdp -c spc216.gro -p topol.top -o test.tpr
gmx mdrun -deffnm test -ntmpi 1 -ntomp 1 -nb cpu -bonded cpu

Result:
-------------------------------------------------------
Program: gmx mdrun, version 2019.2
Source file: src/gromacs/gpu_utils/gpu_utils.cu (line 100)

Fatal error:
cudaFuncGetAttributes failed: all CUDA-capable devices are busy or
unavailable

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

I see this error in 2019.2, 2019.3, and 2020-beta. Version 2018.6 is not affected.
All versions were built with the same flags:
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DREGRESSIONTEST_DOWNLOAD=ON -DGMX_GPU=on -DCMAKE_INSTALL_PREFIX=/data/user/shehovtsov/SOFTWARE/GROMACS/2019.2_test

test.mdp (2.37 KB) test.mdp Artem Shekhovtsov, 10/25/2019 04:48 PM
spc216.gro (28.6 KB) spc216.gro Artem Shekhovtsov, 10/25/2019 04:48 PM
topol.top (368 Bytes) topol.top Artem Shekhovtsov, 10/25/2019 04:48 PM
test.log (3.65 KB) test.log Artem Shekhovtsov, 10/25/2019 04:49 PM
check_log (333 KB) check_log Artem Shekhovtsov, 10/25/2019 04:52 PM
cmake_log (12.2 KB) cmake_log Artem Shekhovtsov, 10/25/2019 04:52 PM
make_log (873 KB) make_log Artem Shekhovtsov, 10/25/2019 04:52 PM
printenv (5.34 KB) printenv Artem Shekhovtsov, 10/25/2019 04:55 PM

History

#1 Updated by Szilárd Páll 3 months ago

  • Subject changed from Fatal Error when launching gromacs 2019.2 on host with GPU. to Fatal Error when launching mdrun on host with busy/unavailable GPU(s)

This is caused by the sanity check expecting cudaSuccess to be returned by all API calls, whereas some of the sanity-checking steps may be prevented from completing when a device is in exclusive mode.
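
A minimal sketch of the idea, for illustration only (this is not the actual gpu_utils.cu code; DeviceStatus, checkDevice and the program structure are made-up names, while the CUDA runtime calls and the cudaErrorDevicesUnavailable error code are real): a per-device check that reports "busy/unavailable" instead of escalating it to a fatal error.

#include <cstdio>
#include <cuda_runtime.h>

// Dummy kernel, used only to query attributes during the sanity check.
__global__ void dummy_kernel() {}

enum class DeviceStatus { Compatible, Unavailable, Broken };

// Probe one device. On a device in exclusive compute mode that is occupied by
// another process, the first context-creating call (here cudaFuncGetAttributes)
// returns cudaErrorDevicesUnavailable; treat that as "skip", not as fatal.
static DeviceStatus checkDevice(int deviceId)
{
    if (cudaSetDevice(deviceId) != cudaSuccess)
    {
        return DeviceStatus::Broken;
    }
    cudaFuncAttributes attr;
    const cudaError_t  stat = cudaFuncGetAttributes(&attr, dummy_kernel);
    if (stat == cudaErrorDevicesUnavailable)
    {
        return DeviceStatus::Unavailable;
    }
    return (stat == cudaSuccess) ? DeviceStatus::Compatible : DeviceStatus::Broken;
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess)
    {
        std::printf("no usable CUDA devices\n");
        return 0;
    }
    for (int id = 0; id < count; ++id)
    {
        const DeviceStatus s = checkDevice(id);
        std::printf("device %d: %s\n", id,
                    s == DeviceStatus::Compatible    ? "compatible"
                    : s == DeviceStatus::Unavailable ? "busy/unavailable (skipped)"
                                                     : "broken");
    }
    return 0;
}

In a real detection pass, the per-device status would then feed device selection, so a busy device is simply not used and a CPU-only run can proceed.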

#2 Updated by Szilárd Páll 3 months ago

  • Category set to mdrun
  • Status changed from New to Accepted
  • Assignee set to Szilárd Páll
  • Target version set to 2019.5

#3 Updated by Paul Bauer about 1 month ago

  • Target version changed from 2019.5 to 2020

Unlikely to be fixed for 2019.5; more likely to land in 2020 or 2020.1.

#4 Updated by Artem Zhmurov about 1 month ago

I uploaded a fix (https://gerrit.gromacs.org/#/c/gromacs/+/14921/). Any ideas on how we can test it?

#5 Updated by Mark Abraham about 1 month ago

Artem Zhmurov wrote:

I uploaded a fix (https://gerrit.gromacs.org/#/c/gromacs/+/14921/). Any ideas on how we can test it?

Put a device in exclusive mode and run another mdrun on it?

#6 Updated by Paul Bauer about 1 month ago

  • Target version changed from 2020 to 2020.1

I don't think this will make 2020

#7 Updated by Artem Zhmurov 28 days ago

My fix (https://gerrit.gromacs.org/#/c/gromacs/+/14921/) does not work. If I set the GPU to EXCLUSIVE_PROCESS mode (using nvidia-smi --id=0 --compute-mode=EXCLUSIVE_PROCESS) and try to run two instances of mdrun, the second one fails with std::bad_alloc, even if all the offload flags are set to cpu.

#8 Updated by Szilárd Páll 18 days ago

Artem Zhmurov wrote:

My fix (https://gerrit.gromacs.org/#/c/gromacs/+/14921/) does not work. If I set the GPU to EXCLUSIVE_PROCESS mode (using nvidia-smi --id=0 --compute-mode=EXCLUSIVE_PROCESS) and try to run two instances of mdrun, the second one fails with std::bad_alloc, even if all the offload flags are set to cpu.

Where is the std::bad_alloc originating from?
