Bug #2415

incorrect runtime assertion catches CUDA API errors from GPU sanity checking

Added by Szilárd Páll over 1 year ago. Updated 9 months ago.

Status: Closed
Priority: Normal
Category: core library
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

The compatibility/sanity checking implemented in is_gmx_supported_gpu_id() leaves the last error in the CUDA runtime status when the checks are interrupted by an API error and an "insane" state is reported. However, the caller, findGpus(), asserts at runtime on the API state, so it catches and aborts on errors that should not be fatal.

As a result, runs that detect an error during GPU detection will abort instead of skipping the device(s) that can't be used.
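The failure mode described above can be sketched with a small model. This is not GROMACS or CUDA code: `FakeRuntime`, `deviceLooksSane()` and `apiStateIsClean()` are hypothetical names that only imitate a runtime whose last error stays set until it is explicitly fetched, which is how `cudaPeekAtLastError()` (read only) differs from `cudaGetLastError()` (read and reset).

```cpp
#include <cassert>

// Hypothetical stand-in for the CUDA runtime's "last error" slot.
// None of these names are CUDA or GROMACS API; they only model the behaviour.
enum class ApiStatus { Success, Error };

struct FakeRuntime
{
    ApiStatus lastError = ApiStatus::Success;

    // Models an API call that fails: the error is returned AND remembered.
    ApiStatus failingCall()
    {
        lastError = ApiStatus::Error;
        return lastError;
    }

    // Models cudaPeekAtLastError(): reads the slot without resetting it.
    ApiStatus peekAtLastError() const { return lastError; }
};

// Models a per-device sanity check that stops at the first API error.
// It correctly reports "not usable" but leaves the error in the runtime.
bool deviceLooksSane(FakeRuntime& rt, bool deviceIsBroken)
{
    if (deviceIsBroken && rt.failingCall() != ApiStatus::Success)
    {
        return false;
    }
    return true;
}

// Models the caller's post-condition "the API state is clean after detection".
// Returns the result instead of aborting so the effect can be inspected.
bool apiStateIsClean(const FakeRuntime& rt)
{
    return rt.peekAtLastError() == ApiStatus::Success;
}
```

In this model, checking a broken device returns `false` as intended, but `apiStateIsClean()` is also `false` afterwards, so a caller that asserts on it would abort even though the device could simply have been skipped.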

Associated revisions

Revision 74400c15 (diff)
Added by Szilárd Páll over 1 year ago

Avoid aborting mdrun when GPU sanity check detects errors

A release assertion was added which assumed that the GPU
compatibility/sanity checks return with a clean CUDA API state.
Consequently, any run that encountered a non-success return value from
the CUDA API would abort the run instead of continuing the run without
using the GPU in question.
This change adds code to handle the error encountered and issue a note
about it, and ensures that the CUDA API error state is cleared
when GPU detection returns.

Fixes #2415

Change-Id: I5d7ed59ef8e4052a75b51c9a526b8dcb465ff611
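
The approach in this commit message (handle the error, issue a note, clear the API state, continue without that GPU) might look roughly like the sketch below. It is an illustrative model under stated assumptions, not the code from the revision: `FakeRuntime`, `DetectionResult` and `detectDevices()` are hypothetical, and `getLastError()` only imitates the read-and-reset behaviour of `cudaGetLastError()`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of a runtime with a "last error" slot (not CUDA API).
enum class ApiStatus { Success, Error };

struct FakeRuntime
{
    ApiStatus lastError = ApiStatus::Success;

    // Models a failed API call leaving its error behind.
    void recordFailure() { lastError = ApiStatus::Error; }

    // Models cudaGetLastError(): returns the stored error and resets the slot.
    ApiStatus getLastError()
    {
        ApiStatus s = lastError;
        lastError   = ApiStatus::Success;
        return s;
    }
};

struct DetectionResult
{
    std::vector<int>         usableDevices; // ids that passed the check
    std::vector<std::string> notes;         // one note per skipped device
};

// deviceIsBroken[i] == true means the sanity check on device i hits an API error.
DetectionResult detectDevices(FakeRuntime& rt, const std::vector<bool>& deviceIsBroken)
{
    DetectionResult result;
    for (int id = 0; id < static_cast<int>(deviceIsBroken.size()); ++id)
    {
        if (deviceIsBroken[id])
        {
            rt.recordFailure(); // the sanity check failed and left an error behind
        }
        // Fetch and reset, so the error cannot leak past this device.
        if (rt.getLastError() != ApiStatus::Success)
        {
            result.notes.push_back("Note: skipping device " + std::to_string(id)
                                   + " after an API error during the sanity check");
            continue;
        }
        result.usableDevices.push_back(id);
    }
    return result;
}
```

With one broken and one working device, the model reports only the working device as usable, records a single note, and leaves the error slot clean, so a later check of the API state no longer fails.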

Revision 6a897857 (diff)
Added by Szilárd Páll about 1 year ago

Improve GPU detection sanity check error message

When the unexpected condition is triggered, extra information is now
printed about the type of error left behind after a successful detection
of a compatible GPU, to help identify issues.

Refs #2415

Change-Id: I85e0da4c339df8184aa2dec49440ce2d0e83e8bf

History

#1 Updated by Szilárd Páll over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Szilárd Páll

#2 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related DRAFT patchset '1' for Issue #2415.
Uploader: Szilárd Páll
Change-Id: gromacs~master~I5d7ed59ef8e4052a75b51c9a526b8dcb465ff611
Gerrit URL: https://gerrit.gromacs.org/7594

#3 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related patchset '1' for Issue #2415.
Uploader: Berk Hess
Change-Id: gromacs~release-2018~I5d7ed59ef8e4052a75b51c9a526b8dcb465ff611
Gerrit URL: https://gerrit.gromacs.org/7595

#4 Updated by Szilárd Páll over 1 year ago

  • Status changed from In Progress to Resolved

#5 Updated by Mark Abraham over 1 year ago

  • Status changed from Resolved to Closed

#6 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related patchset '1' for Issue #2415.
Uploader: Mark Abraham
Change-Id: gromacs~release-2018~Idc28c7e89a5f08ee1d19943e3663385f2b23ff44
Gerrit URL: https://gerrit.gromacs.org/7621

#7 Updated by Mark Abraham about 1 year ago

  • Category set to core library
  • Status changed from Closed to Accepted
  • Target version changed from 2018.1 to 2018.3
  • Affected version changed from 2018 to 2018.2

Jia Hong on gmx-users is observing failure of the assertion added in the previous fix of this issue, in a case where GPU 0 is too old and GPU 1 is new, so something probably needs work.

#8 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related patchset '2' for Issue #2415.
Uploader: Szilárd Páll
Change-Id: gromacs~release-2018~I85e0da4c339df8184aa2dec49440ce2d0e83e8bf
Gerrit URL: https://gerrit.gromacs.org/8172

#9 Updated by Szilárd Páll about 1 year ago

  • Status changed from Accepted to Blocked, need info

No matter what I tried, I could not reproduce the error, and the user report is incomplete, so this needs more info.

#10 Updated by Paul Bauer about 1 year ago

  • Target version changed from 2018.3 to 2018.4

Moving this to the next point release then

#11 Updated by Paul Bauer 10 months ago

Szilard, do you agree on closing this then?

#12 Updated by Szilárd Páll 9 months ago

  • Status changed from Blocked, need info to Closed

Paul Bauer wrote:

Szilard, do you agree on closing this then?

Yes.
