incorrect runtime assertion catches CUDA API errors from GPU sanity checking
The compatibility/sanity checking implemented in is_gmx_supported_gpu_id() leaves the last error in the CUDA runtime status when the checks are interrupted by an API error and an "insane" state is reported. However, the caller,
findGpus(), runtime-asserts on the API state, which means that it will catch and abort on errors that should not be fatal.
As a result, runs that detect an error during GPU detection will abort instead of skipping the device(s) that can't be used.
Avoid aborting mdrun when GPU sanity check detects errors
A release assertion was added which assumed that the GPU
compatibility/sanity checks return with a clean CUDA API state.
Consequently, any run that encountered a non-success return value from
the CUDA API would abort instead of continuing without using the GPU
in question.
This change adds code to handle the error encountered and issue a note
about it, and ensures that the CUDA API error state is cleared when
GPU detection returns.
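
The behaviour described in this commit message can be sketched with a small stand-alone model. This is not the GROMACS code and it does not call the CUDA runtime: `ApiStatus`, `g_lastError`, `getAndClearLastError()`, `MockDevice`, `deviceIsSane()` and `findUsableDevices()` are all hypothetical names invented for illustration. The only CUDA behaviour it assumes is that the runtime keeps a "last error" that stays set until it is read and reset, as `cudaGetLastError()` does.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the CUDA runtime's sticky "last error" slot.
// None of these names exist in GROMACS or in the CUDA API.
enum class ApiStatus { Success, InvalidDeviceFunction };

static ApiStatus g_lastError = ApiStatus::Success;

// Returns the pending error and resets the slot to Success, mirroring the
// read-and-reset behaviour of cudaGetLastError().
ApiStatus getAndClearLastError()
{
    const ApiStatus pending = g_lastError;
    g_lastError             = ApiStatus::Success;
    return pending;
}

const char* statusName(ApiStatus s)
{
    return s == ApiStatus::Success ? "Success" : "InvalidDeviceFunction";
}

// Assumption for the model: a device either can or cannot run a test kernel.
struct MockDevice
{
    int  id;
    bool canRunKernels;
};

// Models a sanity check that is interrupted by an API error: it reports the
// device as unusable and leaves the error behind in the runtime state.
bool deviceIsSane(const MockDevice& device)
{
    if (!device.canRunKernels)
    {
        g_lastError = ApiStatus::InvalidDeviceFunction;
        return false;
    }
    return true;
}

// Detection loop: instead of asserting that the API state is clean, read and
// clear the error after each check, record a note, and skip the device.
std::vector<int> findUsableDevices(const std::vector<MockDevice>& devices,
                                   std::vector<std::string>*      notes)
{
    std::vector<int> usable;
    for (const MockDevice& device : devices)
    {
        const bool      sane    = deviceIsSane(device);
        const ApiStatus pending = getAndClearLastError();
        if (!sane)
        {
            notes->push_back("Note: skipping GPU " + std::to_string(device.id)
                             + ", sanity check failed with "
                             + statusName(pending));
            continue;
        }
        usable.push_back(device.id);
    }
    return usable;
}
```

Calling `findUsableDevices({{0, false}, {1, true}}, &notes)` in this model returns only device 1 and records one note for device 0, and the error slot is clean afterwards, so a later check on the API state would not fire.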
Improve GPU detection sanity check error message
When the unexpected condition is triggered, some extra information on
the type of error left behind after the successful detection of a
compatible GPU is now printed to aid with identifying issues.
#7 Updated by Mark Abraham 8 months ago
- Category set to core library
- Status changed from Closed to Accepted
- Target version changed from 2018.1 to 2018.3
- Affected version changed from 2018 to 2018.2
Jia Hong on gmx-users is observing failure of the assertion added in the previous fix of this issue, in a case where GPU 0 is too old and GPU 1 is new, so something probably needs work.