improve GPU error handling and make such handling uniform across modules
We have made a start on upgrading the GPU infrastructure so that it works similarly across CUDA and OpenCL, and eventually across both new and old modules. However, that work is incomplete and inconsistently used across modules, which is surely confusing and complex for people new to the GPU code (e.g. Artem, Jon, Alan, Roland). Sometimes we assert, sometimes we throw, sometimes we report error codes and strings from the underlying API, and sometimes we do not.
At #2762, we noticed that
```
Program:     gmx mdrun, version 2019-beta2
Source file: src/gromacs/gpu_utils/devicebuffer.cuh (line 131)
Function:    copyToDeviceBuffer(ValueType**, const ValueType*, size_t, size_t, CommandStream, GpuApiCallBehavior, CommandEvent*)::<lambda()> [with ValueType = nbnxn_cj4_t]

Assertion failed:
Condition: stat == cudaSuccess
Asynchronous H2D copy failed
```
is better because the `ValueType` of `nbnxn_cj4_t` adds useful context, but worse because the error code and string from the API are not reported, compared with
```
Program:     gmx mdrun, version 2018.3
Source file: src/gromacs/gpu_utils/cudautils.cu (line 110)

Fatal error:
HtoD cudaMemcpyAsync failed: invalid argument
```
In the longer term, it is clear that we need to transform all such error handling to throw. We will have to transition in that direction over the next year, because in API-driven runs a scrambled GPU driver that triggers an assertion or fatal error must not be able to take down the entire workflow. Control needs to transfer back to a point where something understands whether the user required GPU execution (which is now impossible), or not (in which case the simulation should resume as best it can without that GPU, or without GPUs).
We need to decide how to handle unsuccessful error codes. The underlying APIs document which error codes can be returned, but that set can change across e.g. CUDA versions, and we should respond differently to an error for which we have chosen a response (e.g. one that can only happen if we have made a coding error, so we should assert) than to one for which we haven't chosen a response (because it's an API error code that is only returned by newer versions). Some error states are recoverable within a running simulation (e.g. if we can't pin memory on a node, then the management code should trigger a fall-back to synchronous transfers); others need higher-level coordination (see above).
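The classification above could be sketched roughly as follows. This is only an illustration, not a proposed implementation: `ApiStatus`, `Response` and `classify` are hypothetical names, and the enum stands in for a real API status type such as `cudaError_t`.

```cpp
#include <cassert>

// Hypothetical stand-in for an API status type such as cudaError_t.
// UnknownNewerCode models a value added by a newer API version that we
// have not yet analysed.
enum class ApiStatus { Success, InvalidValue, MemoryAllocation, UnknownNewerCode };

// The responses discussed in the text: nothing, assert on a coding error,
// throw an exception that higher-level code can recover from, or throw
// for a code we have not chosen a response to.
enum class Response { None, AssertCodingError, ThrowRecoverable, ThrowUnknown };

// Map each status we have analysed to a chosen response; anything we have
// not analysed falls through to ThrowUnknown rather than being mishandled.
Response classify(ApiStatus s)
{
    switch (s)
    {
        case ApiStatus::Success:          return Response::None;
        case ApiStatus::InvalidValue:     return Response::AssertCodingError; // only our bug can cause this
        case ApiStatus::MemoryAllocation: return Response::ThrowRecoverable;  // legitimate runtime condition
        default:                          return Response::ThrowUnknown;
    }
}
```

The key property is that a status code we never analysed gets an explicit "unknown" response rather than silently sharing a response chosen for a different error.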
Aspects of https://github.com/gromacs/gromacs/blob/release-2019/src/gromacs/gpu_utils/pinning.cu#L68 and https://github.com/gromacs/gromacs/blob/release-2019/src/gromacs/gpu_utils/devicebuffer.cuh#L108 are useful (while needing improvement; suggestions welcome, of course), but should probably be complemented by something that catches e.g. an exception thrown by copyToDeviceBuffer and re-throws something that clarifies which transfer from which module has failed. Modern exception implementations are all low-cost when the exception is not thrown, and any alternative approach to error handling has to a) handle errors adequately in real-world large applications, b) not be slow, and c) leave the "happy path" readily understood by developers.
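The catch-and-rethrow idea might look something like this sketch. The function names and the module/transfer wording are hypothetical; the low-level function stands in for copyToDeviceBuffer, and a real version would preserve the API error code and string rather than just a message.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical low-level transfer that throws on API failure, standing in
// for copyToDeviceBuffer; 'fail' simulates the API returning an error.
void copyToDeviceBufferSketch(bool fail)
{
    if (fail)
    {
        // The message preserves the API error string, as in the 2018 output.
        throw std::runtime_error("HtoD cudaMemcpyAsync failed: invalid argument");
    }
}

// A module-level caller catches the low-level exception and re-throws with
// context saying which transfer in which module failed, so both pieces of
// information from the two error reports above are available.
void copyPairListToDevice(bool fail)
{
    try
    {
        copyToDeviceBufferSketch(fail);
    }
    catch (const std::runtime_error& ex)
    {
        throw std::runtime_error(std::string("NBNXM module: pair-list H2D transfer failed: ")
                                 + ex.what());
    }
}
```

On the happy path nothing is caught and no cost is paid; only the failure path pays for building the enriched message.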
For error states that have an appreciable chance of occurring (e.g. failure to pin), we should probably use explicit non-throwing logic that tries to pin the relevant set of buffers and, if all have succeeded, triggers the use of a code path that does asynchronous transfers. Exceptions should handle genuinely unexpected cases, not those occasioned by a coding error (use assertions for those).
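A minimal sketch of that non-throwing fallback logic, under the assumption that pinning is attempted per buffer and reports success with a plain bool (as a call like cudaHostRegister would); all names here are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical non-throwing pin attempt; returns false when the host
// memory could not be page-locked. In real code this would wrap the
// API call for the given buffer and check its status code.
bool tryPinBuffer(bool pinningAvailable)
{
    return pinningAvailable;
}

enum class TransferMode { Asynchronous, Synchronous };

// Try to pin every buffer in the set; only if all succeed do we select
// the asynchronous-transfer code path. Failure is an expected state and
// triggers a silent fallback, not an exception.
TransferMode chooseTransferMode(const std::vector<bool>& buffersPinnable)
{
    for (bool pinnable : buffersPinnable)
    {
        if (!tryPinBuffer(pinnable))
        {
            return TransferMode::Synchronous; // expected failure: fall back
        }
    }
    return TransferMode::Asynchronous;
}
```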
Once we've had some discussion here, I will draft a policy that we document (or someone else can step up), and new GPU-related code should be expected to conform to it.