Task #2675: bonded CUDA offload task
bonded force reduction
- if we schedule in a separate stream, we can:
  - use atomic ops for force accumulation (simple but likely less efficient)
  - use a separate reduction kernel after the nonbonded kernels complete (possibly use the former with DD to get the results sooner?)
Note: the current draft contrib code (https://gerrit.gromacs.org/#/c/8460/4/src/gromacs/mdlib/nbnxn_cuda/gpuBondedCUDA.cu) uses a reduction stage after each and every bonded kernel, which would be best reconsidered, but if time is short we could leave it as is.
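To make the two options concrete, here is a minimal sketch, not the contrib code. All kernel and function names (`bondedPairsAtomic`, `nonbondedStandIn`, `addForces`, `runBondedSketch`) are hypothetical, the "bonded" kernel just applies precomputed pair forces, and error checking is omitted. With `useSeparateBuffer == false` the bonded kernel accumulates with `atomicAdd` straight into the force buffer shared with the nonbonded stream. With `true` it writes to its own buffer, and a reduction kernel adds that buffer in once an event signals that the nonbonded work has finished.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for a bonded kernel: one thread per atom pair adds a
// precomputed force to atom i and its negative to atom j. atomicAdd keeps
// concurrent updates of the same atom safe.
__global__ void bondedPairsAtomic(const int2* pairs, const float3* pairForce, int nPairs, float3* f)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPairs) { return; }
    int    i  = pairs[p].x, j = pairs[p].y;
    float3 fp = pairForce[p];
    atomicAdd(&f[i].x, fp.x);  atomicAdd(&f[i].y, fp.y);  atomicAdd(&f[i].z, fp.z);
    atomicAdd(&f[j].x, -fp.x); atomicAdd(&f[j].y, -fp.y); atomicAdd(&f[j].z, -fp.z);
}

// Hypothetical stand-in for the nonbonded kernel: adds 1 to f.z of every atom,
// also with atomics so that it may share the buffer with the bonded kernel.
__global__ void nonbondedStandIn(int nAtoms, float3* f)
{
    int a = blockIdx.x * blockDim.x + threadIdx.x;
    if (a < nAtoms) { atomicAdd(&f[a].z, 1.0f); }
}

// Separate reduction step: add the bonded buffer into the main force buffer.
__global__ void addForces(const float3* fBonded, int nAtoms, float3* f)
{
    int a = blockIdx.x * blockDim.x + threadIdx.x;
    if (a < nAtoms)
    {
        f[a].x += fBonded[a].x; f[a].y += fBonded[a].y; f[a].z += fBonded[a].z;
    }
}

std::vector<float3> runBondedSketch(bool useSeparateBuffer)
{
    const int nAtoms = 3, nPairs = 2;
    std::vector<int2>   hPairs     = { make_int2(0, 1), make_int2(1, 2) };
    std::vector<float3> hPairForce = { make_float3(1.0f, 0.0f, 0.0f), make_float3(0.0f, 2.0f, 0.0f) };

    int2*   dPairs;
    float3 *dPairForce, *dF, *dFBonded;
    cudaMalloc(&dPairs, nPairs * sizeof(int2));
    cudaMalloc(&dPairForce, nPairs * sizeof(float3));
    cudaMalloc(&dF, nAtoms * sizeof(float3));
    cudaMalloc(&dFBonded, nAtoms * sizeof(float3));
    cudaMemcpy(dPairs, hPairs.data(), nPairs * sizeof(int2), cudaMemcpyHostToDevice);
    cudaMemcpy(dPairForce, hPairForce.data(), nPairs * sizeof(float3), cudaMemcpyHostToDevice);
    cudaMemset(dF, 0, nAtoms * sizeof(float3));
    cudaMemset(dFBonded, 0, nAtoms * sizeof(float3));

    cudaStream_t nbStream, bondedStream;
    cudaStreamCreate(&nbStream);
    cudaStreamCreate(&bondedStream);
    cudaEvent_t nbDone;
    cudaEventCreate(&nbDone);

    nonbondedStandIn<<<1, 32, 0, nbStream>>>(nAtoms, dF);
    if (!useSeparateBuffer)
    {
        // Option 1: both streams accumulate atomically into the same buffer.
        bondedPairsAtomic<<<1, 32, 0, bondedStream>>>(dPairs, dPairForce, nPairs, dF);
    }
    else
    {
        // Option 2: bonded forces go to their own buffer; the reduction kernel
        // runs only after the nonbonded stream has reached the recorded event.
        bondedPairsAtomic<<<1, 32, 0, bondedStream>>>(dPairs, dPairForce, nPairs, dFBonded);
        cudaEventRecord(nbDone, nbStream);
        cudaStreamWaitEvent(bondedStream, nbDone, 0);
        addForces<<<1, 32, 0, bondedStream>>>(dFBonded, nAtoms, dF);
    }
    cudaDeviceSynchronize();

    std::vector<float3> hF(nAtoms);
    cudaMemcpy(hF.data(), dF, nAtoms * sizeof(float3), cudaMemcpyDeviceToHost);

    cudaEventDestroy(nbDone);
    cudaStreamDestroy(nbStream); cudaStreamDestroy(bondedStream);
    cudaFree(dPairs); cudaFree(dPairForce); cudaFree(dF); cudaFree(dFBonded);
    return hF;
}
```

Both paths give the same forces. The trade-off is that option 1 needs no extra buffer or cross-stream dependency, while option 2 keeps the bonded output separate until the final reduction.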
Add CUDA bonded kernels
CUDA bonded kernels are added for the most common bonded and LJ-14 interactions.
The default auto settings of mdrun offload these interactions
to the GPU when possible.
Currently these interactions are computed in the local or non-local
nbnxn non-bonded streams. We should consider using a separate stream.
This change uses synchronous transfers. A child change will make
these asynchronous.
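For reference, the difference between the two transfer modes can be sketched as below. This is a generic CUDA illustration, not the code in this change. `scaleValues` and `roundTrip` are hypothetical names, the kernel is a placeholder for real work, and error checking is omitted. The asynchronous path assumes pinned host memory (`cudaMallocHost`), which `cudaMemcpyAsync` needs in order to overlap with host work.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder kernel standing in for real GPU work on the transferred data.
__global__ void scaleValues(float* x, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { x[i] *= s; }
}

std::vector<float> roundTrip(bool async)
{
    const int    n     = 4;
    const size_t bytes = n * sizeof(float);

    float* hX;
    cudaMallocHost(&hX, bytes); // pinned host buffer
    for (int i = 0; i < n; i++) { hX[i] = static_cast<float>(i); }

    float* dX;
    cudaMalloc(&dX, bytes);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    if (async)
    {
        // Copy in, kernel and copy out are all enqueued in the same stream; the
        // calls return immediately and the host waits once at the end.
        cudaMemcpyAsync(dX, hX, bytes, cudaMemcpyHostToDevice, stream);
        scaleValues<<<1, 32, 0, stream>>>(dX, n, 2.0f);
        cudaMemcpyAsync(hX, dX, bytes, cudaMemcpyDeviceToHost, stream);
        // ... the host could do other work here ...
        cudaStreamSynchronize(stream);
    }
    else
    {
        // Each cudaMemcpy blocks the host until the copy has completed.
        cudaMemcpy(dX, hX, bytes, cudaMemcpyHostToDevice);
        scaleValues<<<1, 32, 0, stream>>>(dX, n, 2.0f);
        cudaStreamSynchronize(stream);
        cudaMemcpy(hX, dX, bytes, cudaMemcpyDeviceToHost);
    }

    std::vector<float> result(hX, hX + n);
    cudaStreamDestroy(stream);
    cudaFree(dX);
    cudaFreeHost(hX);
    return result;
}
```

The results are identical in both modes; only the points at which the host blocks differ.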
Updated release notes and performance guide.
#2 Updated by Szilárd Páll over 2 years ago
- Category set to mdrun
- Target version set to 2019-beta1
We've looked at the atomics-based reduction and it seems to be a lot faster than the current naive one, so we'll go with that. This means that the remaining task is to decide how we will do the final force reduction. There are two important, distinct cases:
- without DD, with NB and PME also running on the same device: could share the force output with PME?
- with DD, when non-local forces are needed early for communication: needs a separate buffer, possibly further reduced on the CPU side