Feature #3029

Updated by Szilárd Páll 4 months ago

Implement the force buffer layout transform on the GPU (from the nbnxn layout to the native layout). As on the CPU, this task could in some cases be combined with the force reduction to produce the final force buffer that the integration can take as input.

Use-cases (TODO list all use-cases/subcases):
# buffer ops transform-only on GPU
* optional D2H transfer of the force buffer (transferred by default)
# reduce with other on-GPU forces (currently PME)
* needs efficient on-device sync
* needs to make sure that the reduced forces are not D2H transferred
* needs to consider virial calculation (to be done on the bonded + nonbonded forces)
# reduction with CPU-side forces
* ensure that CPU-side forces H2D is issued as early as possible
* also consider virial contribution
* evaluate whether this is more efficient when the DD halo exchange is doing a staged copy anyway and the forces are already in cache (and need to be in CPU cache for MPI)
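
To make the first two use-cases concrete, here is a minimal CPU-side sketch of a layout transform with an optional reduction. It is not the GROMACS implementation: @bufferOpsF@, @Vec3@, the packed xyz stride of 3, and the @cell@ array (mapping each native atom index to its slot in the nonbonded layout) are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types and names, chosen for illustration only.
using Vec3 = std::array<float, 3>;

// fNbat:      forces in the nonbonded (sorted) layout, assumed packed as xyz per slot.
// cell[a]:    assumed slot index in the nonbonded layout for native atom a.
// fNative:    force buffer in native atom order.
// accumulate: true  -> add to the forces already in fNative (transform + reduction),
//             false -> overwrite fNative (transform only).
void bufferOpsF(const std::vector<float>& fNbat,
                const std::vector<int>&   cell,
                std::vector<Vec3>&        fNative,
                bool                      accumulate)
{
    assert(fNative.size() >= cell.size());
    for (std::size_t a = 0; a < cell.size(); ++a)
    {
        const std::size_t slot = static_cast<std::size_t>(cell[a]) * 3;
        for (std::size_t d = 0; d < 3; ++d)
        {
            const float f = fNbat[slot + d];
            fNative[a][d] = accumulate ? fNative[a][d] + f : f;
        }
    }
}
```

Each iteration of the outer loop is independent, so on the GPU it would presumably map to one thread per atom. With @accumulate == true@ the output buffer must already hold the other force contributions (for example PME), which is where the on-device sync requirement listed above comes from.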

General considerations:
* launch the transform kernel back-to-back after the nonbonded rather than later, next to the CPU buffer ops/reduction
* evaluate what the #atoms threshold is below which it is not worth taking the 10-15 us overhead of a kernel launch (especially for non-local buffer ops)
* -implement on-GPU event sync between PME and reduction task in the nonbonded queue (see fully functional prototype for the coordinate dependency shared earlier here: