Feature #2816: Device-side update & constraints, buffer ops and multi-GPU comms
GPU X/F buffer ops
Implement the native to nbat/nbnxn layout transforms/"buffer ops" in the nbnxn_gpu module (
- Implementing the coordinate X transform on the GPU will allow transferring only the native layout. While this may not make the code faster (considering the CUDA API overheads and that the "extra" H2D transfer is typically overlapped), it does remove the CPU from the critical path in nonbonded communication, which is beneficial to scaling and direct GPU-GPU communication.
- Similarly, the force layout transform kernel will allow direct force communication. This transform can be combined with reduction across force buffers. Multiple flavors and implementation strategies to be considered:
  - only transform (e.g. if no other force compute on the GPU);
  - transform + reduce by accumulating into the f buffer output on the GPU (e.g. reduce the NB and the RVec f PME GPU output);
  - transform + reduce multiple force buffers: reduce the result of force calculation outputs from different memory spaces (special forces on CPU, PME on separate GPU, etc.).
- the transform+reduce kernels can use simple or atomic accumulation into a reduced f output buffer; the former will require exclusive access to the target force buffer (need to wait for the completion of any kernel that produces forces into it) while the latter would only require a wait on the source force buffer(s) to be reduced into the target (e.g. GPU NB and/or CPU force buffer).
- consider inline transform function for on-the-fly transform within the nonbonded kernel; in particular for high parallelization the performance hit in the nonbonded kernel may be less than the cost of launching an extra kernel.
- need to improve/resolve ownership of GPU inputs/outputs
- pinning for currently not pinned/pinnable search data
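As a rough illustration of the two layout transforms discussed above, here is a minimal host-side C++ sketch of what such kernels would compute per slot. All names (Vec3, atomIndex, fillValue, nativeToNbLayout, nbToNativeLayout) are hypothetical and not taken from the nbnxn module. The sketch assumes the nonbonded layout is a grid-sorted, padded reordering of the native atoms described by an index array, with -1 marking padding slots. A real GPU version would assign one thread per slot, and the accumulate path corresponds to the "simple" (non-atomic) accumulation, so it assumes exclusive access to the target buffer.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical 3-vector type for illustration; not the GROMACS RVec.
using Vec3 = std::array<float, 3>;

// X buffer op (assumed form): gather native-order coordinates into the
// nonbonded slot order. atomIndex[slot] is the native atom index stored in
// that slot, or -1 for a padding slot, which is filled with fillValue.
std::vector<Vec3> nativeToNbLayout(const std::vector<Vec3>& xNative,
                                   const std::vector<int>&  atomIndex,
                                   float                    fillValue)
{
    std::vector<Vec3> xNb(atomIndex.size());
    for (std::size_t slot = 0; slot < atomIndex.size(); ++slot)
    {
        const int a = atomIndex[slot];
        if (a >= 0)
        {
            xNb[slot] = xNative[static_cast<std::size_t>(a)];
        }
        else
        {
            xNb[slot].fill(fillValue);
        }
    }
    return xNb;
}

// F buffer op (assumed form): scatter nonbonded-order forces back to native
// order. With accumulate == false this is the "only transform" flavor; with
// accumulate == true it is "transform + reduce" into an f buffer that already
// holds other force contributions. Padding slots are skipped.
void nbToNativeLayout(const std::vector<Vec3>& fNb,
                      const std::vector<int>&  atomIndex,
                      std::vector<Vec3>&       fNative,
                      bool                     accumulate)
{
    for (std::size_t slot = 0; slot < atomIndex.size(); ++slot)
    {
        const int a = atomIndex[slot];
        if (a < 0)
        {
            continue;
        }
        Vec3& target = fNative[static_cast<std::size_t>(a)];
        for (std::size_t d = 0; d < 3; ++d)
        {
            target[d] = accumulate ? target[d] + fNb[slot][d] : fNb[slot][d];
        }
    }
}
```

Because each native atom appears in at most one slot, the non-atomic accumulation above has no write conflicts within a single call; the atomic variant mentioned in the list would matter only when several producers write into the same target concurrently.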
Position buffer ops in CUDA
- improve the CUDA kernel
- waits for X copy on the PME stream to finish, need to implement sync point between PME and NB streams (in follow-up).
Implements part of #2817
#7 Updated by Szilárd Páll 17 days ago
Alan Gray wrote:
Regarding splitting this up: I am starting to work on a new patch which implements only the GPU Force buffer ops, without the PME reduction.
Have you considered also decoupling the GPU-side reduction with CPU forces? This would allow the change to go in independently from all other changes required. Also, if you keep the reduction, do consider changes I76e4b954b4ef045f299a8496b4975497720f4b89 (https://gerrit.gromacs.org/#/c/9126/) and Ie49c0fc483b274ac17e6ace9ca495c11dc719532 (must be a draft).
#10 Updated by Szilárd Páll 13 days ago
Mark Abraham wrote:
I tried to add links to Szilard's text, but one of the patches is a draft
Yes, this was one of the set of changes I uploaded as suggested improvements to the (then single) buffer ops change. The mini-branch of recommendations was kept as drafts and was never intended to undergo code review, so they would just be noise on the already noisy Gerrit site. If there is interest I can share them more widely, but the content should anyway end up in new changes intended for review.
F buffer ops (without PME reduction) patch now updated to include vectorization within the kernel and use of the haveCpuForces flag. The flag is currently always set to "true" so that the H2D transfer is activated before the buffer op; this awaits the force workload patch for availability of (haveSpecialForces || haveCpuBondedWork).
I considered further splitting this to separate the buffer ops and reduction, but it doesn't really improve isolation because haveCpuForces would still be required to determine if an extra reduction across GPU forces and CPU forces is required.
Now awaiting review.
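For readers following the flag logic, a minimal C++ sketch of how haveCpuForces could gate the H2D transfer and the extra reduction is given here. Only the names haveCpuForces, haveSpecialForces and haveCpuBondedWork come from the comment above; the structs and functions (ForceWorkload, FBufferOpsPlan, computeHaveCpuForces, planFBufferOps) are hypothetical and do not describe the actual patch.

```cpp
// Hypothetical container for the two conditions named in the comment above.
struct ForceWorkload
{
    bool haveSpecialForces;
    bool haveCpuBondedWork;
};

// Hypothetical description of what the F buffer-ops step has to do.
struct FBufferOpsPlan
{
    bool copyCpuForcesToDevice; // H2D transfer of CPU forces before the buffer op
    bool reduceWithCpuForces;   // add the CPU forces during the layout transform
};

// While workload information is not yet available (nullptr), fall back to
// "true", which matches the current always-on behavior described above.
inline bool computeHaveCpuForces(const ForceWorkload* workload)
{
    if (workload == nullptr)
    {
        return true;
    }
    return workload->haveSpecialForces || workload->haveCpuBondedWork;
}

// Both the transfer and the reduction are needed only when CPU forces exist;
// otherwise the buffer op is a pure layout transform.
inline FBufferOpsPlan planFBufferOps(bool haveCpuForces)
{
    FBufferOpsPlan plan;
    plan.copyCpuForcesToDevice = haveCpuForces;
    plan.reduceWithCpuForces   = haveCpuForces;
    return plan;
}
```

This also shows why separating the buffer ops from the reduction would not remove the dependency: both decisions hinge on the same flag.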