Feature #2817

Feature #2816: Device-side update & constraints, buffer ops and multi-GPU comms

GPU X/F buffer ops

Added by Szilárd Páll 5 months ago. Updated 2 months ago.

Status: Accepted
Priority: Normal
Assignee: -
Category: mdrun
Target version: -
Difficulty: uncategorized

Description

Implement the native to nbat/nbnxn layout transforms/"buffer ops" in the nbnxn_gpu module (nbnxn_atomdata_add_nbat_f_to_f and nbnxn_atomdata_copy_x_to_nbat_x).

Role and scope:
  • Implementing the coordinate X transform on the GPU will allow transferring only the native layout. This may not make the code faster, considering the CUDA API overheads and that the "extra" H2D transfer is typically overlapped, but it does remove the CPU from the critical path in nonbonded communication, which benefits scaling and direct GPU-GPU communication.
  • Similarly, the force layout transform kernel will allow direct force communication. This transform can be combined with reduction across force buffers. Multiple flavors and implementation strategies are to be considered:
    - transform only (e.g. if there is no other force compute on the GPU);
    - transform + reduce by accumulating into the f buffer output on the GPU (e.g. reduce the NB and the RVec f PME GPU output);
    - transform + reduce multiple force buffers: reduce the force calculation outputs from different memory spaces (special forces on CPU, PME on separate GPU, etc.);
    - the transform+reduce kernels can use simple or atomic accumulation into a reduced f output buffer. The former requires exclusive access to the target force buffer (waiting for the completion of any kernel that produces forces into it), while the latter only requires waiting on the source force buffer(s) to be reduced into the target (e.g. GPU NB and/or CPU force buffer);
    - consider an inline transform function for on-the-fly transform within the nonbonded kernel; in particular at high parallelization, the performance hit in the nonbonded kernel may be less than the cost of launching an extra kernel.
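
As an illustration of what the coordinate transform does, here is a minimal serial C++ sketch of a native-to-packed gather. It is not the GROMACS implementation: the names Rvec, copyXToPackedX, slotToAtom, stride and fillerValue are hypothetical, and the packed layout (a fixed number of floats per slot, with -1 marking padding slots) is an assumption made for the example. A GPU version would assign one thread per slot instead of looping.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a native-layout coordinate (x, y, z).
using Rvec = std::array<float, 3>;

// Gather coordinates from the native layout into a packed buffer with
// `stride` floats per slot. slotToAtom[i] is the native atom index stored
// in slot i, or -1 for a padding slot (assumed convention for this sketch).
std::vector<float> copyXToPackedX(const std::vector<Rvec>& x,
                                  const std::vector<int>&  slotToAtom,
                                  std::size_t              stride,
                                  float                    fillerValue)
{
    assert(stride >= 3); // each slot must hold at least x, y, z
    std::vector<float> packed(slotToAtom.size() * stride, 0.0F);
    for (std::size_t slot = 0; slot < slotToAtom.size(); ++slot)
    {
        const int atom = slotToAtom[slot];
        for (std::size_t d = 0; d < 3; ++d)
        {
            // Padding slots receive the filler value instead of a real coordinate.
            packed[slot * stride + d] =
                    (atom >= 0) ? x[static_cast<std::size_t>(atom)][d] : fillerValue;
        }
    }
    return packed;
}
```

Because the gather reads only the native-layout coordinates, only that one layout would need to be present on the device in this sketch.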
Related TODOs:
  • need to improve/resolve ownership of GPU inputs/outputs
  • pinning for currently not pinned/pinnable search data
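
The force-side flavors listed under "Role and scope" (transform only, transform + accumulate, transform + reduce several buffers) can be sketched in serial C++ as follows. This is only a sketch under assumptions: Rvec, Accumulate, addPackedFToF, slotToAtom and stride are hypothetical names, the packed layout is the same assumed one (a fixed number of floats per slot, -1 for padding slots), and each atom is assumed to appear in at most one slot.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a native-layout force (fx, fy, fz).
using Rvec = std::array<float, 3>;

// Whether the result overwrites the target buffer or is added to it.
enum class Accumulate { No, Yes };

// Scatter forces from the packed layout back to native atom order.
// Optionally sums in extra force buffers that are already in native layout
// (e.g. forces computed elsewhere), then writes or accumulates into f.
void addPackedFToF(const std::vector<float>&                    packedF,
                   const std::vector<int>&                      slotToAtom,
                   std::size_t                                  stride,
                   const std::vector<const std::vector<Rvec>*>& extraNativeF,
                   Accumulate                                   accumulate,
                   std::vector<Rvec>&                           f)
{
    assert(stride >= 3);
    assert(packedF.size() >= slotToAtom.size() * stride);
    for (std::size_t slot = 0; slot < slotToAtom.size(); ++slot)
    {
        if (slotToAtom[slot] < 0)
        {
            continue; // padding slot carries no real force
        }
        const auto atom = static_cast<std::size_t>(slotToAtom[slot]);
        for (std::size_t d = 0; d < 3; ++d)
        {
            float sum = packedF[slot * stride + d];
            for (const std::vector<Rvec>* extra : extraNativeF)
            {
                sum += (*extra)[atom][d]; // reduce over additional buffers
            }
            if (accumulate == Accumulate::Yes)
            {
                f[atom][d] += sum;
            }
            else
            {
                f[atom][d] = sum;
            }
        }
    }
}
```

In a parallel GPU kernel the Accumulate::Yes path is where the choice between simple and atomic accumulation would arise: a plain read-modify-write needs exclusive access to f, while an atomic add does not.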

Subtasks

Feature #2934: GPU X Buffer ops (New)

Associated revisions

Revision 20934303 (diff)
Added by Alan Gray 2 months ago

Position buffer ops in CUDA

TODO:
- improve the CUDA kernel

Note:
- waits for X copy on the PME stream to finish, need to implement sync
point between PME and NB streams (in follow-up).

Implements part of #2817

Change-Id: Ib87dabd74a02727898681249691ac9786b8ac65c

Revision 3cced793 (diff)
Added by Alan Gray 2 months ago

F buffer operations in CUDA

TODO: split out reduction

Implements part of #2817

Change-Id: I80c95438e44167b6a9d9d74c27709379f6665867

Revision 42b343b9 (diff)
Added by Alan Gray 10 days ago

Position buffer ops in CUDA

On all but search steps the buffer ops transform can now be done on a
CUDA GPU. If PME runs on the same GPU the already uploaded coordinates
will be used as input.

Activate with GMX_USE_GPU_BUFFER_OPS env variable.

Note:
- waits for X copy on the PME stream to finish, need to implement sync
point between PME and NB streams (in follow-up).

Implements part of #2817

Change-Id: Ib87dabd74a02727898681249691ac9786b8ac65c

History

#1 Updated by Szilárd Páll 5 months ago

  • Description updated (diff)

#2 Updated by Szilárd Páll 3 months ago

  • Description updated (diff)
  • Status changed from New to Accepted

#3 Updated by Gerrit Code Review Bot 3 months ago

Gerrit received a related patchset '5' for Issue #2817.
Uploader: Szilárd Páll
Change-Id: gromacs~master~Ib87dabd74a02727898681249691ac9786b8ac65c
Gerrit URL: https://gerrit.gromacs.org/9169

#4 Updated by Gerrit Code Review Bot 3 months ago

Gerrit received a related patchset '3' for Issue #2817.
Uploader: Szilárd Páll
Change-Id: gromacs~master~I80c95438e44167b6a9d9d74c27709379f6665867
Gerrit URL: https://gerrit.gromacs.org/9170

#5 Updated by Alan Gray 3 months ago

Regarding splitting this up: I am starting to work on a new patch which implements only the GPU Force buffer ops, without the PME reduction.

#6 Updated by Gerrit Code Review Bot 2 months ago

Gerrit received a related patchset '1' for Issue #2817.
Uploader: Alan Gray
Change-Id: gromacs~master~Ice984425301d24bac1340e883698244489cd686e
Gerrit URL: https://gerrit.gromacs.org/9275

#7 Updated by Szilárd Páll 2 months ago

Alan Gray wrote:

Regarding splitting this up: I am starting to work on a new patch which implements only the GPU Force buffer ops, without the PME reduction.

Have you considered also decoupling the GPU-side reduction with CPU forces? This would allow the change to go in independently from all other changes required. Also, if you keep the reduction, do consider changes I76e4b954b4ef045f299a8496b4975497720f4b89 (https://gerrit.gromacs.org/#/c/9126/) and Ie49c0fc483b274ac17e6ace9ca495c11dc719532 (must be a draft).

#8 Updated by Alan Gray 2 months ago

OK, thanks - I will think about this and adjust the new patch accordingly.

#9 Updated by Mark Abraham 2 months ago

I tried to add links to Szilard's text, but one of the patches is a draft

#10 Updated by Szilárd Páll 2 months ago

Mark Abraham wrote:

I tried to add links to Szilard's text, but one of the patches is a draft

Yes, this was one of the set of changes I uploaded as suggested improvements to the (then single) buffer ops change. The mini-branch of recommendations was kept as drafts and was never intended to undergo code review, so the changes would just be noise on the already noisy Gerrit site. If there is interest I can share them more widely, but the content should anyway end up in new changes intended for review.

#11 Updated by Alan Gray 2 months ago

F buffer ops (without PME reduction) patch now updated to include vectorization within the kernel and use of the haveCpuForces flag. The flag is currently always set to "true" so that the H2D transfer is activated before the buffer op, awaiting the force workload patch for availability of (haveSpecialForces || haveCpuBondedWork).

I considered further splitting this to separate the buffer ops and reduction, but it doesn't really improve isolation because haveCpuForces would still be required to determine if an extra reduction across GPU forces and CPU forces is required.

Now awaiting review.

#12 Updated by Alan Gray 2 months ago

Now rebased such that (haveSpecialForces || haveCpuBondedWork) is available. But we can't use it just yet to make the H2D transfer conditional until the PME reduction is also integrated into the buffer ops. So still setting haveCpuForces=true for now.
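
The condition described in this comment can be written down as a small C++ sketch. The flag names haveSpecialForces and haveCpuBondedWork come from the comments above; the struct ForceWorkFlags, the function needCpuForceUpload and the parameter pmeReductionInBufferOps are hypothetical names introduced only for illustration, not identifiers from the patch.

```cpp
// Hypothetical container for the two workload flags named in the comments.
struct ForceWorkFlags
{
    bool haveSpecialForces;
    bool haveCpuBondedWork;
};

// Decide whether the CPU force buffer needs an H2D transfer before the
// GPU force buffer op. pmeReductionInBufferOps is an assumed switch for
// "the PME reduction is integrated into the buffer ops".
bool needCpuForceUpload(const ForceWorkFlags& flags, bool pmeReductionInBufferOps)
{
    if (!pmeReductionInBufferOps)
    {
        // Current state per the comment: the transfer cannot be made
        // conditional yet, so haveCpuForces is effectively always true.
        return true;
    }
    return flags.haveSpecialForces || flags.haveCpuBondedWork;
}
```

With pmeReductionInBufferOps set to false, this reproduces the interim behaviour of always transferring.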
