Feature #2817

Feature #2816: Device-side update & constraints, buffer ops and multi-GPU comms

GPU X/F buffer ops

Added by Szilárd Páll 7 months ago. Updated 18 days ago.

Status:
In Progress
Priority:
High
Assignee:
-
Category:
mdrun
Target version:
Difficulty:
uncategorized

Description

Implement the native to nbat/nbnxn layout transforms/"buffer ops" in the nbnxn_gpu module (nbnxn_atomdata_add_nbat_f_to_f and nbnxn_atomdata_copy_x_to_nbat_x).
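For illustration, the coordinate transform can be thought of as a gather from the native atom order into a grid-ordered, padded buffer. The following is a minimal CPU-side sketch of that idea, not the actual nbnxn code: the function name, the `Float4` type, the use of -1 to mark padding slots and the filler value are all assumptions made for this example. On the GPU, each iteration of the loop would map to one thread.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the native coordinate type and a packed xyzq entry.
using RVec = std::array<float, 3>;
struct Float4
{
    float x, y, z, w;
};

// Gather native-order coordinates into a grid-ordered, padded buffer.
// atomIndices[slot] is the native index of the atom stored in that slot,
// or -1 for a padding slot (an assumption of this sketch).
// Only x, y, z are written; the w component (e.g. a charge) is left untouched.
inline void sketchCopyXToNbatX(const std::vector<RVec>& x,
                               const std::vector<int>&  atomIndices,
                               float                    fillerValue,
                               std::vector<Float4>&     xq)
{
    assert(xq.size() == atomIndices.size());
    // On a GPU, each slot would be handled by one thread.
    for (std::size_t slot = 0; slot < atomIndices.size(); ++slot)
    {
        const int a = atomIndices[slot];
        if (a >= 0)
        {
            assert(static_cast<std::size_t>(a) < x.size());
            xq[slot].x = x[a][0];
            xq[slot].y = x[a][1];
            xq[slot].z = x[a][2];
        }
        else
        {
            // Padding slots get a filler coordinate.
            xq[slot].x = fillerValue;
            xq[slot].y = fillerValue;
            xq[slot].z = fillerValue;
        }
    }
}
```

With such a kernel on the device, only the native-layout `x` array needs to be uploaded; the index array changes only on search steps.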

Role and scope:
  • Implementing the coordinate (X) transform on the GPU will allow transferring only the native layout. This may not make the code faster, considering the CUDA API overheads and that the "extra" H2D transfer is typically overlapped, but it does remove the CPU from the critical path in nonbonded communication, which benefits scaling and direct GPU-GPU communication.
  • Similarly, the force layout transform kernel will allow direct force communication. This transform can be combined with reduction across force buffers. Multiple flavors and implementation strategies to be considered:
    - only transform (e.g. if no other force compute on the GPU)
    - transform + reduce by accumulating into the f buffer output on the GPU (e.g. reduce the NB and the RVec f PME GPU output);
    - transform + reduce multiple force buffers: reduce the result of force calculation outputs from different memory spaces (special forces on CPU, PME on separate GPU, etc.)
    - the transform+reduce kernels can use simple or atomic accumulation into a reduced f output buffer; the former will require exclusive access to the target force buffer (need to wait for the completion of any kernel that produces forces into it) while the latter would only require a wait on the source force buffer(s) to be reduced into the target (e.g. GPU NB and/or CPU force buffer).
    - consider inline transform function for on-the-fly transform within the nonbonded kernel; in particular for high parallelization the performance hit in the nonbonded kernel may be less than the cost of launching an extra kernel.
Related TODOs:
  • need to improve/resolve ownership of GPU inputs/outputs
  • pinning for currently not pinned/pinnable search data
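
The transform-only and transform + reduce flavors listed under "Role and scope" could be sketched roughly as below. This is a CPU-side illustration under stated assumptions, not the planned kernel: the function name, the -1 padding marker and the optional `fExtra` buffer (standing in for, e.g., a CPU or PME force buffer already in native layout) are hypothetical. Because each native atom occupies exactly one slot, the plain (non-atomic) read-modify-write shown here is only safe if nothing else writes to `f` concurrently, which corresponds to the "simple accumulation" case; an atomic variant would replace the final store with an atomic add.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using RVec = std::array<float, 3>;

// Scatter grid-ordered forces back to native order, optionally reducing.
//   accumulate == false: f[a]  = fNbat[slot]   (transform only)
//   accumulate == true:  f[a] += fNbat[slot]   (transform + reduce into f)
//   fExtra != nullptr:   an additional native-layout buffer is added as well.
// atomIndices[slot] == -1 marks a padding slot (an assumption of this sketch).
inline void sketchAddNbatFToF(const std::vector<RVec>& fNbat,
                              const std::vector<int>&  atomIndices,
                              bool                     accumulate,
                              const std::vector<RVec>* fExtra,
                              std::vector<RVec>&       f)
{
    assert(fNbat.size() == atomIndices.size());
    assert(fExtra == nullptr || fExtra->size() == f.size());
    // On a GPU, each slot would be handled by one thread.
    for (std::size_t slot = 0; slot < atomIndices.size(); ++slot)
    {
        const int a = atomIndices[slot];
        if (a < 0)
        {
            continue; // padding slots carry no force
        }
        assert(static_cast<std::size_t>(a) < f.size());
        for (int d = 0; d < 3; ++d)
        {
            float sum = fNbat[slot][d];
            if (accumulate)
            {
                sum += f[a][d];
            }
            if (fExtra != nullptr)
            {
                sum += (*fExtra)[a][d];
            }
            f[a][d] = sum;
        }
    }
}
```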

Subtasks

Feature #2934: GPU X Buffer ops (New)
Task #3026: add flags for GPU force buffer op / reduction activation (New)
Feature #3029: GPU force buffer ops + reduction (In Progress)
Task #3037: add missing cycle counters related to buffer ops/reduction launches (New)

Associated revisions

Revision 20934303 (diff)
Added by Alan Gray 4 months ago

Position buffer ops in CUDA

TODO:
- improve the CUDA kernel

Note:
- waits for X copy on the PME stream to finish, need to implement sync
point between PME and NB streams (in follow-up).

Implements part of #2817

Change-Id: Ib87dabd74a02727898681249691ac9786b8ac65c

Revision 3cced793 (diff)
Added by Alan Gray 4 months ago

F buffer operations in CUDA

TODO: split out reduction

Implements part of #2817

Change-Id: I80c95438e44167b6a9d9d74c27709379f6665867

Revision 42b343b9 (diff)
Added by Alan Gray 2 months ago

Position buffer ops in CUDA

On all but search steps the buffer ops transform can now be done on a
CUDA GPU. If PME runs on the same GPU the already uploaded coordinates
will be used as input.

Activate with GMX_USE_GPU_BUFFER_OPS env variable.

Note:
- waits for X copy on the PME stream to finish, need to implement sync
point between PME and NB streams (in follow-up).

Implements part of #2817

Change-Id: Ib87dabd74a02727898681249691ac9786b8ac65c
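
As a rough illustration of the activation logic described in this revision, the sketch below checks the environment variable and skips the GPU transform on search steps. The helper names are hypothetical, and treating the variable as "set means on, regardless of value" is an assumption of this example rather than something the commit message states.

```cpp
#include <cstdlib>

// Returns true if GMX_USE_GPU_BUFFER_OPS is present in the environment.
// Assumption: any value, including an empty one, counts as enabled.
inline bool sketchGpuBufferOpsRequested()
{
    return std::getenv("GMX_USE_GPU_BUFFER_OPS") != nullptr;
}

// The X transform runs on the GPU on all but search steps.
inline bool sketchUseGpuXBufferOpsThisStep(bool requested, bool isSearchStep)
{
    return requested && !isSearchStep;
}
```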

Revision 3329a50b (diff)
Added by Szilárd Páll 12 days ago

Use HostVector for Grid/GridSet data needed on-GPU

Grid.cxy_na_, Grid.cxy_ind_, GridSet.cells and GridSet.atomIndices
have been converted from std::vector to gmx::HostVector. This allows
the code to pin the HostVector when X buffer ops is used and to
eliminate the hacky pin/unpin in CUDA buffer ops functions.

Part of #2934
Refs #2817

Change-Id: Icca21dd076128ec582f805ed96e253dfab461270

Revision 8e83edea (diff)
Added by Alan Gray 4 days ago

F buffer operations in CUDA

This patch performs GPU buffer ops for force buffers.

Enable with GMX_USE_GPU_BUFFER_OPS env variable.

Currently, the H2D transfer of the force buffer is switched on with
haveSpecialForces || haveCpuBondedWork || haveCpuPmeWork,
where haveCpuPmeWork is true even when useGpuPme == true
until on-GPU PME-nonbonded reduction is added in follow-up.

TODO: enable PME reduction in GPU buffer ops and remove associated H2D
transfer

Implements part of #2817

Change-Id: Ice984425301d24bac1340e883698244489cd686e
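
The H2D transfer condition described in this revision can be written out as a small sketch. The struct and the `pmeReducedOnGpu` flag are hypothetical: the flag stands for the follow-up on-GPU PME-nonbonded reduction and would currently always be false, so `haveCpuPmeWork` evaluates to true even when `useGpuPme` is set.

```cpp
// Hypothetical container for the flags named in the commit message.
struct SketchForceWorkFlags
{
    bool haveSpecialForces;
    bool haveCpuBondedWork;
    bool useGpuPme;
    bool pmeReducedOnGpu; // assumption: false until the follow-up change lands
};

// PME forces count as CPU-side work unless PME runs on the GPU *and* its
// output is reduced on the GPU together with the nonbonded forces.
inline bool sketchHaveCpuPmeWork(const SketchForceWorkFlags& flags)
{
    return !(flags.useGpuPme && flags.pmeReducedOnGpu);
}

// The CPU force buffer must be copied to the GPU before the buffer op
// whenever any force contribution lives in host memory.
inline bool sketchNeedCpuForceH2D(const SketchForceWorkFlags& flags)
{
    return flags.haveSpecialForces || flags.haveCpuBondedWork || sketchHaveCpuPmeWork(flags);
}
```

Once the PME reduction is integrated, the transfer becomes conditional on special forces or CPU bonded work only.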

Revision c8951db1 (diff)
Added by Szilárd Páll 1 day ago

Conditionally pin GPU-related grid data

Data that is transferred to the GPU when the buffer ops is offloaded is
now only pinned when the nonbonded module uses GPU offload, avoiding the
runtime errors encountered when a GPU-enabled build does not detect a
GPU and therefore the CUDA runtime refuses to register the memory.

Refs #2817 #2934

Change-Id: Iabbc0d9f37fad0e88cd39a078af1346e8f713ec1

History

#1 Updated by Szilárd Páll 7 months ago

  • Description updated (diff)

#2 Updated by Szilárd Páll 6 months ago

  • Description updated (diff)
  • Status changed from New to Accepted

#3 Updated by Gerrit Code Review Bot 5 months ago

Gerrit received a related patchset '5' for Issue #2817.
Uploader: Szilárd Páll
Change-Id: gromacs~master~Ib87dabd74a02727898681249691ac9786b8ac65c
Gerrit URL: https://gerrit.gromacs.org/9169

#4 Updated by Gerrit Code Review Bot 5 months ago

Gerrit received a related patchset '3' for Issue #2817.
Uploader: Szilárd Páll
Change-Id: gromacs~master~I80c95438e44167b6a9d9d74c27709379f6665867
Gerrit URL: https://gerrit.gromacs.org/9170

#5 Updated by Alan Gray 5 months ago

Regarding splitting this up: I am starting to work on a new patch which implements only the GPU Force buffer ops, without the PME reduction.

#6 Updated by Gerrit Code Review Bot 5 months ago

Gerrit received a related patchset '1' for Issue #2817.
Uploader: Alan Gray
Change-Id: gromacs~master~Ice984425301d24bac1340e883698244489cd686e
Gerrit URL: https://gerrit.gromacs.org/9275

#7 Updated by Szilárd Páll 5 months ago

Alan Gray wrote:

Regarding splitting this up: I am starting to work on a new patch which implements only the GPU Force buffer ops, without the PME reduction.

Have you considered also decoupling the GPU-side reduction with CPU forces? This would allow the change to go in independently from all other changes required. Also, if you keep the reduction, do consider changes I76e4b954b4ef045f299a8496b4975497720f4b89 (https://gerrit.gromacs.org/#/c/9126/) and Ie49c0fc483b274ac17e6ace9ca495c11dc719532 (must be a draft).

#8 Updated by Alan Gray 5 months ago

OK, thanks - I will think about this and adjust the new patch accordingly.

#9 Updated by Mark Abraham 4 months ago

I tried to add links to Szilard's text, but one of the patches is a draft

#10 Updated by Szilárd Páll 4 months ago

Mark Abraham wrote:

I tried to add links to Szilard's text, but one of the patches is a draft

Yes, this was one of the set of changes I uploaded as suggested improvements to the (then single) buffer ops change. The mini-branch of recommendations was kept as drafts and was never intended to undergo code review, so the changes would just be noise on the already noisy gerrit site. If there is interest I can share them more widely, but the content should anyway end up in new changes intended for review.

#11 Updated by Alan Gray 4 months ago

F buffer ops (without PME reduction) patch now updated to include vectorization within the kernel and use of the haveCpuForces flag, which is currently always set to "true" so that the H2D transfer is activated before the buffer op, awaiting the force workload patch for availability of (haveSpecialForces || haveCpuBondedWork).

I considered further splitting this to separate the buffer ops and reduction, but it doesn't really improve isolation because haveCpuForces would still be required to determine if an extra reduction across GPU forces and CPU forces is required.

Now awaiting review.

#12 Updated by Alan Gray 4 months ago

Now rebased such that (haveSpecialForces || haveCpuBondedWork) is available. But we can't use it just yet to make the H2D transfer conditional until the PME reduction is also integrated into the buffer ops. So still setting haveCpuForces=true for now.

#13 Updated by Szilárd Páll about 1 month ago

Issues identified in F buffer ops + reduction code; review and resolution done/pending:
  • on virial steps PME forces are needed separately
  • for DD, short-range forces are needed separately

For this reason, I suggest adding, in a separate change, a code path/kernel flavor that does out-of-place short-/long-range force reduction. It would be good to have a child F buf ops redmine to track technical issues with the feature.

#14 Updated by Szilárd Páll 28 days ago

Having looked further into the virial step issues here is what we came up with:
  • on virial steps for now turn off GPU buffer ops instead of shuffling around data to cater for the CPU-side virial reduction
  • next, port the virial reduction to the GPU (likely best called from / fused with the reduction kernel); see current code in sim_util.cpp:calc_virial().

#15 Updated by Szilárd Páll 18 days ago

  • Status changed from Accepted to In Progress
  • Target version set to 2020-beta1
