Feature #2817
Feature #2816: GPU offload / optimization for update & constraints, buffer ops and multi-GPU communication
GPU X/F buffer ops
Description
Implement the native to nbat/nbnxn layout transforms ("buffer ops") in the nbnxn_gpu module (nbnxn_atomdata_add_nbat_f_to_f and nbnxn_atomdata_copy_x_to_nbat_x).
- Implementing the coordinate X transform on the GPU will allow transferring only the native layout. While this may not make the code faster (considering the CUDA API overheads and that the "extra" H2D transfer is typically overlapped), it does remove the CPU from the critical path in nonbonded communication, which is beneficial to scaling and direct GPU-GPU communication.
- Similarly, the force layout transform kernel will allow direct force communication. This transform can be combined with reduction across force buffers. Multiple flavors and implementation strategies to be considered:
- only transform (e.g. if no other force compute on the GPU)
- transform + reduce by accumulating into the f buffer output on the GPU (e.g. reduce the NB and the RVec f PME GPU output);
- transform + reduce multiple force buffers: reduce the result of force calculation outputs from different memory spaces (special forces on CPU, PME on separate GPU, etc.)
- the transform+reduce kernels can use simple or atomic accumulation into a reduced f output buffer; the former requires exclusive access to the target force buffer (need to wait for the completion of any kernel that produces forces into it), while the latter only requires a wait on the source force buffer(s) to be reduced into the target (e.g. GPU NB and/or CPU force buffer); see the sketch after this list.
- consider an inline transform function for on-the-fly transform within the nonbonded kernel; in particular, at high parallelization the performance hit in the nonbonded kernel may be less than the cost of launching an extra kernel.
- need to resolve ownership of GPU inputs/outputs [WIP]
- pinning for currently not pinned/pinnable search data
- Ideally the force-reduction should not be called from a method of the nonbonded module (especially due to the complexities of CPU/GPU code-paths) - consider reorganizing reductions
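For illustration, a minimal CUDA sketch of the F transform with an optional fused reduction (all names hypothetical; the real nbat layout and cell indexing in the nbnxn module are more involved, and the X transform is the corresponding scatter in the opposite direction):

#include <cuda_runtime.h>

// Gathers forces from the nbat (cluster-padded) layout back to the native
// per-atom layout; optionally fuses in a second force buffer (e.g. PME or
// CPU forces) and optionally accumulates into the output instead of
// overwriting it. cell[i] maps native atom i to its slot in the nbat buffer.
__global__ void nbatFToFKernel(const float3* __restrict__ fNbat,
                               const int*    __restrict__ cell,
                               const float3* __restrict__ fToAdd, // may be nullptr
                               float3*       __restrict__ fNative,
                               int  numAtoms,
                               bool accumulate)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numAtoms)
    {
        return;
    }
    float3 f = fNbat[cell[i]];
    if (fToAdd != nullptr) // fused reduction of an extra force buffer
    {
        f.x += fToAdd[i].x;
        f.y += fToAdd[i].y;
        f.z += fToAdd[i].z;
    }
    if (accumulate) // simple accumulation: needs exclusive access to fNative
    {
        fNative[i].x += f.x;
        fNative[i].y += f.y;
        fNative[i].z += f.z;
    }
    else
    {
        fNative[i] = f;
    }
}

The atomic flavor would replace the accumulate branch with per-component atomicAdd calls, trading exclusive access to the target buffer for atomic traffic.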
Subtasks
Related issues
Associated revisions
F buffer operations in CUDA
TODO: split out reduction
Implements part of #2817
Change-Id: I80c95438e44167b6a9d9d74c27709379f6665867
Position buffer ops in CUDA
On all but search steps the buffer ops transform can now be done on a
CUDA GPU. If PME runs on the same GPU the already uploaded coordinates
will be used as input.
Activate with GMX_USE_GPU_BUFFER_OPS env variable.
TODO:
- improve the CUDA kernel
Note:
- waits for X copy on the PME stream to finish, need to implement sync
point between PME and NB streams (in follow-up).
Implements part of #2817
Change-Id: Ib87dabd74a02727898681249691ac9786b8ac65c
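The sync point mentioned in the note is, in plain CUDA terms, an event recorded on the PME stream after the coordinate H2D copy that the nonbonded stream then waits on; a minimal sketch (stream and event names are hypothetical):

#include <cuda_runtime.h>

// Enqueued right after the H2D copy of x on the PME stream: the NB stream
// will not start the X buffer ops kernel until the copy has completed,
// without blocking the host.
void enqueueXReadySync(cudaStream_t pmeStream, cudaStream_t nbStream,
                       cudaEvent_t xReadyOnDevice)
{
    cudaEventRecord(xReadyOnDevice, pmeStream);
    cudaStreamWaitEvent(nbStream, xReadyOnDevice, 0);
}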
Use HostVector for Grid/GridSet data needed on-GPU
Grid.cxy_na_, Grid.cxy_ind_, GridSet.cells and GridSet.atomIndices
have been converted from std::vector to gmx::HostVector. This allows
the code to pin the HostVector when X buffer ops is used and
eliminates the hacky pin/unpin in the CUDA buffer ops functions.
Change-Id: Icca21dd076128ec582f805ed96e253dfab461270
F buffer operations in CUDA
This patch performs GPU buffer ops for force buffers.
Enable with GMX_USE_GPU_BUFFER_OPS env variable.
Currently, the H2D transfer of the force buffer is switched on with
haveSpecialForces || haveCpuBondedWork || haveCpuPmeWork,
where haveCpuPmeWork is true even when useGpuPme == true,
until the on-GPU PME-nonbonded reduction is added in a follow-up.
TODO: enable PME reduction in GPU buffer ops and remove associated H2D
transfer
Implements part of #2817
Change-Id: Ice984425301d24bac1340e883698244489cd686e
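The gating described above reads roughly as follows (a sketch; the flag names are from the commit message, the copy call itself is hypothetical):

// The CPU force buffer only needs to reach the GPU if some force
// contribution was actually produced on the CPU.
const bool haveCpuPmeWork = true; // stays true even when useGpuPme == true,
                                  // until the on-GPU PME reduction follow-up
if (haveSpecialForces || haveCpuBondedWork || haveCpuPmeWork)
{
    launchForceH2DCopy(); // hypothetical: upload the CPU force buffer before the buffer op
}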
Conditionally pin GPU-related grid data
Data that is transferred to the GPU when buffer ops is offloaded is
now only pinned when the nonbonded module uses GPU offload, avoiding
the runtime errors encountered when a GPU-enabled build does not
detect a GPU and the CUDA runtime therefore refuses to register the
memory.
Change-Id: Iabbc0d9f37fad0e88cd39a078af1346e8f713ec1
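The conditional pinning could look roughly like this, assuming the gmx::HostVector/changePinningPolicy API introduced for the grid data in the earlier commit (a sketch, not the actual code):

#include "gromacs/gpu_utils/hostallocator.h" // assumed location of HostVector/PinningPolicy

// Pin the grid data only when the nonbonded module actually offloads to a
// GPU; with no GPU detected, registering the memory would fail at runtime.
void setGridPinningPolicy(gmx::HostVector<int>* gridData, bool nonbondedUsesGpu)
{
    gmx::changePinningPolicy(gridData,
                             nonbondedUsesGpu ? gmx::PinningPolicy::PinnedIfSupported
                                              : gmx::PinningPolicy::CannotBePinned);
}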
PME reduction for CUDA F buffer operations
Enable with GMX_USE_GPU_BUFFER_OPS env variable.
Provides functionality to perform reduction of PME forces in F buffer
ops kernel. Currently active when a single GPU performs both PME and
PP (multi-GPU support will follow in a patch which performs PME/PP
comms directly between GPUs). When active, the Device->Host copy of
the PME forces and the CPU-side reduction are disabled.
Implements part of #3029, refs #2817
Change-Id: I3e66b6919c1e86bf0bed42b74136f8694626910b
PME/PP GPU Comms for force buffer
Activate with GMX_GPU_PME_PP_COMMS env variable
Performs scatter of force buffer data from the PME task to PP tasks
directly between GPU memory spaces. Uses direct CUDA memory copies
when thread-MPI is in use, otherwise CUDA-aware MPI. Uses the existing
mechanism to perform PME reduction in the CUDA F buffer ops function.
Implements part of #2817
Change-Id: I3bf934d20b9af94235532fb030e372af06328a52
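In outline, the two transport paths could look like this (a sketch with hypothetical names, not the actual implementation):

#include <cuda_runtime.h>
#include <mpi.h>

// Sends the PME force buffer to a PP task directly between GPU memory
// spaces. With thread-MPI all ranks share a CUDA context, so a direct
// device-to-device copy works; otherwise a CUDA-aware MPI implementation
// accepts the device pointer as-is.
void sendPmeForcesToPp(const float3* d_fPme, float3* d_fPpRemote, int numAtoms,
                       bool useThreadMpi, int ppRank, MPI_Comm comm,
                       cudaStream_t stream)
{
    if (useThreadMpi)
    {
        // d_fPpRemote is the PP rank's device buffer, valid in the shared context.
        cudaMemcpyAsync(d_fPpRemote, d_fPme, numAtoms * sizeof(float3),
                        cudaMemcpyDeviceToDevice, stream);
    }
    else
    {
        // CUDA-aware MPI: device pointer passed straight to MPI.
        MPI_Send(d_fPme, numAtoms * 3, MPI_FLOAT, ppRank, 0, comm);
    }
}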
Make the wait on PME GPU results conditional
When the PME forces are reduced on-GPU and no energy/virial output is
produced, we can avoid blocking on the CPU while waiting for the PME
GPU tasks to complete.
This however would break the timing accounting, which needs to happen
after the PME tasks have completed. Hence the accounting is moved to
the PME output clearing.
Change-Id: I4e7f3aa43754a187fe5d6b584803444967516958
GPU Coordinate PME/PP Communications
Extends the PmePpCommGpu class to provide PP-side support for
coordinate transfers from either GPU or CPU to the PME task, and adds
a new PmeCoordinateReceiverGpu class to receive coordinate data
directly into GPU memory on the PME task.
Implements part of #2817
Refs TODOs #3157 #3158 #3159
Change-Id: Iefa2bdfd9813282ad8b07feeb7691f16880e61a2
History
#1 Updated by Szilárd Páll about 2 years ago
- Description updated (diff)
#2 Updated by Szilárd Páll almost 2 years ago
- Description updated (diff)
- Status changed from New to Accepted
#3 Updated by Gerrit Code Review Bot almost 2 years ago
Gerrit received a related patchset '5' for Issue #2817.
Uploader: Szilárd Páll (pall.szilard@gmail.com)
Change-Id: gromacs~master~Ib87dabd74a02727898681249691ac9786b8ac65c
Gerrit URL: https://gerrit.gromacs.org/9169
#4 Updated by Gerrit Code Review Bot almost 2 years ago
Gerrit received a related patchset '3' for Issue #2817.
Uploader: Szilárd Páll (pall.szilard@gmail.com)
Change-Id: gromacs~master~I80c95438e44167b6a9d9d74c27709379f6665867
Gerrit URL: https://gerrit.gromacs.org/9170
#5 Updated by Alan Gray almost 2 years ago
Regarding splitting this up: I am starting to work on a new patch which implements only the GPU Force buffer ops, without the PME reduction.
#6 Updated by Gerrit Code Review Bot almost 2 years ago
Gerrit received a related patchset '1' for Issue #2817.
Uploader: Alan Gray (alang@nvidia.com)
Change-Id: gromacs~master~Ice984425301d24bac1340e883698244489cd686e
Gerrit URL: https://gerrit.gromacs.org/9275
#7 Updated by Szilárd Páll almost 2 years ago
Alan Gray wrote:
Regarding splitting this up: I am starting to work on a new patch which implements only the GPU Force buffer ops, without the PME reduction.
Have you considered also decoupling the GPU-side reduction of the CPU forces? This would allow the change to go in independently of all other changes required. Also, if you keep the reduction, do consider changes I76e4b954b4ef045f299a8496b4975497720f4b89 (https://gerrit.gromacs.org/#/c/9126/) and Ie49c0fc483b274ac17e6ace9ca495c11dc719532 (must be a draft).
#8 Updated by Alan Gray almost 2 years ago
OK, thanks - I will think about this and adjust the new patch accordingly.
#9 Updated by Mark Abraham almost 2 years ago
I tried to add links to Szilard's text, but one of the patches is a draft
#10 Updated by Szilárd Páll almost 2 years ago
Mark Abraham wrote:
I tried to add links to Szilard's text, but one of the patches is a draft
Yes, this was one of the set of changes I uploaded as suggested improvements to the (then single) buffer ops change. The mini-branch of recommendations was kept as drafts and was never intended to undergo code review, so they would just be noise on the already noisy gerrit site. If there is interest I can share them more widely, but the content should anyway end up in new changes intended for review.
#11 Updated by Alan Gray almost 2 years ago
F buffer ops (without PME reduction) patch now updated to include vectorization within the kernel and use of the haveCpuForces flag, which is currently always set to "true" so that the H2D transfer is activated before the buffer op, awaiting the force workload patch for availability of (haveSpecialForces || haveCpuBondedWork).
I considered further splitting this to separate the buffer ops and reduction, but it doesn't really improve isolation because haveCpuForces would still be required to determine if an extra reduction across GPU forces and CPU forces is required.
Now awaiting review.
#12 Updated by Alan Gray almost 2 years ago
Now rebased such that (haveSpecialForces || haveCpuBondedWork) is available. But we can't use it just yet to make the H2D transfer conditional until the PME reduction is also integrated into the buffer ops. So still setting haveCpuForces=true for now.
#13 Updated by Szilárd Páll over 1 year ago
- on virial steps the PME forces are needed separately
- with DD the short-range forces are needed separately
For this reason, I suggest adding, in a separate change, a code-path/kernel flavor that does out-of-place short-/long-range force reduction. It would also be good to have a child F buf ops redmine to track technical issues with the feature.
#14 Updated by Szilárd Páll over 1 year ago
- on virial steps, for now, turn off GPU buffer ops instead of shuffling data around to cater for the CPU-side virial reduction
- next, port the virial reduction to the GPU (likely best called from / fused with the reduction kernel); see the current code in sim_util.cpp:calc_virial().
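The essence of calc_virial is the outer-product reduction Xi = -1/2 * sum_i x_i (outer) f_i (plus shift-force terms); a GPU port fused with the F reduction could accumulate it along these lines (a sketch, names hypothetical):

#include <cuda_runtime.h>

// Accumulates the per-atom virial contribution -1/2 * x_i (outer) f_i into a
// 3x3 accumulator in global memory. Fused into the F reduction kernel, this
// would run while x and f are already in registers; a production version
// would do a block-level reduction before touching global atomics.
__global__ void virialKernel(const float3* __restrict__ x,
                             const float3* __restrict__ f,
                             float* __restrict__ virial, // 9 floats, pre-zeroed
                             int numAtoms)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numAtoms)
    {
        return;
    }
    const float xi[3] = { x[i].x, x[i].y, x[i].z };
    const float fi[3] = { f[i].x, f[i].y, f[i].z };
    for (int d = 0; d < 3; d++)
    {
        for (int e = 0; e < 3; e++)
        {
            atomicAdd(&virial[3 * d + e], -0.5f * xi[d] * fi[e]);
        }
    }
}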
#15 Updated by Szilárd Páll over 1 year ago
- Status changed from Accepted to In Progress
- Target version set to 2020-beta1
#16 Updated by Szilárd Páll over 1 year ago
- Description updated (diff)
#17 Updated by Alan Gray over 1 year ago
- Description updated (diff)
#18 Updated by Mark Abraham over 1 year ago
- Target version changed from 2020-beta1 to 2020-beta2
#19 Updated by Szilárd Páll over 1 year ago
- Related to Task #3171: schedule CPU H2D force contribution in separate stream added
#20 Updated by Paul Bauer about 1 year ago
- Target version changed from 2020-beta2 to 2020-beta3
bump
#21 Updated by Paul Bauer about 1 year ago
- Target version changed from 2020-beta3 to 2020-rc1
how much of this is done? Still realistic for 2020?
#22 Updated by Paul Bauer about 1 year ago
- Target version changed from 2020-rc1 to 2021-infrastructure-stable
apparently not realistic for 2020
#23 Updated by Alan Gray 12 months ago
- Status changed from In Progress to Closed
Moved outstanding issues to umbrella task https://redmine.gromacs.org/issues/3370
#24 Updated by Szilárd Páll 11 months ago
- the transform+reduce kernels can use simple or atomic accumulation into a reduced f output buffer; the former requires exclusive access to the target force buffer (need to wait for the completion of any kernel that produces forces into it), while the latter only requires a wait on the source force buffer(s) to be reduced into the target (e.g. GPU NB and/or CPU force buffer).
- consider an inline transform function for on-the-fly transform within the nonbonded kernel; in particular, at high parallelization the performance hit in the nonbonded kernel may be less than the cost of launching an extra kernel.
Have these points been moved somewhere or considered and somehow ruled out?
Related TODOs:
- Ideally the force-reduction should not be called from a method of the nonbonded module (especially due to the complexities of CPU/GPU code-paths) - consider reorganizing reductions
Has this been moved elsewhere?
#25 Updated by Alan Gray 11 months ago
I had already moved the TODO into https://redmine.gromacs.org/issues/3370 ("Force buffer op and reduction cleanup/improvement" section) - I've now also pasted the other points there too. These are still relevant and hadn't been forgotten; on Monday Artem and I actually discussed re-working the force buffer ops and the need for a new purpose-built object that can perform reductions on an arbitrary number of input buffers, wait on an arbitrary number of events, etc. There is an open question as to whether this object should also handle the transform or whether that should be separate. The latter has the disadvantage of a separate kernel call, unless we go for the "on-the-fly" option, which might be the best solution. Your thoughts are very welcome on this.
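A hypothetical shape for such a reduction object (purely illustrative, not an agreed design):

#include <vector>
#include <cuda_runtime.h>

// Purpose-built GPU force reduction: takes any number of input force
// buffers and dependency events, and launches a single (optionally
// transforming and accumulating) reduction kernel into the target buffer.
class GpuForceReduction
{
public:
    void setTarget(float3* d_fTarget, int numAtoms, bool accumulate);
    void registerInput(const float3* d_fInput); // e.g. nbat NB forces, PME forces
    void addDependency(cudaEvent_t dependency); // producer kernel/copy to wait on
    void execute(cudaStream_t stream);          // waits on dependencies, launches kernel

private:
    float3*                    d_fTarget_  = nullptr;
    int                        numAtoms_   = 0;
    bool                       accumulate_ = false;
    std::vector<const float3*> inputs_;
    std::vector<cudaEvent_t>   dependencies_;
};

Whether the nbat-to-native transform belongs inside execute() or in a separate (or on-the-fly) kernel is exactly the open question above.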