Feature #2816

Device-side update & constraints, buffer ops and multi-GPU comms

Added by Alan Gray 5 months ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assignee:
Category: mdrun
Target version:
Difficulty: hard
Close

Description

GROMACS performs sub-optimally on modern GPU servers.

When running on a single GPU, all force calculations are now done on the device, but the buffer operations plus update & constraints are done on the host, and repeated PCIe transfers are required. This CPU computation and PCIe communication make up an increasingly significant overhead as GPU performance continues to increase with each generation.

On multi-GPU the situation is even worse because the required multi-GPU communications are routed through the CPU.

NVIDIA have developed prototype code with all compute and communication parts now device-side, with coordinate and force PCIe transfers removed for regular timesteps. Gerrit patch 8506 introduces device-side buffer ops, and patch 8859 (based on the buffer ops patch) demonstrates the remainder of the new developments:

  • GPU Update and Constraints
  • Device MPI: PME/PP Gather and Scatter
    - Relatively straightforward solution using CUDA-Aware MPI
  • Device MPI: PP local/nonlocal exchanges
    - New functionality to pack device-buffers and exchange using CUDA-aware MPI
    - Similar D2D exchanges also for Constraints Lincs part
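The buffer packing mentioned in the list above can be sketched as a gather through an index map into a contiguous send buffer. This is a minimal host-side C++ illustration, not the code from the patches: `Vec3`, `packSendBuffer` and `indexMap` are hypothetical names, and in a device-side version each loop iteration would presumably be handled by one GPU thread, with the resulting contiguous buffer handed to CUDA-aware MPI.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 3-component coordinate type, standing in for a device-side vector type.
using Vec3 = std::array<float, 3>;

// Gather the coordinates listed in indexMap into a contiguous send buffer.
// In a GPU version, each loop iteration would map to one thread of a packing kernel.
std::vector<Vec3> packSendBuffer(const std::vector<Vec3>&        coordinates,
                                 const std::vector<std::size_t>& indexMap)
{
    std::vector<Vec3> sendBuffer(indexMap.size());
    for (std::size_t i = 0; i < indexMap.size(); ++i)
    {
        // Every index must refer to an existing atom.
        assert(indexMap[i] < coordinates.size());
        sendBuffer[i] = coordinates[indexMap[i]];
    }
    return sendBuffer;
}
```

Packing into a contiguous buffer means a single message per neighbour can carry all halo atoms, regardless of how they are scattered in the local coordinate array.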

See the attached slides for more info.

These developments show major performance improvements, but are still in prototype form, and the purpose of this issue is to track the work required to integrate properly into the master branch.

NVDevUpdate21Dec18.pdf (1.21 MB): Slides describing NVIDIA developments. Alan Gray, 12/21/2018 11:15 AM

Subtasks

Feature #2817: GPU X/F buffer ops (Accepted)
Feature #2934: GPU X Buffer ops (New)
Feature #2885: CUDA version of LINCS (New, Artem Zhmurov)
Feature #2886: CUDA version of SETTLE (New, Artem Zhmurov)
Feature #2887: CUDA version of Leap Frog algorithm (New, Artem Zhmurov)
Feature #2888: CUDA Update and Constraints module (New, Artem Zhmurov)
Feature #2890: GPU Halo Exchange (New)
Feature #2891: PME/PP GPU communications (New)
Feature #2915: GPU direct communications (New)

Associated revisions

Revision bec0fa7b (diff)
Added by Artem Zhmurov 3 months ago

Test for LINCS and SHAKE constraints.

This version updates the tests, making the selection of the
constraining algorithm more abstract. This makes it possible
to use the same test routines for new implementations (e.g.
CPU- or GPU-based) and/or algorithms (e.g. LINCS or SHAKE).
This is partly preparation for the GPU-based version of
the constraints (Refs #2816).

Change-Id: Ice7dfdcc6d86c04656b0a1dd4e328c5afdb8a263

Revision 0a1aae78 (diff)
Added by Artem Zhmurov 22 days ago

CUDA version of LINCS constraints.

Implementation of the LINCS constraints for NVIDIA GPUs.
Currently works isolated from the other parts of the code:
coordinates and velocities are copied to and from GPU on
every integration timestep. Part of the GPU-only loop.
Loosely based on change 9162 by Alan Gray. To enable,
set the environment variable GMX_LINCS_GPU.

Limitations:
1. Works only if the constraints can be split into short
uncoupled groups (currently < 256, designed for H-bond
constraints).
2. Does not change the matrix inversion order for constraint
triangles.
3. Does not support free energy computations.
4. Assumes no communication between domains (i.e. assumes that
there are no constraints connecting atoms from two different
domains).
5. The number of threads per block must be a power of 2 for
the reduction of the virial to work.
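A power-of-2 restriction like the one in the last limitation typically arises from a halving tree reduction. The commit message does not show the kernel, so the following is only an assumed scheme, emulated on the host in C++ with the hypothetical name `blockTreeReduce`. Each vector entry plays the role of one thread's partial sum.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a halving tree reduction over one thread block.
// Each entry of `values` stands for one thread's partial sum.
float blockTreeReduce(std::vector<float> values)
{
    assert(!values.empty());
    // Halve the active range at every step: slot i accumulates slot i + stride.
    for (std::size_t stride = values.size() / 2; stride > 0; stride /= 2)
    {
        for (std::size_t i = 0; i < stride; ++i)
        {
            values[i] += values[i + stride];
        }
    }
    return values[0];
}
```

With 8 entries of 1.0 this returns 8.0. With 6 entries it returns 4.0, because after the first step three partial sums remain and the integer-halved stride of 1 never reaches the third. This is why a scheme of this kind needs a power-of-2 block size or explicit handling of the leftover elements.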

TODOs:
1. Move more data from the global memory to local.
2. Change .at() to []
3. Add sorting by the number of coupled constraints to decrease
warp divergence.
4. numAtoms should be changeable (for multi-GPU case).

Refs #2816, #2885

Change-Id: I3c975cf898053b7467bcd30459e60ce2c8852be6

Revision 02a92f23 (diff)
Added by Artem Zhmurov 20 days ago

CUDA version of SETTLE algorithm with basic tests

CUDA-based GPU implementation of SETTLE. This is part of the
all-GPU loop. It can work isolated from other parts of the code
since coordinates are copied to (from) the device before (after)
the SETTLE kernel call. The velocity update as well as virial
evaluation can be enabled.

To enable, set the GMX_SETTLE_GPU environment variable.
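An environment-variable switch of this kind can be checked as in the minimal C++ sketch below. The helper name `isEnabledByEnvironment` is hypothetical, and treating any value (including an empty one) as "enabled" is an assumption: the commit message only says that the variable must be set.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: the feature counts as enabled when the variable is set at all.
// Whether the real code also inspects the value is not stated in the commit message.
bool isEnabledByEnvironment(const char* variableName)
{
    assert(variableName != nullptr);
    return std::getenv(variableName) != nullptr;
}
```

A call such as `isEnabledByEnvironment("GMX_SETTLE_GPU")` would then select the GPU code path at run time without requiring a rebuild.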

Limitations:
1. Does not work when domain decomposition is enabled.
2. Projection of the derivative is not implemented.
3. Not fully integrated/unified with the CPU version.

TODOs:
1. Multi-GPU case.
2. Better virial reduction. This is a more general feature,
not only related to constraints.
3. More cleanup in constr.cpp needed.
4. Better unit tests.

Refs #2816, #2886

Change-Id: I218e1bf1f86a2351e189e3c27f950f45c06135a4

Revision d061dec5 (diff)
Added by Artem Zhmurov 8 days ago

CUDA version of Leap-Frog integrator with basic tests

Part of the GPU-only loop. The current version is a stand-alone module,
with its own coordinate, velocity and force data management.
To activate, set environment variable GMX_INTEGRATE_GPU.

Limitations:

-- Only basic Leap-Frog is implemented.
-- No temperature control.
-- No pressure control.
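For reference, a basic leap-frog step without temperature or pressure control amounts to the update sketched below. This is a host-side C++ illustration of the textbook scheme, not the CUDA implementation from the commit. `Vec3`, `leapFrogStep` and `inverseMass` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 3-component vector type used for positions, velocities and forces.
using Vec3 = std::array<float, 3>;

// One basic leap-frog step without thermostat or barostat:
//   v(t + dt/2) = v(t - dt/2) + dt * f(t) / m
//   x(t + dt)   = x(t) + dt * v(t + dt/2)
void leapFrogStep(std::vector<Vec3>&        x,
                  std::vector<Vec3>&        v,
                  const std::vector<Vec3>&  f,
                  const std::vector<float>& inverseMass,
                  float                     dt)
{
    assert(x.size() == v.size() && x.size() == f.size() && x.size() == inverseMass.size());
    for (std::size_t a = 0; a < x.size(); ++a)
    {
        for (std::size_t d = 0; d < 3; ++d)
        {
            // Velocity is advanced first; the new velocity then advances the position.
            v[a][d] += dt * f[a][d] * inverseMass[a];
            x[a][d] += dt * v[a][d];
        }
    }
}
```

Each atom is updated independently, which is what makes this step a natural fit for one GPU thread per atom.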

Refs #2816, #2887

Change-Id: I439d7f5fd4f69a17ca7aaa412e242ce5e3aa5dbd

History

#1 Updated by Gerrit Code Review Bot 3 months ago

Gerrit received a related patchset '26' for Issue #2816.
Uploader: Artem Zhmurov
Change-Id: gromacs~master~Ice7dfdcc6d86c04656b0a1dd4e328c5afdb8a263
Gerrit URL: https://gerrit.gromacs.org/8982

#2 Updated by Gerrit Code Review Bot 3 months ago

Gerrit received a related DRAFT patchset '1' for Issue #2816.
Uploader: Artem Zhmurov
Change-Id: gromacs~master~I3c975cf898053b7467bcd30459e60ce2c8852be6
Gerrit URL: https://gerrit.gromacs.org/9193

#3 Updated by Alan Gray 3 months ago

I want to add a subtask here for "GPU Halo exchange", but can't see a way to do it. Are special permissions required?

#4 Updated by Gerrit Code Review Bot 3 months ago

Gerrit received a related patchset '2' for Issue #2816.
Uploader: Alan Gray
Change-Id: gromacs~master~I8e6473481ad4d943df78d7019681bfa821bd5798
Gerrit URL: https://gerrit.gromacs.org/9225

#5 Updated by Gerrit Code Review Bot 3 months ago

Gerrit received a related DRAFT patchset '1' for Issue #2816.
Uploader: Artem Zhmurov
Change-Id: gromacs~master~I218e1bf1f86a2351e189e3c27f950f45c06135a4
Gerrit URL: https://gerrit.gromacs.org/9244

#6 Updated by Gerrit Code Review Bot 2 months ago

Gerrit received a related DRAFT patchset '4' for Issue #2816.
Uploader: Artem Zhmurov
Change-Id: gromacs~master~I439d7f5fd4f69a17ca7aaa412e242ce5e3aa5dbd
Gerrit URL: https://gerrit.gromacs.org/9272

#7 Updated by Szilárd Páll 2 months ago

We need to decouple these changes; several distinct features are proposed here, so we need separate redmine issues for them. I would also prefer to organize trees of issues around a certain target feature set, e.g. single-GPU no-DD all offloaded, or multi-GPU with-DD most offloaded, etc. While feature sets may overlap, the higher-level features are these parallelization functionalities, which will depend on or relate to both common and individual tasks.
Consequently, at least separate issues for LINCS, SETTLE, update, halo exchange, and PP-PME communication would be desirable, possibly even separate ones for with/without communication (when this makes sense).

#8 Updated by Alan Gray 2 months ago

Yes. I already tried to create a sub-task here for halo exchange, but couldn't see how to do it. Could you let me know how you did it for the "GPU X/F Buffer Ops" task? It may be a permissions thing.

#9 Updated by Artem Zhmurov 2 months ago

I've created blank features for the GPU-only loop. Will start filling them in.

#10 Updated by Gerrit Code Review Bot about 2 months ago

Gerrit received a related patchset '4' for Issue #2816.
Uploader: Artem Zhmurov
Change-Id: gromacs~master~I8730aad0ecaa0230686fe89d1157b0da2f01f7bc
Gerrit URL: https://gerrit.gromacs.org/9329

#11 Updated by Gerrit Code Review Bot about 2 months ago

Gerrit received a related DRAFT patchset '2' for Issue #2816.
Uploader: Artem Zhmurov
Change-Id: gromacs~master~I4c65a6c7088fd8059f4e7fa3cb4637cb2af79ebc
Gerrit URL: https://gerrit.gromacs.org/9349
