Feature #3126

State propagator GPU data manager

Added by Artem Zhmurov about 1 month ago. Updated about 1 month ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Difficulty:
uncategorized

Description

To centralize the management of device buffers, a separate object is needed. This object will hold the device buffers and provide the corresponding getters. It should be able to (re)allocate device memory for them and to perform H2D and D2H copies. The basic functionality includes:

- (Re)initialization of buffers. Note that PME requires the padding of the coordinates buffer to be cleared.
- Getters for the (pointers-to-)buffers.
- H2D and D2H copies in a designated stream.
- Support for CUDA and OpenCL builds.
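
The buffer handling listed above can be illustrated with a minimal host-only sketch. It is not the GROMACS implementation: the class name `StateBuffersSketch` and all of its methods are hypothetical, a `std::vector` stands in for device memory, each atom is reduced to a single float, and streams are left out entirely. It only shows padded (re)allocation with cleared padding, a getter, and the two copy directions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only stand-in for the device buffer manager.
// std::vector plays the role of device memory; no GPU API is used.
class StateBuffersSketch
{
public:
    // (Re)allocate the coordinate buffer for numAtoms atoms. The size is
    // rounded up to a multiple of paddingMultiple and the whole buffer,
    // including the padding, is zeroed.
    void reinit(std::size_t numAtoms, std::size_t paddingMultiple)
    {
        assert(paddingMultiple > 0);
        numAtoms_ = numAtoms;
        const std::size_t paddedSize =
                ((numAtoms + paddingMultiple - 1) / paddingMultiple) * paddingMultiple;
        deviceCoordinates_.assign(paddedSize, 0.0F);
    }

    // Getter for the "device" buffer.
    const std::vector<float>& coordinates() const { return deviceCoordinates_; }

    // "H2D" copy of the first numAtoms values; the padding is left untouched.
    void copyCoordinatesToDevice(const std::vector<float>& host)
    {
        assert(host.size() >= numAtoms_);
        std::copy_n(host.begin(), numAtoms_, deviceCoordinates_.begin());
    }

    // "D2H" copy of the first numAtoms values.
    void copyCoordinatesFromDevice(std::vector<float>* host) const
    {
        host->assign(deviceCoordinates_.begin(), deviceCoordinates_.begin() + numAtoms_);
    }

private:
    std::size_t        numAtoms_ = 0;
    std::vector<float> deviceCoordinates_;
};
```

With 5 atoms and a padding multiple of 4, the buffer holds 8 elements and the last 3 stay zero after a copy. A real implementation would additionally take the stream in which each copy is enqueued.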

TODOs:
- The stream in which a copy is executed should depend on the locality (https://gerrit.gromacs.org/#/c/gromacs/+/13484/ and https://gerrit.gromacs.org/#/c/gromacs/+/13562/).
- The same state propagator data manager is used by PME/PP ranks and by PME-only ranks. Currently, this is done by setting the unused device streams to nullptr; the force and velocity buffers are kept and reallocated, but not used. To make this safer and more memory-efficient, the class should be templated on the usage scenario. Unused resources should not be allocated, and calls that access them should issue an error.
- The OpenCL limitations should be properly documented. Use of functionality that does not work with OpenCL should trigger an assertion with a descriptive message.
- The object should have two flavors of copy functions: one that marks an event upon execution, one that does not.
- Method to clear the forces buffer (https://gerrit.gromacs.org/#/c/gromacs/+/13358/).
- Add management for the event that the coordinates are ready on the GPU before the MD step. The event depends on the offload scenario, since the coordinates are either copied from the host or updated in update-constraints. A getter for the appropriate event synchronizer should be added. The event-based synchronization should be introduced in the consumer.
- The event-based synchronization before the force reduction (https://gerrit.gromacs.org/#/c/gromacs/+/13617/).
- If update is offloaded, event-based synchronization after the force reduction should also be introduced, or the reduction should be performed in the update stream.
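
One possible shape for the templating TODO above is sketched below, purely as an assumption about how it could look. The enum `UsageSketch` and the class `ScenarioBuffersSketch` are hypothetical names, and host vectors again stand in for device buffers. The PME-only variant allocates no force storage, and any call to its `forces()` getter fails at compile time through a `static_assert`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical usage scenarios; not taken from the GROMACS sources.
enum class UsageSketch
{
    PmeAndPp,
    PmeOnly
};

template<UsageSketch usage>
class ScenarioBuffersSketch
{
public:
    // Coordinates are always allocated; forces only when the scenario needs them.
    explicit ScenarioBuffersSketch(std::size_t numAtoms) :
        coordinates_(numAtoms),
        forces_(usage == UsageSketch::PmeAndPp ? numAtoms : 0)
    {
    }

    std::size_t numCoordinates() const { return coordinates_.size(); }
    std::size_t numForcesAllocated() const { return forces_.size(); }

    // The body is only instantiated when the getter is actually called,
    // so a PME-only object compiles as long as nobody asks for forces.
    std::vector<float>& forces()
    {
        static_assert(usage == UsageSketch::PmeAndPp,
                      "Forces are not available in the PME-only scenario");
        return forces_;
    }

private:
    std::vector<float> coordinates_;
    std::vector<float> forces_;
};
```

In this sketch the PME-only variant still carries an empty vector member. Removing the member altogether would need a specialization or a base-class split, which is left out for brevity.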

Associated revisions

Revision 8a0d4d97 (diff)
Added by Szilárd Páll about 1 month ago

Enable StatePropagatorGpuData for force transfers

Force transfers have been switched to use StatePropagatorGpuData already
before. This change updates the synchronization mechanisms as:
- replaces the previous stream sync after GPU buffer/ops reduction with
a waitForcesReadyOnHost call;
- removes the barriers in copyForces[From|To]Gpu() as dependencies
are now satisfied: most dependencies are intra-stream and therefore
implicit, the exception being the halo exchange that uses its own
mechanism to sync H2D in the local stream with the nonlocal stream
(which is yet to be replaced, refs #3093).

Refs. #3126.

Change-Id: I8bfd39f79c87f20492c4ae287d6f19261724f806

Revision a7759162 (diff)
Added by Artem Zhmurov about 1 month ago

Add separate constructor to StatePropagatorDataGpu for PME-only rank / PME tests

A separate constructor is added to the StatePropagatorDataGpu for use on the
separate PME rank and in PME tests. These use the provided stream to copy
coordinates for atoms with Local or All localities. Copies of coordinates for
non-local particles, as well as copy operations for the forces and velocities,
are disallowed by assertions.

Refs. #3126.

Change-Id: I66aeeaea54931398b1a4a30b920b092f7d40ae16
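
The assertion-guarded behaviour described in this commit message can be sketched on the host as follows. This is an illustration under stated assumptions, not the code from the change: `PmeOnlyGuardSketch`, `LocalitySketch`, the `int` stream handle and every method name are hypothetical, and the copy itself is reduced to a counter.

```cpp
#include <cassert>

// Hypothetical locality enum mirroring the Local / NonLocal / All split.
enum class LocalitySketch
{
    Local,
    NonLocal,
    All
};

class PmeOnlyGuardSketch
{
public:
    // Construction for a regular rank: all copies are assumed to be permitted.
    PmeOnlyGuardSketch() : pmeOnly_(false) {}

    // Construction for a PME-only rank or a PME test: one provided stream,
    // represented here by a plain integer handle.
    explicit PmeOnlyGuardSketch(int pmeStream) : pmeOnly_(true), stream_(pmeStream) {}

    int stream() const { return stream_; }

    // Non-local coordinate copies are not permitted in the PME-only case.
    bool coordinateCopyAllowed(LocalitySketch locality) const
    {
        return !pmeOnly_ || locality != LocalitySketch::NonLocal;
    }

    // Force and velocity copies are not permitted in the PME-only case.
    bool forceOrVelocityCopyAllowed() const { return !pmeOnly_; }

    // Stand-in for enqueueing an H2D coordinate copy; asserts on misuse.
    void copyCoordinatesToGpu(LocalitySketch locality)
    {
        assert(coordinateCopyAllowed(locality)
               && "Non-local coordinates cannot be copied on a PME-only rank");
        ++numCopiesEnqueued_;
    }

    int numCopiesEnqueued() const { return numCopiesEnqueued_; }

private:
    bool pmeOnly_;
    int  stream_            = -1;
    int  numCopiesEnqueued_ = 0;
};
```

Keeping each predicate separate from its assertion lets the restriction be unit-tested without triggering an abort.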

Revision 13f5fac2 (diff)
Added by Szilárd Páll about 1 month ago

Link GPU coordinate producer and consumer tasks

The event synchronizer indicating that coordinates are ready in the GPU
is now passed to the two tasks that depend on this input: PME and
X buffer ops. Both enqueue a wait on the passed event prior to kernel
launch to ensure that the coordinates are ready before the kernels
start executing.

On the separate PME ranks and in tests, as we use a single stream,
no synchronization is necessary.

With the on-device sync in place, this change also removes the
streamSynchronize call from copyCoordinatesToGpu.

Refs. #2816, #3126.

Change-Id: I3457f01f44ca6d6ad08e0118d8b1def2ab0b381b

Revision f310be38 (diff)
Added by Szilárd Páll about 1 month ago

Trigger synchronizer when local forces are ready

The synchronizer is created and managed in StatePropagatorDataGpu and is
passed to the nonbonded module at the f buffer ops init.

Refs #2888 #3126

Change-Id: Ie9bf0b6cd8511fe282e377e48f3940e591db214c

Revision 7bbfb57c (diff)
Added by Artem Zhmurov about 1 month ago

Link GPU force producer and consumer tasks

The GPU event synchronizer that indicates that forces are ready
for consumption is now passed to the GPU update-constraints.
Update-constraints enqueues a wait on the event in the update
stream before performing numerical integration and constraining.
Note that the event is conditionally returned by the
StatePropagatorDataGpu and indicates that either the reduction of
forces on the GPU or the H2D copy is done, depending on the offload
scenario in the current timestep.

Refs. #2816, #2888, #3126.

Change-Id: Ic12b0c55b75ec5f0c31ce500a2760fb4d5cf3b91
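
The conditional event getter mentioned in this commit message can be pictured with the host-only sketch below. The names `EventSketch`, `ForceEventSelectorSketch` and `forcesReadyEvent`, as well as the single boolean describing the offload scenario, are assumptions made for illustration; the actual getter and its parameters are not shown in this ticket.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a GPU event synchronizer.
struct EventSketch
{
    std::string description;
};

class ForceEventSelectorSketch
{
public:
    ForceEventSelectorSketch() :
        forcesReducedOnGpu_{ "force reduction on the GPU done" },
        forcesCopiedToGpu_{ "H2D copy of forces done" }
    {
    }

    // Return the event the consumer (here, update-constraints) should wait on.
    // Which event is relevant depends on the offload scenario of the current step.
    const EventSketch* forcesReadyEvent(bool forcesReducedOnGpuThisStep) const
    {
        return forcesReducedOnGpuThisStep ? &forcesReducedOnGpu_ : &forcesCopiedToGpu_;
    }

private:
    EventSketch forcesReducedOnGpu_;
    EventSketch forcesCopiedToGpu_;
};
```

The consumer would then enqueue a wait on the returned event in its own stream before launching the integration kernel, without needing to know which producer marked it.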

History

#1 Updated by Artem Zhmurov about 1 month ago

  • Description updated (diff)

#2 Updated by Artem Zhmurov about 1 month ago

  • Description updated (diff)
