State propagator GPU data manager
To centralize the management of device buffers, a separate object is needed. This object keeps the device buffers with corresponding getters, is able to (re)allocate the device memory for them, and performs H2D and D2H copies. The basic functionality includes:
- (Re)initialization of buffers. Note that PME requires the coordinates buffer to have cleared padding.
- Get the (pointers-to-)buffers.
- Performing H2D and D2H copies in a designated stream.
- Support for CUDA and OpenCL builds.
- The streams in which copies are executed should depend on the locality (https://gerrit.gromacs.org/#/c/gromacs/+/13484/ and https://gerrit.gromacs.org/#/c/gromacs/+/13562/).
- The same state propagator data manager is used by PME/PP ranks and by the PME-only rank. Currently, this is handled by setting the unused device streams to nullptr; the force and velocity buffers are still kept and reallocated, but not used. To make this safer and more memory-efficient, the class should be templated on the usage scenario: the unused resources should not be allocated, and calls to them should issue an error.
- The OpenCL limitations should be properly documented. Use of functionality that does not work with OpenCL should be guarded by assertions with a descriptive message.
- The object should have two flavors of copy functions: one that marks an event upon execution and one that does not.
- Method to clear the forces buffer (https://gerrit.gromacs.org/#/c/gromacs/+/13358/).
- Add management for the event signaling that the coordinates are ready on the GPU before the MD step. The event depends on the offload scenario, since the coordinates are either copied from the host or updated in update-constraints. A getter for the appropriate event synchronizer should be added, and event-based synchronization should be introduced in the consumer.
- The event-based synchronization before the force reduction (https://gerrit.gromacs.org/#/c/gromacs/+/13617/).
- If update is offloaded, event-based synchronization after the force reduction should also be introduced, or the reduction should be performed in the update stream.
Enable StatePropagatorDataGpu for force transfers
Force transfers were already switched to use StatePropagatorDataGpu.
This change updates the synchronization mechanisms as follows:
- replaces the previous stream sync after the GPU buffer ops reduction
  with a waitForcesReadyOnHost call;
- removes the barriers in copyForces[From|To]Gpu() as the dependencies
  are now satisfied: most dependencies are intra-stream and therefore
  implicit, the exception being the halo exchange, which uses its own
  mechanism to sync the H2D copy in the local stream with the non-local
  stream (which is yet to be replaced; Refs #3093).
Add separate constructor to StatePropagatorDataGpu for PME-only rank / PME tests
A separate constructor is added to StatePropagatorDataGpu for use on the
separate PME rank and in PME tests. These use the provided stream to copy
coordinates for atoms with Local or All locality. Copying coordinates for
non-local atoms, as well as copy operations for the forces and velocities,
is disallowed by assertions.
Link GPU coordinate producer and consumer tasks
The event synchronizer indicating that the coordinates are ready on the GPU
is now passed to the two tasks that depend on this input: PME and the
X buffer ops. Both enqueue a wait on the passed event prior to kernel
launch to ensure that the coordinates are ready before the kernels execute.
On the separate PME ranks and in tests, where a single stream is used,
no synchronization is necessary.
With the on-device sync in place, this change also removes the
streamSynchronize call from copyCoordinatesToGpu.
Link GPU force producer and consumer tasks
The GPU event synchronizer that indicates that the forces are ready
for consumption is now passed to the GPU update-constraints.
The update-constraints enqueues a wait on the event in the update
stream before performing numerical integration and constraining.
Note that the event is conditionally returned by the
StatePropagatorDataGpu and indicates that either the reduction of
forces on the GPU or the H2D copy is done, depending on the offload
scenario on the current timestep.