As agreed the data dependency sychronization should be implemented on the consumer task's end which is PME spread in the case of PME. PME-only ranks have the receive enqueue wait as soon as MPI returns. Consider assembling a list of events and passed to spread instead.
Consider whether having to receive from multiple PP ranks actually makes is more beneficial to overlap some receive with event wait enqueue.
GPU Coordinate PME/PP Communications
Extends PmePpCommGpu class to provide PP-side support for coordinate
transfers from either GPU or CPU to PME task, and adds new
PmeCoordinateReceiverGpu class to recieve coordinate data directly to
the GPU on the PME task.