When a run with GPU update also has CPU tasks, the coordinate (x) D2H transfer should overlap with the GPU force compute. However, because of the simplistic safety measures put in place to keep the StatePropagatorGpu code maintainable, we still launch the transfer and wait for it back-to-back, which prevents any overlap.
This makes the x D2H a pure overhead in GPU update runs. To avoid that, we should move the waits to just before the CPU tasks that consume x (preferably after we have the buffer state tracking in StatePropagatorGpu, so that we don't incur the overhead of multiple CUDA wait calls).
Allow x D2H to overlap with GPU force compute
With GPU update, coordinates are transferred back to the CPU every step
if there are forces to compute on the CPU. Originally this was
implemented with a back-to-back transfer launch and wait at the
beginning of do_force().
This change moves the CPU wait for the completion of the coordinate
transfer closer to the consumer tasks, which avoids blocking the launch
of GPU force tasks and allows compute and transfer to overlap.
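
The reordering can be illustrated with a minimal, host-only C++ sketch. It is
not the GROMACS implementation: std::async and std::future stand in for GPU
streams and events, and copyCoordinatesToHost, computeGpuForces,
computeCpuForces, StepResult and doForceStep are hypothetical names invented
for this example. The point is only the ordering: the transfer is launched
first, the GPU force work is launched next, and the wait is placed directly
before the CPU task that consumes x.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical stand-in for the asynchronous x D2H copy.
std::vector<float> copyCoordinatesToHost(const std::vector<float>& deviceX)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(10)); // models transfer latency
    return deviceX;
}

// Hypothetical stand-in for the GPU force compute.
float computeGpuForces(const std::vector<float>& deviceX)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(10)); // models kernel run time
    return std::accumulate(deviceX.begin(), deviceX.end(), 0.0F);
}

// Hypothetical CPU task that consumes the host copy of x.
float computeCpuForces(const std::vector<float>& hostX)
{
    return 2.0F * std::accumulate(hostX.begin(), hostX.end(), 0.0F);
}

struct StepResult
{
    float gpuForce;
    float cpuForce;
};

StepResult doForceStep(const std::vector<float>& deviceX)
{
    // Launch the transfer, but do not wait for it here.
    std::future<std::vector<float>> xTransfer =
            std::async(std::launch::async, [&deviceX] { return copyCoordinatesToHost(deviceX); });

    // Launch the GPU force work right away; it is not blocked by the
    // transfer and can run concurrently with it.
    std::future<float> gpuForces =
            std::async(std::launch::async, [&deviceX] { return computeGpuForces(deviceX); });

    // Wait for the coordinates only now, directly before the CPU consumer.
    assert(xTransfer.valid());
    const std::vector<float> hostX = xTransfer.get();
    const float cpuForce = computeCpuForces(hostX);

    return { gpuForces.get(), cpuForce };
}
```

Calling doForceStep({ 1.0F, 2.0F, 3.0F }) runs one such step. Moving the
xTransfer.get() call above the second std::async would reproduce the
back-to-back launch-and-wait pattern, in which the force work cannot start
until the copy has finished.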