Feature #2816: Device-side update & constraints, buffer ops and multi-GPU comms
PME/PP GPU communications
When utilizing multiple GPUs with a dedicated PME GPU, data must be exchanged between the PME task and the PP tasks. The position buffer is gathered to the PME task from the PP tasks before the PME operation, and the force array is scattered from the PME task to the PP tasks after the operation. Currently, this is routed through the host CPUs, with PCIe transfers and MPI calls operating on data in CPU memory.
Instead, we can transfer data directly between GPU memory spaces using GPU peer-to-peer communication. Many modern MPI implementations can be built CUDA-aware, so that device pointers can be passed directly to MPI calls, and these support this.
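The gather/scatter pattern described above is the same whether the data is staged through the host or moved directly between GPUs. As a rough host-only sketch of the buffer layout (not the GROMACS implementation), the following assumes each PP rank owns one contiguous block of the PME-side buffer. The names `Vec3`, `blockOffsets`, `gatherPositions` and `scatterForces` are hypothetical and chosen for illustration only.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 3-vector type for this sketch; not a GROMACS type.
struct Vec3
{
    float x, y, z;
};

// Start offset of each PP rank's block in the PME-side buffer
// (exclusive prefix sum of the per-rank atom counts, plus a final total).
std::vector<std::size_t> blockOffsets(const std::vector<std::size_t>& atomsPerPpRank)
{
    std::vector<std::size_t> offsets(atomsPerPpRank.size() + 1, 0);
    for (std::size_t r = 0; r < atomsPerPpRank.size(); ++r)
    {
        offsets[r + 1] = offsets[r] + atomsPerPpRank[r];
    }
    return offsets;
}

// Gather: append each PP rank's positions to the PME buffer in rank order.
std::vector<Vec3> gatherPositions(const std::vector<std::vector<Vec3>>& ppPositions)
{
    std::vector<Vec3> pmeBuffer;
    for (const auto& block : ppPositions)
    {
        pmeBuffer.insert(pmeBuffer.end(), block.begin(), block.end());
    }
    return pmeBuffer;
}

// Scatter: return to each PP rank the block of forces belonging to its atoms.
std::vector<std::vector<Vec3>> scatterForces(const std::vector<Vec3>&        pmeForces,
                                             const std::vector<std::size_t>& offsets)
{
    std::vector<std::vector<Vec3>> perRank;
    for (std::size_t r = 0; r + 1 < offsets.size(); ++r)
    {
        const auto first = pmeForces.begin() + static_cast<std::ptrdiff_t>(offsets[r]);
        const auto last  = pmeForces.begin() + static_cast<std::ptrdiff_t>(offsets[r + 1]);
        perRank.emplace_back(first, last);
    }
    return perRank;
}
```

In a direct-communication version, each of the per-block copies would presumably become a device-to-device copy or an MPI call on device pointers, with the offsets computed in the same way.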
TODO: extend to support the case where PME is on the CPU and PP is on the GPU.
TODO: extend to the case where the force reduction is on the CPU and a PME rank uses a GPU.
PME/PP GPU Comms for position buffer
Activate with GMX_GPU_PME_PP_COMMS env variable
Performs gather of position buffer data from PP tasks to PME task with
transfers operating directly to/from GPU memory. Uses direct CUDA memory
copies when thread MPI is in use, otherwise CUDA-aware MPI.
Implements part of Feature #2891
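For illustration, an opt-in switch of this kind can be read as sketched below. This is a minimal sketch, not the code in the change: it assumes that the mere presence of the variable enables the feature, regardless of its value, and the helper names `isEnvFlagSet` and `gpuPmePpCommsRequested` are hypothetical.

```cpp
#include <cstdlib>

// Assumption for this sketch: a variable counts as "set" if it exists in the
// environment at all, whatever its value (including the empty string).
bool isEnvFlagSet(const char* name)
{
    return std::getenv(name) != nullptr;
}

// Hypothetical helper: true when the user has opted in to direct PME/PP GPU comms.
bool gpuPmePpCommsRequested()
{
    return isEnvFlagSet("GMX_GPU_PME_PP_COMMS");
}
```

With such a helper, a run would opt in by exporting `GMX_GPU_PME_PP_COMMS` in the shell before launching mdrun.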
#4 Updated by Szilárd Páll 6 months ago
- Category set to mdrun
Just had a look at the proposed change and I think we should perhaps take the time to discuss some implementation choices here. These apply to all direct GPU communication you are working on, so it may make sense to open a new issue where such general things are discussed? A few questions to kick off with:
- How do we provide fallbacks for when i) no MPI is used or ii) no CUDA-aware MPI is used?
- For the former, with tMPI I assume we can have a GPUDirect-based fallback.
- For the latter, how do we detect that we have a CUDA-aware MPI? What happens if we don't and the proposed code is invoked?
- As noted in CR, we should initiate the PP->PME send exactly at the same location where the CPU path does it; the coordinates are available there so there seems to be no reason to not unify the paths.
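One possible shape for the fallback logic raised in the questions above is sketched here. It is only a sketch under stated assumptions: the enums and `selectCommPath` are hypothetical names, and the policy of falling back to host staging whenever CUDA awareness cannot be confirmed is an assumption, not a decision taken in this issue. How the awareness value is obtained is left open. As one known example, Open MPI exposes the `MPIX_CUDA_AWARE_SUPPORT` macro and `MPIX_Query_cuda_support()` in `mpi-ext.h`, but other MPI libraries may offer no equivalent, which is why an `Unknown` state is modelled.

```cpp
// Hypothetical types for this sketch; none of these exist in GROMACS.
enum class MpiFlavor
{
    ThreadMpi, // built-in thread-MPI: ranks share one process
    LibraryMpi // external MPI library
};

enum class CudaAwareness
{
    Yes,
    No,
    Unknown // the MPI library offers no way to query CUDA support
};

enum class CommPath
{
    HostStaged,     // existing path: copy to CPU memory, then communicate
    DirectCudaCopy, // GPU-to-GPU copy within one process
    CudaAwareMpi    // pass device pointers to MPI calls
};

// Choose the communication path conservatively: the direct paths are used
// only when requested and known to be safe; everything else stays on the host path.
CommPath selectCommPath(bool requested, MpiFlavor mpi, CudaAwareness awareness)
{
    if (!requested)
    {
        return CommPath::HostStaged;
    }
    if (mpi == MpiFlavor::ThreadMpi)
    {
        // No MPI library involved, so its CUDA awareness is irrelevant.
        return CommPath::DirectCudaCopy;
    }
    // Library MPI: never hand device pointers to an MPI that may not accept them.
    return (awareness == CudaAwareness::Yes) ? CommPath::CudaAwareMpi : CommPath::HostStaged;
}
```

Treating `Unknown` like `No` trades some performance for safety, since passing device pointers to an MPI library that is not CUDA-aware would typically fail at run time rather than at build time.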