Device-side update & constraints, buffer ops and multi-GPU comms
GROMACS performance is sub-optimal on modern GPU servers.
When running on a single GPU, all force calculations are now done on the device, but the buffer operations and update & constraints are still done on the host, so repeated PCIe transfers are required. This CPU computation and PCIe communication make up an increasingly significant overhead as GPU performance continues to increase with each generation.
With multiple GPUs the situation is even worse, because the required multi-GPU communications are routed through the CPU.
NVIDIA have developed prototype code with all compute and communication parts now device-side, with coordinate and force PCIe transfers removed for regular timesteps. Gerrit patch 8506 introduces device-side buffer ops, and patch 8859 (based on the buffer ops patch) demonstrates the remainder of the new developments:
- GPU Update and Constraints
- Device MPI: PME/PP Gather and Scatter
  - Relatively straightforward solution using CUDA-aware MPI
- Device MPI: PP local/nonlocal exchanges
  - New functionality to pack device buffers and exchange them using CUDA-aware MPI
  - Similar D2D exchanges also for the LINCS part of constraints
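
To make the "pack device buffers and exchange" item concrete, here is a minimal host-side C++ sketch of the pack and unpack steps around a halo exchange. It is not the code from patch 8859: the names `Vec3`, `packSendBuffer`, `unpackRecvBuffer` and `indexMap` are hypothetical, and the loops run on the CPU purely for illustration. In a device-side version each loop would presumably be a CUDA kernel operating on device memory, with the packed buffer handed directly to a CUDA-aware MPI call instead of being copied to the host first.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical 3-component coordinate type (illustration only, not GROMACS's rvec).
using Vec3 = std::array<float, 3>;

// Gather the coordinates listed in indexMap into a contiguous send buffer.
// Assumption: indexMap holds the indices of home atoms a neighbouring rank needs.
// On a GPU, each loop iteration would correspond to one thread.
std::vector<Vec3> packSendBuffer(const std::vector<Vec3>&        coords,
                                 const std::vector<std::size_t>& indexMap)
{
    std::vector<Vec3> sendBuf(indexMap.size());
    for (std::size_t i = 0; i < indexMap.size(); ++i)
    {
        sendBuf[i] = coords[indexMap[i]];
    }
    return sendBuf;
}

// Store received halo coordinates directly after the first numHome entries.
// Assumption: non-local atoms are kept contiguously behind the home atoms.
void unpackRecvBuffer(std::vector<Vec3>&       coords,
                      std::size_t              numHome,
                      const std::vector<Vec3>& recvBuf)
{
    coords.resize(numHome + recvBuf.size());
    for (std::size_t i = 0; i < recvBuf.size(); ++i)
    {
        coords[numHome + i] = recvBuf[i];
    }
}
```

Between the two calls, the send buffer would be exchanged with the neighbouring rank. Keeping the packed data contiguous is what allows a single MPI call per neighbour, and doing the packing on the device is what removes the host round trip.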
See the attached slides for more info.
These developments show major performance improvements but are still in prototype form. The purpose of this issue is to track the work required to integrate them properly into the master branch.