Task #2965: Performance of GPU direct communications
Parent task #3370: Further improvements to GPU Buffer Ops and Comms
Description
This issue tracks testing and evaluation of the usability and performance of the CUDA-aware MPI and direct-copy implementations of multi-GPU communications (when they work, when they do not, when they are faster, etc.).
History
#1 Updated by Alan Gray almost 2 years ago
- Target version set to 2020
#2 Updated by Alan Gray over 1 year ago
- File Capture.JPG added
Latest performance results for new features on 4-GPU servers:
All results are in ns/day.
Volta NVLink: 4xV100-SXM2+2xBroadwell
Volta PCIe: 4xV100-PCIe+2xHaswell
Pascal PCIe: 4xP100-PCIe+2xHaswell
STMV: 1,066,628 atoms
Cellulose: 408,609 atoms
ADH: 95,561 atoms
Code version: https://gerrit.gromacs.org/c/gromacs/+/14402 (with the debug print statement commented out).
Settings per configuration (see the run-script sketch below):
[export GMX_USE_GPU_BUFFER_OPS=1] (for all configurations except "Default")
[export GMX_GPU_DD_COMMS=1] (for "Halo")
[export GMX_GPU_PME_PP_COMMS=1] (for "PME-PP")
gmx mdrun -s topol.tpr -ntomp $OMP_NUM_THREADS -pme gpu -nb gpu -ntmpi 4 -npme 1 -nsteps 10000 -resethway -v -notunepme -pin on -bonded gpu -noconfout -gpu_id 0123 -nstlist 200 \
[-update gpu] (for "Update")
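For reference, a minimal shell sketch of how the configurations above can be cycled through by toggling the environment variables. The environment variables, mdrun options, and configuration names are taken from this comment; the script structure, the run_config helper, and the per-configuration log names are assumptions for illustration, not necessarily the script used to produce the attached results.

#!/bin/bash
# Sketch only: runs each configuration from this comment by toggling the
# GMX_* environment variables. Assumes OMP_NUM_THREADS is set, as in the
# original command line.

# mdrun options common to all configurations (taken from the command above).
mdrun_args=(-s topol.tpr -ntomp "$OMP_NUM_THREADS" -pme gpu -nb gpu -ntmpi 4 -npme 1
            -nsteps 10000 -resethway -v -notunepme -pin on -bonded gpu -noconfout
            -gpu_id 0123 -nstlist 200)

# Helper (hypothetical): run one configuration, writing its log to <name>.log.
run_config () {
    local name=$1; shift
    echo "=== Running configuration: $name ==="
    gmx mdrun "${mdrun_args[@]}" "$@" -g "${name}.log"
}

# "Default": no GPU buffer ops or direct GPU communication.
unset GMX_USE_GPU_BUFFER_OPS GMX_GPU_DD_COMMS GMX_GPU_PME_PP_COMMS
run_config default

# GPU buffer ops (enabled for everything except "Default").
export GMX_USE_GPU_BUFFER_OPS=1
run_config bufferops

# "Halo": additionally enable direct GPU halo exchange.
export GMX_GPU_DD_COMMS=1
run_config halo

# "PME-PP": additionally enable direct GPU PME-PP communication.
export GMX_GPU_PME_PP_COMMS=1
run_config pme_pp

# "Update": additionally run the update/constraints step on the GPU.
run_config update -update gpu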
#3 Updated by Paul Bauer about 1 year ago
- Target version changed from 2020 to 2021
#4 Updated by Alan Gray about 1 year ago
- Status changed from New to In Progress
- Parent task changed from #2915 to #3370