This issue is to track testing and evaluation of the usability and performance of the CUDA-aware MPI and direct-copy implementations of multi-GPU communication (when it works, when it does not, when it is faster, etc.).
#2 Updated by Alan Gray about 2 months ago
Latest performance results for new features on 4-GPU servers:
All results in ns/day.
Volta NVLink: 4xV100-SXM2+2xBroadwell
Volta PCIe: 4xV100-PCIe+2xHaswell
Pascal PCIe: 4xP100-PCIe+2xHaswell
STMV: 1,066,628 atoms
Cellulose: 408,609 atoms
ADH: 95,561 atoms
Code version: https://gerrit.gromacs.org/c/gromacs/+/14402 (with the debug print statement commented out).
[export GMX_USE_GPU_BUFFER_OPS=1] (for all except "Default")
[export GMX_GPU_DD_COMMS=1] (for "Halo")
[export GMX_GPU_PME_PP_COMMS=1] (for "PME-PP")
gmx mdrun -s topol.tpr -ntomp $OMP_NUM_THREADS -pme gpu -nb gpu -ntmpi 4 -npme 1 -nsteps 10000 -resethway -v -notunepme -pin on -bonded gpu -noconfout -gpu_id 0123 -nstlist 200 \
[-update gpu] (for "Update")
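The variant setup above can be sketched as a small launcher script. This is an illustrative sketch, not part of the original post: the variant names ("Default", "Halo", "PME-PP", "Update"), the `run_variant` helper, and the assumption that buffer ops are enabled for every non-default variant are taken from the bracketed notes above; the fallback `OMP_NUM_THREADS` value is hypothetical. The `echo` stands in for actually executing `gmx mdrun`.

```shell
#!/bin/sh
# Sketch: map each benchmark variant to its environment variables and
# extra mdrun flags, per the notes above. Prints the command instead of
# running it, so the script is safe to inspect without GROMACS installed.

: "${OMP_NUM_THREADS:=10}"   # hypothetical default thread count

run_variant() {
    variant="$1"
    env_vars=""
    extra_flags=""
    case "$variant" in
        Default)                                           # no env vars set
            ;;
        BufferOps)
            env_vars="GMX_USE_GPU_BUFFER_OPS=1" ;;
        Halo)                                              # GPU halo exchange
            env_vars="GMX_USE_GPU_BUFFER_OPS=1 GMX_GPU_DD_COMMS=1" ;;
        PME-PP)                                            # GPU PME-PP comms
            env_vars="GMX_USE_GPU_BUFFER_OPS=1 GMX_GPU_PME_PP_COMMS=1" ;;
        Update)                                            # update on GPU
            env_vars="GMX_USE_GPU_BUFFER_OPS=1"
            extra_flags="-update gpu" ;;
    esac
    echo "env $env_vars gmx mdrun -s topol.tpr -ntomp $OMP_NUM_THREADS" \
         "-pme gpu -nb gpu -ntmpi 4 -npme 1 -nsteps 10000 -resethway -v" \
         "-notunepme -pin on -bonded gpu -noconfout -gpu_id 0123" \
         "-nstlist 200 $extra_flags"
}

for v in Default BufferOps Halo PME-PP Update; do
    run_variant "$v"
done
```

Keeping the environment variables on the command line via `env` (rather than `export`) keeps each variant's settings from leaking into the next run in the loop.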