Feature #2891

Feature #2816: Device-side update & constraints, buffer ops and multi-GPU comms

PME/PP GPU communications

Added by Alan Gray 4 months ago. Updated 4 months ago.

Status: New
Priority: Normal
Assignee: -
Category: mdrun
Target version: -
Difficulty: uncategorized

Description

When utilizing multiple GPUs with a dedicated PME GPU, data must be exchanged between the PME task and the PP tasks. The position buffer is gathered to the PME task from the PP task before the PME operation, and the force array is scattered from the PME task to the PP tasks after the operation. Currently, this is routed through the host CPUs, with PCIe transfers and MPI calls operating on data in CPU memory.

Instead, we can transfer data directly between GPU memory spaces using GPU peer-to-peer communication. Modern MPI implementations are CUDA-aware and support this.
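To make the intended change concrete, here is a minimal sketch (not GROMACS code; the function, buffer, and macro names such as sendCoordsToPme and HAVE_CUDA_AWARE_MPI are made up for illustration) contrasting a CUDA-aware MPI send of the coordinate buffer with the current host-staged path:

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <vector>

    // Send the PP rank's coordinate buffer to the PME rank.
    // d_coords lives in GPU memory on the PP rank's device.
    void sendCoordsToPme(const float3* d_coords,
                         int           numAtoms,
                         int           pmeRank,
                         MPI_Comm      comm)
    {
        const int count = numAtoms * 3; // three floats per atom

    #if defined(HAVE_CUDA_AWARE_MPI)
        // CUDA-aware MPI: hand the device pointer straight to MPI; the library
        // moves the data GPU-to-GPU (peer-to-peer / GPUDirect RDMA) without
        // staging it in host memory.
        MPI_Send(d_coords, count, MPI_FLOAT, pmeRank, 0, comm);
    #else
        // Current behaviour: stage through the host over PCIe, then send
        // from CPU memory.
        std::vector<float> h_coords(count);
        cudaMemcpy(h_coords.data(), d_coords, count * sizeof(float),
                   cudaMemcpyDeviceToHost);
        MPI_Send(h_coords.data(), count, MPI_FLOAT, pmeRank, 0, comm);
    #endif
    }

The force scatter after the PME operation would be the mirror image: a receive into a device-resident force buffer on each PP rank instead of into host memory.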


Related issues

Related to GROMACS - Feature #2915: GPU direct communications (New)

History

#1 Updated by Alan Gray 4 months ago

  • Description updated (diff)

#2 Updated by Alan Gray 4 months ago

Awaiting merge of buffer ops patch.

#3 Updated by Gerrit Code Review Bot 4 months ago

Gerrit received a related patchset '1' for Issue #2891.
Uploader: Alan Gray ()
Change-Id: gromacs~master~If6222eccfe30099beeb25a64cceb318d0a3b1dbc
Gerrit URL: https://gerrit.gromacs.org/9385

#4 Updated by Szilárd Páll 4 months ago

  • Category set to mdrun

Just had a look at the proposed change and I think we should perhaps take the time to discuss some implementation choices here. These apply to all of the direct GPU communication work you are doing, so it may make sense to open a new issue where such general questions can be discussed?

A few questions to kick off with:
  • How do we provide fallbacks for the cases where i) no MPI is used, or ii) no CUDA-aware MPI is used?
    - For the former, with tMPI I assume we can have a GPUDirect-based fallback.
    - For the latter, how do we detect that we have a CUDA-aware MPI (one possible detection approach is sketched after this list)? What happens if we don't and the proposed code is invoked?
  • As noted in CR, we should initiate the PP->PME send exactly at the same location where the CPU path does it; the coordinates are available there so there seems to be no reason to not unify the paths.
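One possible, Open MPI-specific way to detect a CUDA-aware MPI at build and run time is sketched below. Open MPI (>= 2.0) exposes a compile-time macro and a runtime query via mpi-ext.h; other MPI implementations do not necessarily provide an equivalent, so a GROMACS build option and/or runtime fallback would still be needed as the general mechanism. The helper name is hypothetical.

    #include <mpi.h>
    #if defined(OPEN_MPI)
    #include <mpi-ext.h>   // provides MPIX_CUDA_AWARE_SUPPORT / MPIX_Query_cuda_support()
    #endif

    // Returns true only if the MPI library reports CUDA-aware support.
    static bool haveCudaAwareMpi()
    {
    #if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
        // Built with CUDA support; also confirm it is enabled at runtime.
        return MPIX_Query_cuda_support() == 1;
    #else
        // Not Open MPI, or built without CUDA support: take the host-staged
        // fallback path rather than passing device pointers to MPI.
        return false;
    #endif
    }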

#5 Updated by Alan Gray 4 months ago

Szilárd Páll wrote:
> As noted in CR, we should initiate the PP->PME send exactly at the same location where the CPU path does it; the coordinates are available there so there seems to be no reason to not unify the paths.

Yes, agreed.

Moving the other questions to new issue #2915.

#6 Updated by Szilárd Páll 4 months ago
