Feature #2891

Feature #2816: GPU offload / optimization for update & constraints, buffer ops and multi-GPU communication

PME/PP GPU communications

Added by Alan Gray 7 months ago. Updated 2 days ago.

Status: New
Priority: High
Assignee: -
Category: mdrun
Target version: -
Difficulty: uncategorized

Description

When utilizing multiple GPUs with a dedicated PME GPU, data must be exchanged between the PME task and the PP tasks. The position buffer is gathered to the PME task from the PP tasks before the PME operation, and the force array is scattered from the PME task to the PP tasks after the operation. Currently, this exchange is routed through the host CPUs, with PCIe transfers and MPI calls operating on data in CPU memory.

Instead, we can transfer data directly between GPU memory spaces using GPU peer-to-peer communication. Many modern MPI implementations are CUDA-aware and support this directly.

TODO: extend to support the case where PME runs on the CPU and PP runs on the GPU.
TODO: extend to the case where the force reduction is on the CPU and a separate PME rank uses the GPU.
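
To illustrate the intent, a minimal sketch of the two paths follows (function and buffer names are assumptions for illustration, not the GROMACS API):

    #include <cuda_runtime.h>
    #include <mpi.h>

    // Current path: coordinates are copied GPU -> CPU, then sent from host memory.
    void sendCoordinatesViaHost(const float* d_x, float* h_x, int numAtoms,
                                int pmeRank, MPI_Comm comm)
    {
        cudaMemcpy(h_x, d_x, 3 * numAtoms * sizeof(float), cudaMemcpyDeviceToHost);
        MPI_Send(h_x, 3 * numAtoms, MPI_FLOAT, pmeRank, 0, comm);
    }

    // Proposed path: a CUDA-aware MPI accepts the device pointer directly, so
    // the transfer can go peer-to-peer (or via GPUDirect RDMA) with no host copy.
    void sendCoordinatesDirect(const float* d_x, int numAtoms,
                               int pmeRank, MPI_Comm comm)
    {
        MPI_Send(d_x, 3 * numAtoms, MPI_FLOAT, pmeRank, 0, comm);
    }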


Subtasks

Task #3077: PME/PP GPU Comms unique pointer deletion causes seg fault when CUDA calls exist in destructor (New)
Task #3105: implement GPU PME/PP comm cycle counting (New)
Task #3157: separate PME x receive sync (New, Alan Gray)
Task #3158: use MPI_Isend() in sendFToPpCudaDirect and receiveCoordinatesFromPpCudaDirect (New, Alan Gray)
Task #3159: eliminate regression due to moving gmx_pme_send_coordinates() (New, Alan Gray)
Feature #3160: implement direct comm for different src/target memory spaces (New, Alan Gray)
Bug #3164: mdrun-mpi-test with separate PME ranks and PP-PME CPU comm crashes (New)

Related issues

Related to GROMACS - Feature #2915: GPU direct communications (New)
Related to GROMACS - Feature #3087: enable GPU peer to peer access (New)
Related to GROMACS - Bug #3165: task assignment silent abort (New)

Associated revisions

Revision 4dd80128 (diff)
Added by Alan Gray 2 months ago

PME/PP GPU Comms for position buffer

Activate with GMX_GPU_PME_PP_COMMS env variable

Performs gather of position buffer data from PP tasks to PME task with
transfers operating directly to/from GPU memory. Uses direct CUDA memory
copies when thread MPI is in use, otherwise CUDA-aware MPI.

Implements part of Feature #2891

Change-Id: If6222eccfe30099beeb25a64cceb318d0a3b1dbc
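
A rough sketch of the thread-MPI branch described above (all names are assumptions; the real change is activated with GMX_GPU_PME_PP_COMMS): with thread MPI the PP and PME ranks are threads of a single process, so the PME task can copy the PP task's device buffer directly, provided peer access between the two GPUs has been enabled (see #3087):

    #include <cuda_runtime.h>

    // Enable peer access once during setup so device-to-device copies between
    // the PP GPU and the PME GPU do not silently fall back to staging through
    // host memory.
    void enablePeerAccess(int pmeDevice, int ppDevice)
    {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, pmeDevice, ppDevice);
        if (canAccess)
        {
            cudaSetDevice(pmeDevice);
            cudaDeviceEnablePeerAccess(ppDevice, 0);
        }
    }

    // Thread-MPI path: pull the PP task's coordinates straight into the PME
    // task's buffer with an asynchronous device-to-device copy.
    void gatherCoordinatesDirect(float* d_xPme, const float* d_xPp, int numAtoms,
                                 cudaStream_t pmeStream)
    {
        cudaMemcpyAsync(d_xPme, d_xPp, 3 * numAtoms * sizeof(float),
                        cudaMemcpyDeviceToDevice, pmeStream);
    }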

Revision ec0aa356 (diff)
Added by Alan Gray 4 days ago

PME/PP GPU Pimpl Class and GPU->CPU Communication for force buffer

Activate with GMX_GPU_PME_PP_COMMS env variable

Implements new pimpl class for PME-PP GPU communications. Performs
scatter of force buffer data from PME task GPU buffer to PP task CPU
buffers directly using CUDA memory copies. Requires thread MPI to be
in use.

Implements part of #2891

Change-Id: I0181ff67065c75f20cddc361f695df9bf888cd88
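
A sketch of the GPU->CPU force path this revision describes (illustrative names only, not the actual pimpl class): the PME task copies each PP rank's slice of the force buffer from GPU memory into that rank's CPU buffer, and a synchronization guards the PP task's subsequent force reduction:

    #include <cuda_runtime.h>

    // Scatter one PP rank's portion of the PME force buffer (resident on the
    // PME GPU) into that rank's CPU force buffer. atomOffset/numAtoms describe
    // the slice owned by the receiving PP rank.
    void sendForcesToPpCpu(const float* d_fPme, float* h_fPp, int atomOffset,
                           int numAtoms, cudaStream_t pmeStream)
    {
        cudaMemcpyAsync(h_fPp, d_fPme + 3 * atomOffset,
                        3 * numAtoms * sizeof(float), cudaMemcpyDeviceToHost,
                        pmeStream);
        // The PP task must not read h_fPp before the copy has completed; in a
        // real implementation this would be an event or sync the PP side waits on.
        cudaStreamSynchronize(pmeStream);
    }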

History

#1 Updated by Alan Gray 7 months ago

  • Description updated (diff)

#2 Updated by Alan Gray 7 months ago

Awaiting merge of buffer ops patch.

#3 Updated by Gerrit Code Review Bot 7 months ago

Gerrit received a related patchset '1' for Issue #2891.
Uploader: Alan Gray
Change-Id: gromacs~master~If6222eccfe30099beeb25a64cceb318d0a3b1dbc
Gerrit URL: https://gerrit.gromacs.org/9385

#4 Updated by Szilárd Páll 7 months ago

  • Category set to mdrun

Just had a look at the proposed change and I think we should take the time to discuss some implementation choices here. These apply to all of the direct GPU communication work you are doing, so it may make sense to open a new issue where such general points are discussed?

A few questions to kick off with:
  • How do we provide fallbacks for when i) no MPI is used, and ii) no CUDA-aware MPI is used?
    - For the former, with tMPI I assume we can have a GPUDirect-based fallback.
    - For the latter, how do we detect that we have a CUDA-aware MPI? What happens if we don't and the proposed code is invoked? (One possible detection approach is sketched after this list.)
  • As noted in CR, we should initiate the PP->PME send exactly at the same location where the CPU path does it; the coordinates are available there so there seems to be no reason to not unify the paths.
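
Regarding the detection question above, one possibility (an Open MPI-specific extension, shown only as an illustration of the kind of check that is available, not necessarily what GROMACS should adopt) is the query exposed through mpi-ext.h; a negative or unknown answer would have to select the host-staged fallback or rely on an explicit user flag:

    #include <mpi.h>
    #if defined(OPEN_MPI)
    #include <mpi-ext.h>  // Open MPI extension header defining MPIX_CUDA_AWARE_SUPPORT
    #endif

    // Returns true if the MPI library advertises CUDA-aware support at both
    // compile time and run time; otherwise the code should not pass device
    // pointers to MPI and must use the CPU-staged path instead.
    static bool mpiIsCudaAware()
    {
    #if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
        return MPIX_Query_cuda_support() == 1;
    #else
        return false;
    #endif
    }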

#5 Updated by Alan Gray 7 months ago

> As noted in CR, we should initiate the PP->PME send exactly at the same location where the CPU path does it; the coordinates are available there so there seems to be no reason to not unify the paths.

Yes, agreed.

Moving the other questions to new issue #2915.

#6 Updated by Szilárd Páll 7 months ago

#7 Updated by Szilárd Páll about 1 month ago

#8 Updated by Alan Gray 30 days ago

  • Description updated (diff)

#9 Updated by Szilárd Páll 30 days ago

> TODO extend to support case where PME is on CPU and PP is on GPU.

Similar functionality would be needed if we want to support PME mixed mode on a separate PME rank.

#10 Updated by Szilárd Páll 10 days ago

  • Description updated (diff)

#11 Updated by Szilárd Páll 4 days ago

The code implementing this feature was broken by:
https://gerrit.gromacs.org/c/gromacs/+/13437
This needs fixing before we can use unit tests to run this feature at all.

#12 Updated by Artem Zhmurov 3 days ago

Szilárd Páll wrote:

> The code implementing this feature was broken by:
> https://gerrit.gromacs.org/c/gromacs/+/13437
> This needs fixing before we can use unit tests to run this feature at all.

It was revealed by the aforementioned change to the tests, not broken by it.

#13 Updated by Szilárd Páll 2 days ago

  • Description updated (diff)

#14 Updated by Szilárd Páll about 18 hours ago

  • Related to Bug #3165: task assignment silent abort added
