
Feature #2816: GPU offload / optimization for update & constraints, buffer ops and multi-GPU communication

Feature #2891: PME/PP GPU communications

implement direct comm for different src/target memory spaces

Added by Szilárd Páll about 1 month ago. Updated 9 days ago.

Status: In Progress
Priority: High
Assignee:
Category: mdrun
Target version:
Difficulty: uncategorized

Description

Instead of always copying data to the GPU and doing GPU<->GPU direct communication, implement the use cases where data resides on the CPU and is transferred as a direct CPU->GPU or GPU->CPU copy; specifically:
  • extend to support the case where PME runs on the CPU and PP runs on the GPU;
  • extend to the case where the force reduction is on the CPU and a PME rank uses the GPU.

This would simplify the code and likely improve performance as well.
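As a rough illustration of what "direct comm for different src/target memory spaces" could look like, the sketch below uses plain CUDA with hypothetical names (MemorySpace, copyKind); it is not the GROMACS PmePpCommGpu API. The point is that the copy kind is picked from where the source and destination buffers live, so e.g. GPU PP + CPU PME becomes a single device->host transfer instead of staging through the GPU:

    #include <cuda_runtime.h>

    // Hypothetical helper, for illustration only: pick the cudaMemcpy kind from
    // where the source and destination buffers reside.
    enum class MemorySpace { Host, Device };

    static cudaMemcpyKind copyKind(MemorySpace src, MemorySpace dst)
    {
        if (src == MemorySpace::Host && dst == MemorySpace::Device) { return cudaMemcpyHostToDevice; }
        if (src == MemorySpace::Device && dst == MemorySpace::Host) { return cudaMemcpyDeviceToHost; }
        if (src == MemorySpace::Device && dst == MemorySpace::Device) { return cudaMemcpyDeviceToDevice; }
        return cudaMemcpyHostToHost;
    }

    // E.g. with PME on the CPU and PP on the GPU, PP coordinates would be sent to
    // the PME rank's host-side buffer with a single device->host copy:
    //   cudaMemcpyAsync(h_xPme, d_xPp, numAtoms * sizeof(float3),
    //                   copyKind(MemorySpace::Device, MemorySpace::Host), stream);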

Associated revisions

Revision 6ecc09b0 (diff)
Added by Artem Zhmurov 23 days ago

Add haveCpuLocalForces flag to DomainLifetimeWorkload

The flag haveCpuLocalForces is moved to DomainLifetimeWorkload.
Its name and meaning are also slightly changed: haveCpuLocalForceWork
now indicates whether any forces in the local domain are computed
on the CPU.
Consequently, the now-redundant PME force reduction conditional has
been removed from the haveCpuLocalForceWork initialization. The
relationship between local force work and the existence of local forces
on the CPU still needs to be clarified in relation to #3160 and how
the PP-PME direct communication is improved/simplified.

Change-Id: Idaed4da41bf5b72435c47bab8aeb7130b36b03c7
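For orientation, here is a minimal sketch of what the moved flag expresses; everything except the name haveCpuLocalForceWork (the struct layout, helper name, and parameters) is illustrative and not the exact GROMACS code:

    struct DomainLifetimeWorkload
    {
        //! Whether any force contribution for the local domain is computed on the CPU
        //! (e.g. listed/bonded forces, special forces, CPU non-bonded work).
        bool haveCpuLocalForceWork = false;
    };

    // Illustrative initialization: the PME force-reduction conditional is no longer
    // folded into this flag; whether PME forces arrive on the CPU is tracked separately.
    void setupDomainWorkload(DomainLifetimeWorkload* domainWork,
                             bool                    haveCpuListedForces,
                             bool                    haveCpuNonbondedForces,
                             bool                    haveSpecialForces)
    {
        domainWork->haveCpuLocalForceWork =
                haveCpuListedForces || haveCpuNonbondedForces || haveSpecialForces;
    }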

History

#1 Updated by Szilárd Páll 29 days ago

  • Priority changed from Normal to High

#2 Updated by Alan Gray 19 days ago

  • Description updated (diff)

extend to support the case where PME runs on the CPU and PP runs on the GPU.

I have code offline that almost works; it is just failing one local free energy test, which I'm looking into.

extend to the case where the force reduction is on the CPU and a PME rank uses the GPU.

This already works in the master branch and is used for the search and virial steps.

#3 Updated by Alan Gray 19 days ago

  • Status changed from New to In Progress
  • Target version changed from 2020 to 2020-beta3

#4 Updated by Alan Gray 19 days ago

extend to support the case where PME runs on the CPU and PP runs on the GPU.

I have code offline that almost works; it is just failing one local free energy test, which I'm looking into.

Now passing tests locally with -nt 2 -npme 1, but I have hit more issues with -nt 4 -npme 1. It's really tricky to get this working in all cases. E.g. for nbnxn-ljpme-LB it hangs, because on each PP task we have

        // on each PP rank, when coordinates are to be sent to the PME rank
        if (flags & PP_PME_COORD)
        {
            if (reinitGpuPmePpComms)
            {
                fr->pmePpCommGpu->reinit(n);
            }
        }

but for this test case, for 1 out of the 3 PP tasks (flags & PP_PME_COORD) is false while for the other 2 it is true (so the false case causes a hang when the PME task tries to communicate with that rank). Why would that be the case?
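A stripped-down illustration of the hang pattern (plain MPI with hypothetical function names, not the actual PmePpCommGpu::reinit implementation): the PME rank posts one blocking receive per PP rank, so if a single PP rank never issues the matching send during reinit, the PME rank blocks forever:

    #include <mpi.h>
    #include <vector>

    // PME-rank side: expects one reinit message from every PP rank.
    void pmeSideReinit(const std::vector<int>& ppRanks, MPI_Comm comm)
    {
        for (int ppRank : ppRanks)
        {
            int numAtomsFromPp = 0;
            // Blocks until the corresponding PP rank sends; if that rank skipped its
            // reinit (e.g. its flags lacked PP_PME_COORD), this never returns.
            MPI_Recv(&numAtomsFromPp, 1, MPI_INT, ppRank, 0, comm, MPI_STATUS_IGNORE);
        }
    }

    // PP-rank side: the matching send issued from each PP rank during reinit.
    void ppSideReinit(int n, int pmeRank, MPI_Comm comm)
    {
        MPI_Send(&n, 1, MPI_INT, pmeRank, 0, comm);
    }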

#5 Updated by Szilárd Páll 18 days ago

Do you mean that one of the ranks never calls gmx_pme_send_coordinates()? I don't see where in the code such a conditionality could arise.

#7 Updated by Alan Gray 9 days ago

Sorry, what I said above wasn't quite right. One PP task wasn't taking the path, not because of the pasted condition but because of another if statement: the number of atoms to send (n) was equal to zero for that task. I refactored to fix the issue. The proposed change supporting PME on the CPU is now at
https://gerrit.gromacs.org/c/gromacs/+/14223
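One way to keep the PP-side and PME-side calls paired is to key the reinit on whether GPU PME-PP communication is in use rather than on the atom count, so a rank with n == 0 still participates. This is only a hedged sketch of that idea (useGpuPmePpComms is an assumed flag name), not necessarily what the linked Gerrit change does:

    // Sketch only: reinit whenever GPU PME-PP communication is enabled and a
    // reinit is due, independently of the current atom count on this PP rank.
    if (useGpuPmePpComms && reinitGpuPmePpComms)
    {
        fr->pmePpCommGpu->reinit(n); // n may be 0; the PME-side receive still gets matched
    }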
