Task #2677

Task #2675: bonded CUDA offload task

bonded task scheduling

Added by Szilárd Páll 9 months ago. Updated 6 months ago.

Status:
Closed
Priority:
Normal
Category:
mdrun
Target version:
Difficulty:
hard

Description

The bonded kernels in the first version should be scheduled after the nonlocal coordinates arrive on the GPU; the task should make sure to wait for both local and nonlocal coordinates to be ready.

Note that the nbnxn force array transfer will need a stream dependency on the bonded kernels' completion (conditional on whether the bondeds are offloaded).
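
The dependencies described above could be expressed with CUDA events roughly as follows. This is only a minimal sketch, not the GROMACS implementation: the kernels `bondedSketch` and `nonbondedLocalSketch`, the three stream names, and the flag `bondedsOffloaded` are hypothetical stand-ins, the nonlocal nonbonded kernel is left out, and a single force transfer stands in for the nbnxn force array copy.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                             \
    do {                                                                        \
        cudaError_t err_ = (call);                                              \
        if (err_ != cudaSuccess) {                                              \
            std::fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(err_));  \
            std::exit(EXIT_FAILURE);                                            \
        }                                                                       \
    } while (0)

// Hypothetical stand-in for the bonded kernels: reads local AND nonlocal atoms.
__global__ void bondedSketch(const float4* xq, float* f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { atomicAdd(&f[i], xq[i].x); }
}

// Hypothetical stand-in for the local nonbonded kernel: reads local atoms only.
__global__ void nonbondedLocalSketch(const float4* xq, float* f, int nLocal)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nLocal) { atomicAdd(&f[i], xq[i].w); }
}

int main()
{
    const int  n = 1024, nLocal = 768;
    const bool bondedsOffloaded = true; // assumption: run-time choice

    float4 *hXq, *dXq;
    float  *hF, *dF;
    CHECK(cudaMallocHost(&hXq, n * sizeof(float4))); // pinned, so copies are async
    CHECK(cudaMallocHost(&hF, n * sizeof(float)));
    CHECK(cudaMalloc(&dXq, n * sizeof(float4)));
    CHECK(cudaMalloc(&dF, n * sizeof(float)));
    for (int i = 0; i < n; i++) { hXq[i] = make_float4(float(i), 0.f, 0.f, 1.f); }

    cudaStream_t localStream, nonlocalStream, bondedStream;
    CHECK(cudaStreamCreate(&localStream));
    CHECK(cudaStreamCreate(&nonlocalStream));
    CHECK(cudaStreamCreate(&bondedStream));
    cudaEvent_t localXqReady, nonlocalXqReady, bondedDone;
    CHECK(cudaEventCreateWithFlags(&localXqReady, cudaEventDisableTiming));
    CHECK(cudaEventCreateWithFlags(&nonlocalXqReady, cudaEventDisableTiming));
    CHECK(cudaEventCreateWithFlags(&bondedDone, cudaEventDisableTiming));

    // Local x+q copy (preceded by clearing the force buffer), then mark it ready.
    CHECK(cudaMemsetAsync(dF, 0, n * sizeof(float), localStream));
    CHECK(cudaMemcpyAsync(dXq, hXq, nLocal * sizeof(float4),
                          cudaMemcpyHostToDevice, localStream));
    CHECK(cudaEventRecord(localXqReady, localStream));

    // Nonlocal x+q copy in its own stream, then mark it ready.
    CHECK(cudaMemcpyAsync(dXq + nLocal, hXq + nLocal, (n - nLocal) * sizeof(float4),
                          cudaMemcpyHostToDevice, nonlocalStream));
    CHECK(cudaEventRecord(nonlocalXqReady, nonlocalStream));

    if (bondedsOffloaded)
    {
        // Bondeds wait for BOTH the local and the nonlocal coordinates.
        CHECK(cudaStreamWaitEvent(bondedStream, localXqReady, 0));
        CHECK(cudaStreamWaitEvent(bondedStream, nonlocalXqReady, 0));
        bondedSketch<<<(n + 127) / 128, 128, 0, bondedStream>>>(dXq, dF, n);
        CHECK(cudaGetLastError());
        CHECK(cudaEventRecord(bondedDone, bondedStream));
    }

    nonbondedLocalSketch<<<(nLocal + 127) / 128, 128, 0, localStream>>>(dXq, dF, nLocal);
    CHECK(cudaGetLastError());

    // The force transfer depends on bonded completion only if bondeds are offloaded.
    if (bondedsOffloaded) { CHECK(cudaStreamWaitEvent(localStream, bondedDone, 0)); }
    CHECK(cudaMemcpyAsync(hF, dF, n * sizeof(float), cudaMemcpyDeviceToHost, localStream));
    CHECK(cudaStreamSynchronize(localStream));

    for (int i = 0; i < n; i++)
    {
        const float expected = float(i) + (i < nLocal ? 1.f : 0.f);
        if (hF[i] != expected) { std::printf("mismatch at atom %d\n", i); return 1; }
    }
    std::printf("forces include both bonded and nonbonded contributions\n");
    return 0;
}
```

Without the `cudaStreamWaitEvent(localStream, bondedDone, 0)` call, the device-to-host copy could start before the bonded contributions have been accumulated, which is the dependency the note above refers to.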

Kernel priority TBD:
- with DD: high prio because there is no local/non-local split and dd_move_f() depends on all bondeds
- w/o DD: (likely also) high prio given that otherwise it might get preempted by PME and the chain of kernels (if not fused) gets delayed later by the long nonbonded kernel.
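
If the high-priority option were chosen, the stream for the bonded work could be created along these lines. This is a sketch using the CUDA runtime's stream-priority calls; the name `bondedStream` is hypothetical, and whether the device exposes more than one priority level is queried at run time rather than assumed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // In CUDA, numerically LOWER values mean HIGHER priority.
    int leastPriority = 0, greatestPriority = 0;
    cudaError_t err = cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
    if (err != cudaSuccess)
    {
        std::fprintf(stderr, "priority query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    if (leastPriority == greatestPriority)
    {
        std::printf("device exposes a single priority level; priorities have no effect\n");
    }

    // Hypothetical stream for the bonded kernels, created at the highest priority.
    cudaStream_t bondedStream;
    err = cudaStreamCreateWithPriority(&bondedStream, cudaStreamNonBlocking, greatestPriority);
    if (err != cudaSuccess)
    {
        std::fprintf(stderr, "stream creation failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int actualPriority = 0;
    cudaStreamGetPriority(bondedStream, &actualPriority);
    std::printf("priority range [%d, %d], bonded stream priority %d\n",
                leastPriority, greatestPriority, actualPriority);

    cudaStreamDestroy(bondedStream);
    return 0;
}
```

Stream priorities are only a hint to the scheduler about which pending blocks to run first; they do not by themselves guarantee that the bonded kernels make progress ahead of other work.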

Associated revisions

Revision 8be68409
Added by Szilárd Páll 7 months ago

Split nbnxn input copy and kernel launch

The nonbonded x+q host-to-device copy and kernel launch are split into
two functions and called separately from do_force().
This will allow improving the bonded scheduling and better expressing a
missing bonded dependency (and fixing the related bug).

This change only moves code.

Refs #2677 #2786

Change-Id: Ie50e6a6b664f8400274b2f409eacb6c36f0908ba

Revision a01e1374
Added by Szilárd Páll 7 months ago

Swap the order of GPU bonded and nonbonded launch

In order to maximize the chance of the bonded kernel tails to overlap
as well as to avoid GPU idling during the scheduling gaps between these
kernels, this commit swaps the launch order of bonded and nonbonded work.

Refs #2677

Change-Id: Ia4e5e7279e7ea4cf575c76b5286cf10387258878

History

#1 Updated by Szilárd Páll 8 months ago

  • Description updated

#2 Updated by Szilárd Páll 8 months ago

The current code implements a simplified scheduling compared to the above: everything is scheduled in the NB stream. This is simpler, and in a way more robust to keep correct (thanks to the sequential dependency), but likely quite inferior for overlap of kernel tails and gaps between tasks (especially with non-fused kernels).

I do not have the time to change this before next week; if no one else can, this could be postponed for beta2 (though I am not sure whether that is acceptable under the beta policy).

#3 Updated by Mark Abraham 8 months ago

  • Target version changed from 2019-beta1 to 2019-beta2

Not happening before beta2

#4 Updated by Paul Bauer 8 months ago

  • Target version changed from 2019-beta2 to 2019-beta3

Don't think this will happen before beta3.

#5 Updated by Mark Abraham 7 months ago

  • Target version changed from 2019-beta3 to 2019

#6 Updated by Gerrit Code Review Bot 7 months ago

Gerrit received a related DRAFT patchset '2' for Issue #2677.
Uploader: Szilárd Páll
Change-Id: gromacs~release-2019~I9861997e72e34b4fa9672bcb792d16c80f67c93f
Gerrit URL: https://gerrit.gromacs.org/8725

#7 Updated by Szilárd Páll 7 months ago

I've implemented and reasonably well tested the proposed scheduling into an independent stream. Given the limitations of the current implementation (non-fused interaction kernels) and the limitations imposed by the CUDA API (only two priority bits), even if we want to ensure that the bonded work does progress, as it is on the critical path with DD, we can't avoid it competing with the nonlocal nonbonded kernel.

Therefore, unfortunately, exposing concurrency would come at a very limited benefit. Despite the relatively small footprint of the change (~150 LOC), I still propose to move it to master and merge it there (sooner rather than later).

Without exposing bondeds as a concurrent task, what we should do is move the bonded launch before the nonbonded so the small kernels at least have a chance to overlap with other work in other streams (PME or the preempted nonbonded kernel with DD).
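
For illustration, the single-stream variant with the bonded launch moved ahead of the nonbonded one might look like the sketch below. The kernel names, the number of small bonded kernels, and the stream name `nbStream` are hypothetical; the point is only that in-stream ordering provides the force-transfer dependency without any events.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

static void check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess)
    {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical stand-in for one of several small, non-fused bonded kernels.
__global__ void smallBondedSketch(float* f, int n, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { f[i] += value; }
}

// Hypothetical stand-in for the long nonbonded kernel.
__global__ void nonbondedSketch(float* f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { f[i] += 1.0f; }
}

int main()
{
    const int          n = 1024, numBondedKernels = 4;
    std::vector<float> hF(n);
    float*             dF;
    cudaStream_t       nbStream;
    check(cudaMalloc(&dF, n * sizeof(float)), "cudaMalloc");
    check(cudaStreamCreate(&nbStream), "cudaStreamCreate");
    check(cudaMemsetAsync(dF, 0, n * sizeof(float), nbStream), "cudaMemsetAsync");

    // Bonded kernels are enqueued FIRST, so they run ahead of the long nonbonded kernel.
    for (int k = 0; k < numBondedKernels; k++)
    {
        smallBondedSketch<<<(n + 127) / 128, 128, 0, nbStream>>>(dF, n, 0.5f);
    }
    nonbondedSketch<<<(n + 127) / 128, 128, 0, nbStream>>>(dF, n);
    check(cudaGetLastError(), "kernel launch");

    // Same stream: the copy is ordered after every kernel above, no event needed.
    check(cudaMemcpyAsync(hF.data(), dF, n * sizeof(float), cudaMemcpyDeviceToHost, nbStream),
          "cudaMemcpyAsync");
    check(cudaStreamSynchronize(nbStream), "cudaStreamSynchronize");

    for (int i = 0; i < n; i++)
    {
        if (hF[i] != 3.0f) { std::printf("mismatch at %d\n", i); return 1; }
    }
    std::printf("all forces equal 3.0 as expected\n");
    cudaFree(dF);
    cudaStreamDestroy(nbStream);
    return 0;
}
```

Within one stream the kernels still execute one after another; any overlap comes from work in other streams, such as PME, filling the gaps and tails of the small kernels.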

#8 Updated by Szilárd Páll 7 months ago

  • Status changed from New to In Progress
  • Difficulty hard added
  • Difficulty deleted (uncategorized)

#9 Updated by Szilárd Páll 7 months ago

Need to split the xq transfer from the nbnxn kernel launch as a prerequisite, which incidentally clarifies the issue related to #2786.

#10 Updated by Gerrit Code Review Bot 7 months ago

Gerrit received a related patchset '1' for Issue #2677.
Uploader: Szilárd Páll
Change-Id: gromacs~release-2019~Ie50e6a6b664f8400274b2f409eacb6c36f0908ba
Gerrit URL: https://gerrit.gromacs.org/8767

#11 Updated by Gerrit Code Review Bot 7 months ago

Gerrit received a related patchset '1' for Issue #2677.
Uploader: Szilárd Páll
Change-Id: gromacs~release-2019~Ia4e5e7279e7ea4cf575c76b5286cf10387258878
Gerrit URL: https://gerrit.gromacs.org/8769

#12 Updated by Szilárd Páll 7 months ago

There seems to be a lucky side effect of this scheduling change: small kernels are easier for PME to preempt (so PME on GPU progresses better when overlapped with bondeds), in addition to the small but measurable benefit of better overlap when bondeds are scheduled ahead of the nonbonded kernel with DD.

#13 Updated by Szilárd Páll 7 months ago

  • Status changed from In Progress to Resolved

We'll reconsider this after the kernels get fused and made friendlier to scheduling. For now, the current scheduling tweaks are likely as good as we can get (note: one major drawback is that launching all these small bonded kernels introduced a large CPU-side overhead).

#14 Updated by Paul Bauer 6 months ago

  • Status changed from Resolved to Closed
