Task #2675: bonded CUDA offload task
bonded task scheduling
In the first version, the bonded kernels should be scheduled after the nonlocal coordinates arrive on the GPU; the task must wait until both the local and the nonlocal coordinates are ready.
Note that the nbnxn force array transfer will need a stream dependency on the completion of the bonded kernels (only when the bondeds are offloaded).
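The dependencies described above can be expressed with CUDA events. The following is only a minimal sketch under stated assumptions, not the actual GROMACS implementation: the kernel bondedKernelSketch, the function runBondedDependencySketch, and the stream and event names are all hypothetical, and a single device-to-host force copy in the local stream stands in for the nbnxn force transfer.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the bonded kernels: adds 2*x[i] to the force buffer.
__global__ void bondedKernelSketch(const float* x, float* f, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        f[i] += 2.0f * x[i];
    }
}

// Returns true if every CUDA call succeeded and the forces copied back to the
// host match what is expected for the given offload setting.
bool runBondedDependencySketch(bool offloadBondeds)
{
    const int          numLocal = 1024, numNonlocal = 512, numAtoms = numLocal + numNonlocal;
    std::vector<float> h_x(numAtoms, 1.0f), h_f(numAtoms, -1.0f);

    bool ok    = true;
    auto check = [&ok](cudaError_t status) { ok = ok && (status == cudaSuccess); };

    float* d_x = nullptr;
    float* d_f = nullptr;
    check(cudaMalloc(&d_x, numAtoms * sizeof(float)));
    check(cudaMalloc(&d_f, numAtoms * sizeof(float)));
    if (!ok)
    {
        cudaFree(d_x);
        cudaFree(d_f);
        return false;
    }
    // Clear the force buffer and make sure that is finished before any stream work.
    check(cudaMemset(d_f, 0, numAtoms * sizeof(float)));
    check(cudaDeviceSynchronize());

    cudaStream_t localStream = nullptr, nonlocalStream = nullptr, bondedStream = nullptr;
    check(cudaStreamCreateWithFlags(&localStream, cudaStreamNonBlocking));
    check(cudaStreamCreateWithFlags(&nonlocalStream, cudaStreamNonBlocking));
    check(cudaStreamCreateWithFlags(&bondedStream, cudaStreamNonBlocking));

    cudaEvent_t localXReady = nullptr, nonlocalXReady = nullptr, bondedDone = nullptr;
    check(cudaEventCreateWithFlags(&localXReady, cudaEventDisableTiming));
    check(cudaEventCreateWithFlags(&nonlocalXReady, cudaEventDisableTiming));
    check(cudaEventCreateWithFlags(&bondedDone, cudaEventDisableTiming));

    // Coordinate uploads: the local and nonlocal parts go through separate streams.
    check(cudaMemcpyAsync(d_x, h_x.data(), numLocal * sizeof(float),
                          cudaMemcpyHostToDevice, localStream));
    check(cudaEventRecord(localXReady, localStream));
    check(cudaMemcpyAsync(d_x + numLocal, h_x.data() + numLocal, numNonlocal * sizeof(float),
                          cudaMemcpyHostToDevice, nonlocalStream));
    check(cudaEventRecord(nonlocalXReady, nonlocalStream));

    if (offloadBondeds)
    {
        // The bonded task waits for both the local and the nonlocal coordinates.
        check(cudaStreamWaitEvent(bondedStream, localXReady, 0));
        check(cudaStreamWaitEvent(bondedStream, nonlocalXReady, 0));

        const int threads = 128;
        const int blocks  = (numAtoms + threads - 1) / threads;
        bondedKernelSketch<<<blocks, threads, 0, bondedStream>>>(d_x, d_f, numAtoms);
        check(cudaGetLastError());
        check(cudaEventRecord(bondedDone, bondedStream));

        // The force transfer depends on bonded completion only when bondeds are offloaded.
        check(cudaStreamWaitEvent(localStream, bondedDone, 0));
    }

    // Stand-in for the force array transfer.
    check(cudaMemcpyAsync(h_f.data(), d_f, numAtoms * sizeof(float),
                          cudaMemcpyDeviceToHost, localStream));
    check(cudaStreamSynchronize(localStream));

    const float expected = offloadBondeds ? 2.0f : 0.0f;
    for (float value : h_f)
    {
        ok = ok && (value == expected);
    }

    cudaEventDestroy(localXReady);
    cudaEventDestroy(nonlocalXReady);
    cudaEventDestroy(bondedDone);
    cudaStreamDestroy(localStream);
    cudaStreamDestroy(nonlocalStream);
    cudaStreamDestroy(bondedStream);
    cudaFree(d_x);
    cudaFree(d_f);
    return ok;
}
```

Calling runBondedDependencySketch(true) exercises the path with the extra wait on bondedDone, while runBondedDependencySketch(false) skips both the kernel and the dependency, mirroring the conditional described above.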
Kernel priority TBD:
- with DD: high prio because there is no local/non-local split and dd_move_f() depends on all bondeds
- w/o DD: (likely also) high prio, since otherwise the bonded work might get preempted by PME and the chain of kernels (if not fused) would be delayed by the long nonbonded kernel.
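For reference, this is roughly how a stream priority is requested through the CUDA runtime API. It is a sketch only: the helper createBondedStreamSketch and its parameters are hypothetical and not taken from the GROMACS code. Note that in CUDA numerically lower values mean higher priority, and that the device may expose only a narrow range of priorities.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: creates a stream for the bonded task at either the
// highest or the lowest priority the device supports, and reports the
// priority the stream actually got. Returns true on success.
bool createBondedStreamSketch(bool wantHighPriority, cudaStream_t* stream, int* priorityUsed)
{
    int lowestPriority  = 0;
    int highestPriority = 0;
    if (cudaDeviceGetStreamPriorityRange(&lowestPriority, &highestPriority) != cudaSuccess)
    {
        return false;
    }

    // Numerically lower values mean higher priority in CUDA.
    const int requested = wantHighPriority ? highestPriority : lowestPriority;
    if (cudaStreamCreateWithPriority(stream, cudaStreamNonBlocking, requested) != cudaSuccess)
    {
        return false;
    }

    return cudaStreamGetPriority(*stream, priorityUsed) == cudaSuccess;
}
```

With both cases above leaning towards high priority, such a helper would be called with wantHighPriority set to true regardless of whether DD is in use; the flag is kept only because the priority is still marked TBD.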
Split nbnxn input copy and kernel launch
The nonbonded x+q host-to-device copy and kernel launch are split into
two functions and called separately from do_force().
This will allow improving the bonded scheduling and better expressing a
missing bonded dependency (and fixing the related bug).
This change only moves code.
Swap the order of GPU bonded and nonbonded launch
In order to maximize the chance that the bonded kernel tails overlap,
as well as to avoid GPU idling during the scheduling gaps between these
kernels, this commit swaps the launch order of bonded and nonbonded work.
#2 Updated by Szilárd Páll almost 2 years ago
The current code implements a simplified scheduling compared to the above: everything is scheduled in the NB stream. This is simpler and, in a way, easier to keep correct (thanks to the sequential dependency), but likely quite inferior in terms of overlapping kernel tails and hiding gaps between tasks (especially with non-fused kernels).
I do not have the time to change this before next week; if no one else can, this could be postponed to beta2 (though I am not sure whether that is acceptable under the beta policy).
#7 Updated by Szilárd Páll almost 2 years ago
I've implemented and reasonably well tested the proposed scheduling into an independent stream. Given the limitations of the current implementation (non-fused interaction kernels) and the limitations imposed by the CUDA API (only two priority bits), even when we'd want to ensure that the bonded work does progress, as it's on the critical path with DD, we can't avoid it competing with the nonlocal nonbonded kernel.
Therefore, unfortunately, exposing concurrency would bring only a very limited benefit. Despite the relatively small footprint of the change (~150 LOC), I still propose to move it to master and merge it there (sooner rather than later).
Without exposing bondeds as a concurrent task, what we should do is move the bonded launch before the nonbonded launch, so that the small kernels at least have a chance to overlap with other work in other streams (PME or the preempted nonbonded kernel with DD).
#12 Updated by Szilárd Páll almost 2 years ago
There seems to be a lucky side-effect of this scheduling change: small kernels are easier for PME to preempt (so PME on the GPU progresses better when overlapped with bondeds), in addition to the small but measurable benefit of better overlap when bondeds are scheduled ahead of the nonbonded kernel with DD.
#13 Updated by Szilárd Páll almost 2 years ago
- Status changed from In Progress to Resolved
We'll reconsider this after the kernels get fused and made friendlier to scheduling. For now, the current scheduling tweaks are likely as good as we can get (note: one major drawback is that launching all these small bonded kernels introduces a large CPU-side overhead).