Task #3370

Further improvements to GPU Buffer Ops and Comms

Added by Alan Gray 8 months ago. Updated 7 months ago.

Status: New
Priority: High
Assignee: -
Category: -
Target version: -
Difficulty: uncategorized

Description

Umbrella task for follow-up improvements.

[HIGH PRIORITY] Unification of code-paths across different types of step in do_force

  • Allow GPU Force Buffer ops to be active on virial steps
  • Unify X/F Buffer ops flags
  • Allow GPU PME-PP comms to be active on virial steps
  • Allow GPU halo exchange to be active on virial steps (requires an extension to include the shift force contribution; see the sketch after this list)
  • Unify and simplify X/F Halo exchange triggers. See comments below.
  • Allow GPU X buffer ops to be active on search steps. Update: realized this is not required since there are no X buffer ops calls from do_force on search steps.
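
A minimal CUDA sketch of how the halo-exchange force kernel could be extended for virial steps, as referenced in the list above. All names (unpackHaloForces, indexMap, fShift, etc.) are hypothetical and only illustrate the idea of accumulating the received contribution into a per-shift-vector buffer when the virial is needed; this is not the actual GROMACS implementation.

    // Sketch only: unpack received halo forces and, on virial steps, also accumulate
    // the shift force contribution for this halo zone's periodic shift vector.
    __global__ void unpackHaloForces(const float3* __restrict__ recvBuf,   // received non-local contributions
                                     const int*    __restrict__ indexMap,  // halo slot -> local atom index
                                     float3*       __restrict__ force,     // local force buffer
                                     float3*       __restrict__ fShift,    // per-shift-vector accumulators
                                     int           shiftIndex,             // periodic shift of this zone
                                     bool          calcVirial,
                                     int           numHaloAtoms)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= numHaloAtoms)
        {
            return;
        }
        const float3 f = recvBuf[i];
        const int    a = indexMap[i];
        atomicAdd(&force[a].x, f.x);
        atomicAdd(&force[a].y, f.y);
        atomicAdd(&force[a].z, f.z);
        if (calcVirial)
        {
            // The extra work that would make the GPU path usable on virial steps.
            atomicAdd(&fShift[shiftIndex].x, f.x);
            atomicAdd(&fShift[shiftIndex].y, f.y);
            atomicAdd(&fShift[shiftIndex].z, f.z);
        }
    }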

[HIGH PRIORITY] Refactoring

[HIGH PRIORITY] Force buffer op and reduction cleanup/improvement

Previous general discussion at https://redmine.gromacs.org/issues/3029
  • Rework GPU direct halo-exchange related force reduction complexities
  • Centralize and clarify GPU force buffer clearing: the responsibility for (rvec) force buffer clearing should be moved into StatePropagatorDataGpu and arranged such that this is not a task on the critical path (as it is right now in GpuHaloExchange::Impl::communicateHaloForces()).
  • At the same time, we need to
    • skip CPU-side force buffer clearing if there are no CPU forces computed
    • check all code-paths and make sure we cannot end up with reduction kernels accumulating into uninitialized buffers.
  • Launch the transform kernel back-to-back after the nonbonded rather than later, next to the CPU buffer ops/reduction
  • The transform+reduce kernels can use simple or atomic accumulation into a reduced f output buffer; the former requires exclusive access to the target force buffer (we need to wait for the completion of any kernel that produces forces into it), while the latter only requires a wait on the source force buffer(s) to be reduced into the target (e.g. GPU NB and/or CPU force buffer). See the sketch after this list.
  • Consider an inline transform function for on-the-fly transform within the nonbonded kernel; in particular at high parallelization the performance hit in the nonbonded kernel may be smaller than the cost of launching an extra kernel.
  • Ideally the force-reduction should not be called from a method of the nonbonded module (especially due to the complexities of CPU/GPU code-paths) - consider reorganizing reductions
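
A minimal CUDA sketch of the simple-vs-atomic accumulation choice discussed in the list above. The names (transformAndReduceForces, cell, fNbat, fRvec) are placeholders and the layout handling is simplified; the point is only the synchronization trade-off: the non-atomic path needs exclusive access to the target buffer, while the atomic path only needs the source buffer to be ready.

    // Transform nonbonded-layout (nbat) forces to rvec layout and reduce them into fRvec.
    // fRvec is assumed to have been cleared, or to already hold forces to accumulate into.
    __global__ void transformAndReduceForces(const float3* __restrict__ fNbat,  // nbat-layout forces
                                             const int*    __restrict__ cell,   // atom index -> nbat index
                                             float3*       __restrict__ fRvec,  // target rvec-layout buffer
                                             int           numAtoms,
                                             bool          accumulateAtomically)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= numAtoms)
        {
            return;
        }
        const float3 f = fNbat[cell[i]];
        if (accumulateAtomically)
        {
            // Only the producers of fNbat need to have completed before this launch.
            atomicAdd(&fRvec[i].x, f.x);
            atomicAdd(&fRvec[i].y, f.y);
            atomicAdd(&fRvec[i].z, f.z);
        }
        else
        {
            // Requires exclusive access: any other kernel writing fRvec must have finished.
            fRvec[i].x += f.x;
            fRvec[i].y += f.y;
            fRvec[i].z += f.z;
        }
    }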

Remove Limitations

Timing

  • add missing cycle counters related to buffer ops/reduction launches

Improve synchronization

  • Implement better receiver ready / notify in halo exchange: the current notification mechanism effectively turns the one-sided communication into synchronous two-sided communication. Alternatives should be considered.
  • Separate PME x receive sync: the data-dependency synchronization should be implemented on the consumer task's end, which in the case of PME is the PME spread. Currently, PME-only ranks enqueue the wait on the receive as soon as MPI returns. Consider assembling a list of events and passing it to spread instead (see the sketch below). Consider also whether having to receive from multiple PP ranks makes it more beneficial to overlap some of the receives with the event wait enqueue.
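
A sketch of the consumer-side synchronization proposed in the last item, under assumed names: each per-PP-rank coordinate receive records an event, and the PME spread launch enqueues waits on the collected events, instead of the receive path enqueueing the wait as soon as MPI returns. This is illustrative only, not the existing PME-PP communication code.

    #include <vector>
    #include <cuda_runtime.h>

    // Events signalling that the coordinates from each sending PP rank have arrived on the GPU.
    static std::vector<cudaEvent_t> xReadyEvents;

    // Called after the coordinate receive from one PP rank has been enqueued on recvStream.
    void recordCoordinatesReceived(cudaStream_t recvStream, cudaEvent_t event)
    {
        cudaEventRecord(event, recvStream);
        xReadyEvents.push_back(event);
    }

    // Called just before launching PME spread: make the PME stream wait on all receives.
    void enqueueSpreadDependencies(cudaStream_t pmeStream)
    {
        for (cudaEvent_t e : xReadyEvents)
        {
            cudaStreamWaitEvent(pmeStream, e, 0);
        }
        xReadyEvents.clear();
    }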

Investigate GPU f buffer ops use cases

Check if there are any performance benefits to be had, and in which regimes, for x / f buffer ops without GPU update in:
  • runs with DD and CPU update
    • x buffer ops: offloadable with a likely simple crossover heuristic threshold; i.e. below N atoms/core not offloaded (locals or also nonlocals, with/without CPU work?)
    • f buffer ops: heuristics likely more complex criteria (as it is combined with reductions)
  • runs with / without DD and vsites
    • with GPU update: requires D2H and H2D -- is it worth it? Test use-cases (e.g. multiple ranks per GPU, both ensemble and DD runs; transfers might be overlapped)
    • without GPU update: the same applies as for the non-vsite runs above, except that the wait on the D2H transfer needs to happen earlier

Evaluate the #atoms threshold below which it is not worth paying the 10-15 us overhead of a kernel launch (especially for non-local buffer ops).
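
A hedged sketch of the kind of crossover heuristic this question is about; the threshold value and the names are placeholders to be replaced by whatever benchmarking shows, not part of the current code.

    #include <algorithm>

    // Hypothetical threshold: below this many atoms per core, the ~10-15 us launch overhead
    // of the buffer-ops kernel is assumed to outweigh the benefit of offloading.
    constexpr int c_minAtomsPerCoreForGpuXBufferOps = 2000;

    bool offloadXBufferOps(int numLocalAtoms, int numCores)
    {
        const int atomsPerCore = numLocalAtoms / std::max(numCores, 1);
        return atomsPerCore >= c_minAtomsPerCoreForGpuXBufferOps;
    }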


Subtasks

Feature #2890: GPU Halo Exchange (In Progress)
Task #3089: relax dlb scaling limit when that would suit GPU halo exchange (Closed, Berk Hess)
Task #3092: implement better receiver ready / notify in halo exchange (Closed)
Task #3104: implement GPU DD cycle counting (New)
Task #3156: move ddUsesGpuDirectCommunication and related conditionals into the workload data structures (Closed, Alan Gray)
Feature #2891: PME/PP GPU communications (In Progress)
Task #3077: PME/PP GPU Comms unique pointer deletion causes seg fault when CUDA calls exist in destructor (Feedback wanted)
Task #3105: implement GPU PME/PP comm cycle counting (New)
Task #3157: separate PME x receive sync (Closed, Alan Gray)
Task #3158: use MPI_Isend() in sendFToPpCudaDirect and receiveCoordinatesFromPpCudaDirect (Closed, Alan Gray)
Bug #3164: mdrun-mpi-test with separate PME ranks and PP-PME CPU comm crashes (Closed, Alan Gray)
Feature #2915: GPU direct communications (In Progress)
Task #3082: move launch/synchronization points to clarify task dependencies (New)
Feature #3087: enable GPU peer to peer access (Closed)
Task #2965: Performance of GPU direct communications (In Progress)
Feature #3021: Completion of docs for GPU developments (Feedback wanted)
Task #3093: rework GPU direct halo-exchange related force reduction complexities (In Progress)
Task #3106: Implement multiple pulses with GPU communication (Closed, Alan Gray)
Bug #3159: eliminate regression due to moving gmx_pme_send_coordinates() (Closed, Alan Gray)
Feature #3160: implement direct comm for different src/target memory spaces (Fix uploaded, Alan Gray)

History

#1 Updated by Alan Gray 8 months ago

  • Description updated (diff)

#2 Updated by Alan Gray 8 months ago

  • Description updated (diff)

#3 Updated by Alan Gray 8 months ago

  • Description updated (diff)

#4 Updated by Alan Gray 8 months ago

  • Description updated (diff)

#5 Updated by Alan Gray 8 months ago

  • Description updated (diff)

#6 Updated by Alan Gray 8 months ago

  • Description updated (diff)

#7 Updated by Alan Gray 8 months ago

  • Description updated (diff)

#8 Updated by Alan Gray 8 months ago

  • Description updated (diff)

#9 Updated by Alan Gray 8 months ago

  • Description updated (diff)

#10 Updated by Alan Gray 8 months ago

  • Description updated (diff)

#11 Updated by Alan Gray 8 months ago

  • Description updated (diff)

#12 Updated by Alan Gray 8 months ago

  • Description updated (diff)

#13 Updated by Alan Gray 8 months ago

  • Description updated (diff)

#14 Updated by Alan Gray 7 months ago

  • Description updated (diff)

#15 Updated by Alan Gray 7 months ago

  • Description updated (diff)

#16 Updated by Szilárd Páll 7 months ago

  • Unify X/F Buffer ops flags
  • x buffer ops: offloadable with a likely simple crossover heuristic threshold; i.e. below N atoms/core not offloaded (locals or also nonlocals, with/without CPU work?)

Have we measured the cross-over point between CPU and GPU buffer ops times? If we have not, the above goals do conflict: unifying the flags means the x buffer ops trigger cannot be tuned based on a #atoms threshold.

Additionally, the x/f buffer ops are entirely different tasks so I see little benefit in merging their workload flags -- other than saving a few bytes in the workload data structure.

  • Allow GPU X buffer ops to be active on search steps. Update: realized this is not required since there are no X buffer ops calls from do_force on search steps.

On search steps the search produces nonbonded-layout x, so technically it is not needed. We could change that and avoid having the search store the coordinates and call the buffer ops instead. The benefit would be uniform behavior on the GPU across all steps but different behavior for CPU and GPU search.

#17 Updated by Szilárd Páll 7 months ago

Szilárd Páll wrote:

  • Unify X/F Buffer ops flags
  • x buffer ops: offloadable with a likely simple crossover heuristic threshold; i.e. below N atoms/core not offloaded (locals or also nonlocals, with/without CPU work?)

Have we measured the cross-over point between CPU and GPU buffer ops times?

See https://redmine.gromacs.org/issues/3029#note-1; IIRC that was CPU vs GPU kernel time, but CPU critical path will be affected more by kernel launch cost (at least until we can overlap GPU launch with CPU execution).

#18 Updated by Alan Gray 7 months ago

  • Description updated (diff)

#19 Updated by Alan Gray 7 months ago

The idea is to try and simplify the logic in do_force (and do_md) by unifying flags and ultimately code-paths, to improve readability and maintainability of the code (and reduce the scope of required test coverage). I acknowledge that this is in conflict with the idea of developments that add more flexibility to hardware scheduling through heuristics, which may have some performance benefits but would further increase complexity. I suggest that we focus on simplification/cleanup in the short term, and put the latter idea on the backburner as a possible future optimization task.

#20 Updated by Artem Zhmurov 7 months ago

Alan Gray wrote:

The idea is to try and simplify the logic in do_force (and do_md) by unifying flags and ultimately code-paths, to improve readability and maintainability of the code (and reduce the scope of required test coverage). I acknowledge that this is in conflict with the idea of developments that add more flexibility to hardware scheduling through heuristics, which may have some performance benefits but would further increase complexity. I suggest that we focus on simplification/cleanup in the short term, and put the latter idea on the backburner as a possible future optimization task.

I agree with Szilard here. The X buffer ops most likely should be enabled whenever we need coordinates in nbat format on the GPU. I think we can even live with re-doing them on the GPU on search steps, so that the XBuffOps flag naturally goes away. This would also eliminate the need for the copy_nbat_coordinates_host_to_device function and the logic around it. F buffer ops need more work before the corresponding flag can be eliminated in the same way.

#21 Updated by Szilárd Páll 7 months ago

This is not just a matter of leaving room for optimization. The two tasks in question, x buffer ops and f buffer ops + reduction, are entirely different tasks, so it does make sense to keep them separate.

Also note that tasks themselves will not be "eliminated" (unless underlying algorithms change) and therefore it is entirely reasonable to have workload flags corresponding to these. These flags define the schedule and therefore, unless a task becomes trivial or merged into another across all code paths (i.e. all GPU code-paths support X buffer ops and always schedule it together with some other nbnxm task), the XBuffOps flag can't go away.

Last, I see only a very small code (LOC/logic) simplification in using one simulationWorkload.useGpuBufferOps flag versus two stepWorkload flags.
Side-note: the current thinking is that we should have an inclusive stepWorkload data structure that contains all the higher-level flags and is constructed ahead of time for N steps.
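
For illustration only, a rough shape of the flag split being discussed; the actual GROMACS SimulationWorkload/StepWorkload structures differ in detail. Keeping separate per-step x and f flags lets the two tasks be scheduled and tuned independently, at the cost of a few bytes.

    // Whole-run decision, set up once at initialization.
    struct SimulationWorkload
    {
        bool useGpuBufferOps = false;   // buffer ops are offloaded to the GPU at all
    };

    // Per-step decisions, derivable ahead of time for N steps.
    struct StepWorkload
    {
        bool useGpuXBufferOps = false;  // transform x to nbat layout on the GPU this step
        bool useGpuFBufferOps = false;  // transform/reduce f from nbat layout on the GPU this step
    };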

#22 Updated by Alan Gray 7 months ago

OK, noted. I will update the existing patches in gerrit accordingly.

#23 Updated by Alan Gray 7 months ago

  • Description updated (diff)

#24 Updated by Alan Gray 7 months ago

  • Description updated (diff)
