Task #3370

Updated by Alan Gray 10 months ago

h1. Umbrella task for follow-up improvements.

h2. [HIGH PRIORITY] Unification of code-paths across different types of step in do_force

* Allow GPU Force Buffer ops to be active on virial steps
** Fix uploaded https://gerrit.gromacs.org/c/gromacs/+/15960
* Unify X/F Buffer ops flags
** Above fix unifies to single stepWork flag
** https://gerrit.gromacs.org/c/gromacs/+/15961 moves to unified simulationWork flag
* Allow GPU PME-PP comms to be active on virial steps
* Allow GPU halo exchange to be active on virial steps (requires extension to include shift force contribution)
* Unify and simplify X/F Halo exchange triggers.
* -Allow GPU X buffer ops to be active on search steps. Update: realized this is not required since there are no X buffer ops calls from do_force on search steps.-
* Unify and simplify X/F triggers in do_force, now allowed by above tasks.
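
As a rough illustration of where the unification above could land, here is a minimal, self-contained sketch (hypothetical field names, not the actual GROMACS structures) of per-run simulationWork and per-step stepWork flags, with the GPU force buffer ops trigger no longer special-casing virial steps:

<pre><code class="cpp">
// Minimal sketch, not GROMACS code: hypothetical flag structures illustrating how
// per-step triggers could collapse into one simulationWork flag plus a few stepWork
// bits, so buffer ops stay active on virial and non-virial steps alike.
#include <cstdio>

struct SimulationWorkload    // decided once at setup
{
    bool useGpuBufferOps    = false;
    bool useGpuHaloExchange = false;
    bool useGpuPmePpComm    = false;
};

struct StepWorkload          // decided every step
{
    bool computeVirial    = false;
    bool doNeighborSearch = false;
};

// Unified trigger: buffer ops are no longer disabled just because the step needs the virial.
static bool useGpuFBufferOpsThisStep(const SimulationWorkload& simWork, const StepWorkload& stepWork)
{
    return simWork.useGpuBufferOps && !stepWork.doNeighborSearch;
}

int main()
{
    SimulationWorkload simWork;
    simWork.useGpuBufferOps = true;

    StepWorkload virialStep;
    virialStep.computeVirial = true;

    std::printf("GPU F buffer ops on a virial step: %s\n",
                useGpuFBufferOpsThisStep(simWork, virialStep) ? "yes" : "no");
    return 0;
}
</code></pre>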


h2. [HIGH PRIORITY] Refactoring

* eliminate regression due to moving gmx_pme_send_coordinates()
** subtask https://redmine.gromacs.org/issues/3159
** Fix uploaded https://gerrit.gromacs.org/c/gromacs/+/15200
* move ddUsesGpuDirectCommunication and related conditionals into the workload data structures
** Fix uploaded https://gerrit.gromacs.org/c/gromacs/+/15437
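
A minimal sketch of the second item (hypothetical names, not the actual GROMACS code): the ddUsesGpuDirectCommunication decision is recorded once in the simulation workload structure at setup and queried from there in the force schedule, instead of being threaded through call chains as a loose flag:

<pre><code class="cpp">
// Minimal sketch, not GROMACS code: the GPU direct-communication decision lives in the
// workload data structure instead of a standalone ddUsesGpuDirectCommunication bool.
#include <cstdio>

struct SimulationWorkload
{
    bool useGpuHaloExchange       = false; // stands in for ddUsesGpuDirectCommunication (X/F halos)
    bool useGpuPmePpCommunication = false; // stands in for the related PME-PP conditional
};

static SimulationWorkload setupSimulationWorkload(bool haveDomDec, bool devFlagGpuComm)
{
    SimulationWorkload simWork;
    simWork.useGpuHaloExchange       = haveDomDec && devFlagGpuComm;
    simWork.useGpuPmePpCommunication = devFlagGpuComm;
    return simWork;
}

int main()
{
    const SimulationWorkload simWork =
            setupSimulationWorkload(/*haveDomDec=*/true, /*devFlagGpuComm=*/true);

    // A do_force-style consumer checks the workload structure rather than a loose flag:
    if (simWork.useGpuHaloExchange)
    {
        std::printf("launching GPU halo exchange\n");
    }
    return 0;
}
</code></pre>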

h2. [HIGH PRIORITY] Force buffer op and reduction cleanup/improvement

Previous general discussion at https://redmine.gromacs.org/issues/3029
* Rework GPU direct halo-exchange related force reduction complexities
** subtask https://redmine.gromacs.org/issues/3093
* Centralize and clarify GPU force buffer clearing: the responsibility for clearing the (rvec) force buffer should be moved into StatePropagatorDataGpu and arranged such that it is not a task on the critical path (as it is right now in GpuHaloExchange::Impl::communicateHaloForces()); see the sketch after this list.
** Previous discussion at https://redmine.gromacs.org/issues/3142
* At the same time, we need to
** skip CPU-side force buffer clearing if there are no CPU forces computed
** check all code-paths and make sure we cannot end up with reduction kernels accumulating into uninitialized buffers.
* Launch the transform kernel back-to-back after the nonbonded kernel rather than later, next to the CPU buffer ops/reduction
* Ideally the force reduction should not be called from a method of the nonbonded module (especially given the complexities of the CPU/GPU code-paths) - consider reorganizing the reductions, as sketched below
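
The following is a minimal CUDA sketch (hypothetical, not GROMACS code) of two of the items above: the (rvec) force buffer is cleared on a side stream ahead of time, with an event marking completion, so the reduction only enqueues a wait instead of clearing on the critical path; and the final reduction is performed by a standalone kernel/helper rather than a method of the nonbonded module:

<pre><code class="cpp">
// Minimal CUDA sketch, not GROMACS code: early, off-critical-path force buffer clearing
// plus a standalone reduction kernel that accumulates the nonbonded and halo contributions.
#include <cuda_runtime.h>

__global__ void reduceForces(float3* f, const float3* fNbat, const float3* fHalo, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        // f was cleared earlier on the clearing stream, so plain accumulation is safe
        f[i].x += fNbat[i].x + fHalo[i].x;
        f[i].y += fNbat[i].y + fHalo[i].y;
        f[i].z += fNbat[i].z + fHalo[i].z;
    }
}

int main()
{
    const int n = 1 << 16;
    float3 *f, *fNbat, *fHalo;
    cudaMalloc(&f, n * sizeof(float3));
    cudaMalloc(&fNbat, n * sizeof(float3));
    cudaMalloc(&fHalo, n * sizeof(float3));

    cudaStream_t clearStream, reductionStream;
    cudaStreamCreate(&clearStream);
    cudaStreamCreate(&reductionStream);
    cudaEvent_t clearDone;
    cudaEventCreateWithFlags(&clearDone, cudaEventDisableTiming);

    // Clear early, off the critical path, and mark completion with an event.
    cudaMemsetAsync(f, 0, n * sizeof(float3), clearStream);
    cudaEventRecord(clearDone, clearStream);

    // ... the kernels producing fNbat / fHalo would be launched here ...

    // The reduction stream only enqueues a wait on the event, rather than clearing synchronously.
    cudaStreamWaitEvent(reductionStream, clearDone, 0);
    reduceForces<<<(n + 127) / 128, 128, 0, reductionStream>>>(f, fNbat, fHalo, n);

    cudaStreamSynchronize(reductionStream);
    cudaFree(f);
    cudaFree(fNbat);
    cudaFree(fHalo);
    return 0;
}
</code></pre>

With the clear on its own stream, the only cost left on the critical path is the stream-wait-event enqueue, and the helper owning the reduction can also own the decision of when and whether to clear.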

h2. Remove Limitations

* -Implement multiple pulses within GPU halo exchange communication-
** -subtask https://redmine.gromacs.org/issues/3106-
** Fix uploaded https://gerrit.gromacs.org/c/gromacs/+/14723
* Implement multiple dimensions within GPU halo exchange communication (see the sketch after this list)
* Extend PME-PP communication to support case where PME is on CPU and PP is on GPU.
** subtask https://redmine.gromacs.org/issues/3160
** Fix uploaded https://gerrit.gromacs.org/c/gromacs/+/14223
* Extend PME-PP communication to support coordinate send from CPU
** subtask https://redmine.gromacs.org/issues/3160
** Fix uploaded https://gerrit.gromacs.org/c/gromacs/+/14238
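
For the multiple-dimensions item above, a minimal sketch (hypothetical type, not the actual GpuHaloExchange class) of the data layout this implies: one exchange object per (dimension, pulse) pair, executed in order so that later pulses and dimensions operate on data received by earlier ones:

<pre><code class="cpp">
// Minimal sketch, not GROMACS code: per-(dimension, pulse) halo-exchange objects,
// mirroring the CPU halo structure, executed in order.
#include <cstdio>
#include <vector>

struct GpuHaloExchangeSketch
{
    int dim;
    int pulse;
    void communicateHaloCoordinates() const
    {
        std::printf("halo X exchange: dim %d, pulse %d\n", dim, pulse);
    }
};

int main()
{
    const int              numDims      = 2;       // e.g. DD decomposed in two dimensions
    const std::vector<int> pulsesPerDim = { 2, 1 };

    // Build one exchange object per (dimension, pulse).
    std::vector<std::vector<GpuHaloExchangeSketch>> haloExchange(numDims);
    for (int d = 0; d < numDims; d++)
    {
        for (int p = 0; p < pulsesPerDim[d]; p++)
        {
            haloExchange[d].push_back({ d, p });
        }
    }

    // Coordinate halo exchange: dimensions and pulses must run in order.
    for (const auto& dimExchanges : haloExchange)
    {
        for (const auto& exchange : dimExchanges)
        {
            exchange.communicateHaloCoordinates();
        }
    }
    return 0;
}
</code></pre>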

h2. Timing

* add missing cycle counters related to buffer ops/reduction launches
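
A minimal sketch of what the bracketing could look like (a hypothetical RAII wrapper standing in for the wallcycle sub-counter start/stop calls GROMACS uses for this):

<pre><code class="cpp">
// Minimal sketch, not GROMACS code: time the launch region with an RAII guard, analogous
// to bracketing it with a dedicated wallcycle sub-counter.
#include <chrono>
#include <cstdio>

class LaunchCycleCounter
{
public:
    explicit LaunchCycleCounter(const char* name) :
        name_(name), start_(std::chrono::steady_clock::now())
    {
    }
    ~LaunchCycleCounter()
    {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        const auto us = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
        std::printf("%s launch: %lld us\n", name_, static_cast<long long>(us));
    }

private:
    const char*                                         name_;
    std::chrono::time_point<std::chrono::steady_clock>  start_;
};

int main()
{
    {
        LaunchCycleCounter counter("GPU F buffer ops/reduction");
        // the buffer ops / reduction kernel launch would go here
    }
    return 0;
}
</code></pre>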

h2. Improve synchronization

* Implement better receiver ready / notify in halo exchange: the current notification mechanisms effectively turn the one-sided communication into synchronous two-sided communication. Alternatives should be considered.
* Separate PME x receive sync: the data-dependency synchronization should be implemented on the consumer task's end, which in the PME case is spread. PME-only ranks currently enqueue the wait on the receive as soon as MPI returns. Consider assembling a list of events and passing it to spread instead, as sketched below. Also consider whether, when receiving from multiple PP ranks, it is more beneficial to overlap some of the receives with the event-wait enqueue.
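
A minimal CUDA sketch (hypothetical, not GROMACS code) of the consumer-side synchronization suggested above: one event per sending PP rank is recorded when that rank's coordinates are ready on the GPU, and the PME stream enqueues waits on the assembled list of events right before spread, instead of the receive path enqueuing the wait as soon as MPI returns:

<pre><code class="cpp">
// Minimal CUDA sketch, not GROMACS code: the data dependency is enforced where the
// coordinates are consumed (the PME spread stream), via a list of per-PP-rank events.
#include <cuda_runtime.h>
#include <vector>

int main()
{
    const int    numPpRanks = 4;
    cudaStream_t pmeStream;
    cudaStreamCreate(&pmeStream);

    // Producer side (sketched): record one event per PP rank once its coordinates are on the GPU.
    std::vector<cudaEvent_t> xReadyEvents(numPpRanks);
    for (auto& event : xReadyEvents)
    {
        cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
        cudaEventRecord(event, /*stream=*/0); // stands in for the per-rank receive stream
    }

    // Consumer side: spread's stream waits on the assembled list of events,
    // so the dependency is enforced where the data is actually used.
    for (const auto& event : xReadyEvents)
    {
        cudaStreamWaitEvent(pmeStream, event, 0);
    }
    // the PME spread kernel would be launched into pmeStream here

    cudaStreamSynchronize(pmeStream);
    for (auto& event : xReadyEvents)
    {
        cudaEventDestroy(event);
    }
    cudaStreamDestroy(pmeStream);
    return 0;
}
</code></pre>

With the dependency expressed as a list of events, any overlap of receives with the wait enqueue becomes a local choice in the spread launch path rather than in the MPI receive path.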

h2. Investigate GPU f buffer ops use cases

Check whether there are any performance benefits to be had, and in which regimes, for x / f buffer ops without GPU update in:
* runs with DD and CPU update
** x buffer ops: offloadable with what is likely a simple crossover heuristic threshold, i.e. below N atoms/core do not offload (locals only, or also non-locals, with/without CPU work?)
** f buffer ops: the heuristics likely need more complex criteria (since these ops are combined with reductions)
* runs with / without DD and vsites
** with GPU update this requires D2H and H2D transfers -- is it worth it? Test use-cases (e.g. multiple ranks per GPU, both ensemble and DD runs; the transfers might be overlapped)
** without GPU update: the same applies as for the non-vsite runs above, except that the wait on the D2H transfer needs to happen earlier

Evaluate the #atoms threshold below which it is not worth taking the 10-15 us kernel launch overhead (especially for non-local buffer ops).
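
A minimal sketch (hypothetical threshold value and names, not GROMACS code) of the kind of crossover heuristic this evaluation would feed:

<pre><code class="cpp">
// Minimal sketch, not GROMACS code: offload the buffer ops only when there are enough
// atoms per core for the saved work to outweigh the ~10-15 us kernel launch overhead.
#include <cstdio>

// Assumed, illustrative threshold; the real value would come from benchmarking.
constexpr int c_minAtomsPerCoreForGpuBufferOps = 3000;

static bool offloadXBufferOps(int numLocalAtoms, int numCores)
{
    const int atomsPerCore = numLocalAtoms / numCores;
    return atomsPerCore >= c_minAtomsPerCoreForGpuBufferOps;
}

int main()
{
    std::printf("48000 atoms on 8 cores -> offload: %s\n", offloadXBufferOps(48000, 8) ? "yes" : "no");
    std::printf(" 8000 atoms on 8 cores -> offload: %s\n", offloadXBufferOps(8000, 8) ? "yes" : "no");
    return 0;
}
</code></pre>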
