Task #3370

Updated by Alan Gray 10 months ago

h1. Umbrella task for follow-up improvements.

h2. Unification of code-paths across different types of step in do_force

* The GPU buffer ops/reduction kernel should gain support for computing the virial from the bonded and short-range nonbonded contributions, to avoid having to fall back to the CPU on virial steps. Additionally, if any bonded interactions are calculated on the CPU, these forces need to be transferred separately for the virial reduction on virial steps.
* Allow uniformity across search and non-search steps (at the moment, on search steps the xyzq format is generated during search and copied, whereas non-search steps do the conversion, which adds complexity).
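
As a rough illustration of the virial item above, the following is a minimal CPU-side sketch, not the GPU kernel itself. It assumes the single-sum form of the virial, -0.5 * sum_i x_i (outer product) f_i, and deliberately omits the periodic shift-force correction. The names @addVirialContribution@ and @virialFromSeparateBuffers@ are hypothetical. Because the virial is linear in the forces, the GPU-reduced forces and the separately transferred CPU bonded forces can be accumulated one after the other.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec3    = std::array<float, 3>;
using Matrix3 = std::array<std::array<double, 3>, 3>;

// Adds -0.5 * sum_i x_i (outer product) f_i to virial.
// Assumption: periodic shift-force corrections are left out of this sketch.
void addVirialContribution(const std::vector<Vec3>& x, const std::vector<Vec3>& f, Matrix3& virial)
{
    assert(x.size() == f.size());
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        for (int a = 0; a < 3; ++a)
        {
            for (int b = 0; b < 3; ++b)
            {
                virial[a][b] += -0.5 * static_cast<double>(x[i][a]) * static_cast<double>(f[i][b]);
            }
        }
    }
}

// Hypothetical virial-step helper: the forces reduced on the GPU and the
// bonded forces computed on the CPU arrive in separate buffers.
Matrix3 virialFromSeparateBuffers(const std::vector<Vec3>& x,
                                  const std::vector<Vec3>& gpuForce,
                                  const std::vector<Vec3>& cpuBondedForce)
{
    Matrix3 virial{}; // value-initialized to zero
    addVirialContribution(x, gpuForce, virial);
    addVirialContribution(x, cpuBondedForce, virial);
    return virial;
}
```

Accumulating in double while the inputs are float is a choice made for this sketch only; the precision used in an actual kernel is an open design question.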

h2. Improve synchronization

* Implement a better receiver-ready / notify mechanism in halo exchange: the current notification mechanism effectively turns the one-sided communication into synchronous two-sided communication. Alternatives should be considered.
* Separate PME x receive sync: the data dependency synchronization should be implemented on the consumer task's end, which is spread in the case of PME. PME-only ranks currently enqueue the receive wait as soon as MPI returns. Consider assembling a list of events and passing it to spread instead. Also consider whether having to receive from multiple PP ranks actually makes it more beneficial to overlap some receives with the event wait enqueue.

h2. Refactoring

* Consider all possible combinations of triggers, and how to combine them optimally in each case

h2. Timing

* add missing cycle counters related to buffer ops/reduction launches

h2. Force buffer op and reduction cleanup/improvement

Centralize and clarify GPU force buffer clearing:
* The responsibility for (rvec) force buffer clearing should be moved into StatePropagatorDataGpu and arranged such that it is not a task on the critical path (as it is right now in GpuHaloExchange::Impl::communicateHaloForces()).
* At the same time, we need to
** skip CPU-side force buffer clearing if there are no CPU forces computed
** check all code-paths and make sure we cannot end up with reduction kernels accumulating into uninitialized buffers.
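
The clearing requirements above can be made concrete with a small CPU model of a reduction. This is a sketch under stated assumptions, not the existing kernel: @reduceForces@ and its parameters are hypothetical names, and the real buffers use different layouts. It shows which buffers must hold valid data in each case: with @accumulate == false@ the output is overwritten and needs no clearing, while with @accumulate == true@ the output must have been initialized beforehand.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec3 = std::array<float, 3>;

// Hypothetical CPU model of a force reduction into the total (rvec-like) force buffer.
// - accumulate == false: totalForce is overwritten, so it needs no prior clearing.
// - accumulate == true:  totalForce must already hold valid (cleared or reduced) data.
// - cpuForce == nullptr: no CPU forces were computed, so that buffer is neither
//   read nor required to have been cleared.
void reduceForces(const std::vector<Vec3>& gpuForce,
                  const std::vector<Vec3>* cpuForce,
                  bool                     accumulate,
                  std::vector<Vec3>&       totalForce)
{
    assert(gpuForce.size() == totalForce.size());
    assert(cpuForce == nullptr || cpuForce->size() == totalForce.size());

    for (std::size_t i = 0; i < totalForce.size(); ++i)
    {
        for (int d = 0; d < 3; ++d)
        {
            float sum = gpuForce[i][d];
            if (cpuForce != nullptr)
            {
                sum += (*cpuForce)[i][d];
            }
            if (accumulate)
            {
                sum += totalForce[i][d];
            }
            totalForce[i][d] = sum;
        }
    }
}
```

Making the accumulate/overwrite decision explicit at the call site is one way to audit the code-paths: every call with @accumulate == true@ must be preceded by either a clear or an earlier overwriting reduction.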

h2. Investigate GPU f buffer ops use cases

Check whether there are any performance benefits to be had, and in which regimes, for x / f buffer ops without GPU update in:
* runs with DD and CPU update
** x buffer ops: offloadable with a likely simple crossover heuristic threshold; i.e. below N atoms/core not offloaded (locals or also nonlocals, with/without CPU work?)
** f buffer ops: the heuristic likely needs more complex criteria (as it is combined with reductions)
* runs with / without DD and vsites
** with GPU update: requires D2H and H2D -- is it worth it? Test use-cases (e.g. multiple ranks per GPU, both ensemble and DD runs, transfers might be overlapped)
** without GPU update: the same applies as for the non-vsites runs above, except that the wait on D2H needs to happen earlier
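
The "simple crossover heuristic" mentioned for x buffer ops could look roughly like the sketch below. The function name @shouldOffloadXBufferOps@ and the parameter @minAtomsPerCore@ are hypothetical, and the threshold value itself is exactly what the benchmarks proposed above would need to determine; nothing here reflects a measured crossover point.

```cpp
#include <cassert>

// Hypothetical crossover heuristic: offload x buffer ops to the GPU only when
// each CPU core would otherwise handle at least minAtomsPerCore atoms.
// Assumption: minAtomsPerCore is a placeholder to be tuned from benchmarks.
bool shouldOffloadXBufferOps(int numAtoms, int numCores, int minAtomsPerCore)
{
    assert(numCores > 0);
    // Compare numAtoms / numCores >= minAtomsPerCore without integer division,
    // widening to avoid overflow in the product.
    return numAtoms >= static_cast<long long>(minAtomsPerCore) * numCores;
}
```

For example, with a placeholder threshold of 1000 atoms/core, 10000 atoms on 4 cores (2500 atoms/core) would be offloaded, while 2000 atoms on 4 cores (500 atoms/core) would stay on the CPU. Whether the atom count should cover locals only or also nonlocals, and how concurrent CPU work shifts the threshold, are the open questions listed above.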

<Tasks to be added here>