Feature #3142

Feature #2816: GPU offload / optimization for update & constraints, buffer ops and multi-GPU communication

Feature #2817: GPU X/F buffer ops

Feature #3029: GPU force buffer ops + reduction

centralize and clarify GPU force buffer clearing

Added by Szilárd Páll about 1 year ago. Updated 10 months ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
mdrun
Difficulty:
uncategorized

Description

The responsibility of (rvec) force buffer clearing should be moved into StatePropagatorDataGpu and arranged such that it is not a task on the critical path (as it is right now in GpuHaloExchange::Impl::communicateHaloForces()).

At the same time, we need to
  • skip CPU-side force buffer clearing if there are no CPU forces computed
  • check all code paths and make sure we cannot end up with reduction kernels accumulating into uninitialized buffers.

History

#1 Updated by Szilárd Páll about 1 year ago

Note: this is high priority, but we can consider bumping it to beta3, except for the last item and its impact on the buffer ops +/- GPU update code-path.

#2 Updated by Alan Gray about 1 year ago

The responsibility of (rvec) force buffer clearing should be moved into StatePropagatorDataGpu and arranged such that it is not a task on the critical path (as it is right now in GpuHaloExchange::Impl::communicateHaloForces()).

This has been done in the pending change
https://gerrit.gromacs.org/c/gromacs/+/13885

It is now a StatePropagatorDataGpu method operating in the non-local stream.

#3 Updated by Paul Bauer about 1 year ago

  • Target version changed from 2020-beta2 to 2020-beta3

bump

#4 Updated by Szilárd Páll about 1 year ago

  • Status changed from New to In Progress

#5 Updated by Szilárd Páll about 1 year ago

Szilárd Páll wrote:

At the same time, we need to
  • skip CPU-side force buffer clearing if there are no CPU forces computed

Quick tests show that switching off force clearing saves 7% of runtime (with 6k/core); since clearing is typically offloaded, the saving mainly comes from avoiding cache pollution. In addition, switching the force reduction (when it is done on the CPU) to store instead of accumulate saves another 6-8% of runtime (assuming the run is completely GPU-bound). In the tested case that adds up to a potential ~15% performance improvement when there are no CPU force tasks but the reduction is done on the CPU.

This would especially help the default GPU code-path (reductions on CPU with CUDA as well as the OpenCL path).

  • check all code paths and make sure we cannot end up with reduction kernels accumulating into uninitialized buffers.

Mostly done. An issue going forward was identified and filed as a separate Redmine issue (#3216).

#7 Updated by Paul Bauer about 1 year ago

  • Target version changed from 2020-beta3 to 2021-infrastructure-stable

2021 is a more realistic target

#8 Updated by Szilárd Páll about 1 year ago

What Artem suggested is a task/cleanup related to the GPU comm feature, so indeed that should have been done by now.

The rest are WIP (in particular the conditional clearing/reduction) and should be addressed, IMO. It would be a great disservice to our users not to let them get the benefits of existing code (especially if we instead focus on adding more unstable code).

#9 Updated by Alan Gray 10 months ago

  • Status changed from In Progress to Closed
