Feature #3029

Feature #2816: Device-side update & constraints, buffer ops and multi-GPU comms

Feature #2817: GPU X/F buffer ops

GPU force buffer ops + reduction

Added by Szilárd Páll 18 days ago. Updated 18 days ago.

Status:
In Progress
Priority:
High
Assignee:
-
Category:
mdrun
Target version:
Difficulty:
uncategorized

Description

Implement the force buffer layout transform on the GPU (from nbmxn to native layout). As on the CPU, this task could in some cases be combined with the force reduction to produce the final force buffer that the integration can take as input.
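As a rough illustration of the transform and its optional fusion with the reduction, here is a minimal CPU-side reference sketch in C++. It assumes that the nonbonded forces sit in a padded, grid-ordered buffer and that an index map gives, for every atom in native order, its position in that buffer. `Vec3`, `forceBufferOps`, `nativeToGrid` and `accumulate` are hypothetical names chosen for this example, not GROMACS identifiers. A GPU version would presumably run the loop body with one thread per atom.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 3-component force vector; not a GROMACS type.
struct Vec3
{
    float x, y, z;
};

// Copy forces from a padded, grid-ordered buffer into native atom order.
//
// nativeToGrid[i] is the index in gridForces that holds the force of native
// atom i (an assumed index map between the two layouts).
// accumulate == false: transform only, the output is overwritten.
// accumulate == true:  transform fused with reduction, the nonbonded force is
//                      added to whatever is already stored in nativeForces.
void forceBufferOps(const std::vector<Vec3>& gridForces,
                    const std::vector<int>&  nativeToGrid,
                    bool                     accumulate,
                    std::vector<Vec3>&       nativeForces)
{
    assert(nativeForces.size() >= nativeToGrid.size());

    for (std::size_t i = 0; i < nativeToGrid.size(); ++i)
    {
        const Vec3& f = gridForces[nativeToGrid[i]];
        if (accumulate)
        {
            nativeForces[i].x += f.x;
            nativeForces[i].y += f.y;
            nativeForces[i].z += f.z;
        }
        else
        {
            nativeForces[i] = f;
        }
    }
}
```

With `accumulate` set, the output buffer would already hold the other force contributions (for example PME) before the call, so a single pass yields the buffer the integrator consumes.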

Use-cases (TODO list all use-cases/subcases):
  1. buffer ops transform-only on GPU
     • optional D2H transfer of the force buffer (transfer by default)
  2. reduction with other on-GPU forces (currently PME)
     • needs efficient on-device sync
     • needs to make sure that the reduced forces are not D2H transferred
     • needs to consider virial calculation (to be done on the bonded + nonbonded forces)
  3. reduction with CPU-side forces
     • ensure that the H2D copy of CPU-side forces is issued as early as possible
     • also consider the virial contribution
     • evaluate whether this is more efficient to do when the DD halo exchange is doing a staged copy anyway and the forces are already in cache (and need to be in CPU cache for MPI)
General considerations:
  • launch the transform kernel back-to-back after the nonbonded kernel rather than later, next to the CPU buffer ops/reduction
  • evaluate the #atoms threshold below which it is not worth paying the 10-15 us kernel launch overhead (especially for non-local buffer ops)
  • TODO

History

#1 Updated by Szilárd Páll 18 days ago

Szilárd Páll wrote:

  • evaluate the #atoms threshold below which it is not worth paying the 10-15 us kernel launch overhead (especially for non-local buffer ops)

A quick test on a Xeon E5-2620 v4 gives ~25k atoms as the crossover point for F buffer ops (12-14 us), i.e. ~3000 atoms/core (for rather slow cores). This means that, at least in the strong-scaling regime, the nonlocal buffer ops in particular would be better off on the CPU if we don't use direct GPU communication.
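The arithmetic behind the ~3000 atoms/core figure, and one way such a threshold might be applied, can be sketched as follows. The core count of 8 is an assumption (the E5-2620 v4 is an 8-core part, but the comment does not state how many cores the test used). `atomsPerCore` and `useGpuBufferOps` are hypothetical helpers for illustration, not GROMACS functions.

```cpp
#include <cassert>

// Values quoted from the measurement above.
constexpr int c_crossoverAtoms = 25000; // ~25k atoms at the crossover point
constexpr int c_assumedCores   = 8;     // ASSUMPTION: all 8 cores of the E5-2620 v4 were used

// Crossover expressed per core: 25000 / 8 = 3125, i.e. roughly 3000 atoms/core.
constexpr int atomsPerCore(int numAtoms, int numCores)
{
    return numAtoms / numCores;
}

// Hypothetical heuristic: keep buffer ops on the GPU when the forces stay on
// the device anyway (direct GPU communication), otherwise only when there are
// enough atoms per core to amortize the kernel launch overhead.
constexpr bool useGpuBufferOps(int numAtoms, int numCores, int crossoverPerCore, bool directGpuComm)
{
    return directGpuComm || atomsPerCore(numAtoms, numCores) > crossoverPerCore;
}
```

A real threshold would need to be measured per hardware generation; the constants here only restate the single data point reported in this comment.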
