Feature #3029

Feature #2816: Device-side update & constraints, buffer ops and multi-GPU comms

Feature #2817: GPU X/F buffer ops

GPU force buffer ops + reduction

Added by Szilárd Páll 3 months ago. Updated about 1 month ago.

In Progress
Implement the force-buffer layout transform on the GPU (from the nbnxn layout to the native layout); as on the CPU, this task can in some cases be combined with the force reduction to produce the final force buffer that the integration takes as input.
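To make the fused transform + reduction concrete, here is a minimal CPU reference of what such a kernel would compute (names and the exact nbnxn buffer layout are assumptions for illustration, not the GROMACS implementation): forces sit in a padded, cluster-ordered nbnxn buffer, `cell[i]` maps atom `i` to its slot in that buffer, and the gather accumulates into a native-order buffer that may already hold other contributions (e.g. PME).

```cpp
#include <cassert>
#include <vector>

// CPU reference of the fused buffer-ops transform + reduction.
// Hypothetical layout: fNbnxn holds xyz triplets in padded cluster order;
// cell[i] is the nbnxn slot of atom i; fTotal is native atom order and may
// already contain other force contributions that we reduce into.
void transformAndReduceForces(const std::vector<float>& fNbnxn, // nbnxn-ordered forces
                              const std::vector<int>&   cell,   // atom -> nbnxn slot map
                              std::vector<float>&       fTotal) // native-ordered output
{
    const int numAtoms = static_cast<int>(cell.size());
    // On the GPU this loop would be one thread per atom.
    for (int i = 0; i < numAtoms; ++i)
    {
        const int j = cell[i];
        fTotal[3 * i + 0] += fNbnxn[3 * j + 0];
        fTotal[3 * i + 1] += fNbnxn[3 * j + 1];
        fTotal[3 * i + 2] += fNbnxn[3 * j + 2];
    }
}
```

Because the transform is a pure gather-accumulate per atom, fusing the reduction costs no extra memory traffic over the transform alone, which is the motivation for combining them.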

Use-cases (TODO: list all use-cases/subcases):
  1. buffer-ops transform only on GPU
     * optional D2H transfer of the force buffer (transfer by default)
  2. reduce with other on-GPU forces (currently PME)
     * needs efficient on-device sync
     * needs to make sure that the reduced forces are not D2H transferred
     * needs to consider virial calculation (to be done on the bonded + nonbonded forces)
  3. reduction with CPU-side forces
     * ensure that the H2D transfer of the CPU-side forces is issued as early as possible
     * also consider the virial contribution
     * evaluate whether this is more efficient to do when the DD halo exchange is anyway doing a staged copy and the forces are already in cache (and need to be in CPU cache for MPI)
General considerations:
  • launch the transform kernel back-to-back after the nonbonded rather than later, next to the CPU buffer ops/reduction
  • evaluate the #atoms threshold below which it is not worth paying the 10-15 us kernel-launch overhead (especially for the non-local buffer ops)
  • implement on-GPU event sync between the PME and reduction tasks in the nonbonded queue (see the fully functional prototype for the coordinate dependency shared earlier here:
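The on-GPU dependency between the PME force task and the reduction in the nonbonded queue can be expressed with a CUDA event, so no host-side synchronization is needed. A minimal sketch in CUDA-style pseudocode (stream, event, and kernel names are hypothetical; arguments and error handling omitted):

```cuda
// PME stream: compute PME forces, then record completion.
pme_gather_kernel<<<gridPme, blockPme, 0, pmeStream>>>(/* ... */);
cudaEventRecord(pmeForcesReady, pmeStream);

// Nonbonded stream: make the stream wait on the event (resolved on the
// device, the host is not blocked), then launch the buffer-ops/reduction
// kernel that consumes the PME forces.
cudaStreamWaitEvent(nbStream, pmeForcesReady, 0);
buffer_ops_reduce_kernel<<<gridNb, blockNb, 0, nbStream>>>(d_fNbnxn, d_fPme, d_fTotal, numAtoms);
```

The `cudaStreamWaitEvent` call is what keeps the dependency entirely on the device, which matters because a host-side sync would add exactly the kind of latency this task is trying to remove.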


Feature #3052: GPU virial reduction/calculation (New)

Associated revisions

Revision 0a4ca2c4 (diff)
Added by Alan Gray about 1 month ago

PME reduction for CUDA F buffer operations

Enable with GMX_USE_GPU_BUFFER_OPS env variable.

Provides functionality to perform reduction of PME forces in F buffer
ops kernel. Currently active when single GPU performs both PME and PP
(multi-GPU support will follow in a patch which performs PME/PP comms
direct between GPUs). When active, Device->Host copy of PME force
and CPU-side reduction is disabled.

Implements part of #3029, refs #2817

Change-Id: I3e66b6919c1e86bf0bed42b74136f8694626910b

Revision c5bcd713 (diff)
Added by Szilárd Páll 7 days ago

Reorganize on-GPU PME force reduction flag handling

Instead of passing around a flag everywhere that tells PME whether
forces are reduced on GPU or CPU (and whether transfer needs to happen
for the latter), we pass the flag once when configuring PME for
the next step and store it internally.

Refs #3029

Change-Id: I81fa2dc93dd979e2b85b4d7fe8cf266a3fde9b8f


#1 Updated by Szilárd Páll 3 months ago

Szilárd Páll wrote:

  • evaluate the #atoms threshold below which it is not worth paying the 10-15 us kernel-launch overhead (especially for the non-local buffer ops)

Quick test on a Xeon E5-2620v4 gives ~25k atoms as the crossover point for F buffer ops (12-14 us), i.e. ~3000 atoms/core (for rather slow cores). This means that, at least in the strong-scaling regime, if we don't use direct GPU communication, at least the non-local buffer ops would be better off on the CPU.
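These measurements suggest a simple size-based heuristic (25k atoms over the 8 cores of an E5-2620v4 is ~3100 atoms/core, consistent with the figure above). A hedged sketch of such a heuristic, with a hypothetical constant and function name, noting that the crossover is machine-specific and that direct GPU communication forces the GPU path regardless of size:

```cpp
#include <cassert>

// Measured crossover on one machine (Xeon E5-2620v4); machine-specific,
// would need to be tuned or estimated per hardware combination.
constexpr int c_gpuBufferOpsAtomThreshold = 25000;

// Hypothetical heuristic: run buffer ops on the GPU only when the atom
// count is large enough to amortize the ~12-14 us kernel-launch overhead,
// unless direct GPU communication requires the forces to stay on-device.
bool useGpuBufferOps(int numAtoms, bool haveDirectGpuComm)
{
    return haveDirectGpuComm || numAtoms >= c_gpuBufferOpsAtomThreshold;
}
```

With such a rule, small non-local force buffers in the strong-scaling regime would fall back to the CPU path, matching the conclusion above.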

#2 Updated by Szilárd Páll about 1 month ago

  • Description updated (diff)
