Task #2678

Task #2675: bonded CUDA offload task

bonded force reduction

Added by Szilárd Páll over 2 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
mdrun
Target version:
Difficulty:
uncategorized

Description

Possible strategies:
* If we schedule the bonded kernel in the same stream as the nonbonded kernels, we can simply reuse the nonbonded force output. This is simple, but the drawback is that the nonbonded and bonded kernels cannot overlap. It also won't work with the current code, which uses per-component intermediate storage.
* If we schedule it in a separate stream, we can:
  - use atomic operations for force accumulation (simple, but likely less efficient)
  - use a separate reduction kernel after the nonbonded kernels complete (possibly using atomics instead with DD, to have the results sooner?)
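
For illustration, here is a minimal CPU-side sketch of the two separate-stream options, written in C++ rather than CUDA so it runs anywhere. It is not GROMACS code: `PairForce`, `atomicAddFloat`, `accumulateAtomic` and `accumulateThenReduce` are hypothetical names, forces are simplified to one scalar component per atom, and the compare-exchange loop only stands in for a device-side atomic add.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bonded pair interaction: +f is applied to atom i, -f to atom j.
struct PairForce
{
    std::size_t i;
    std::size_t j;
    float       f;
};

// CPU stand-in for a GPU atomic add: retry until our update is applied.
inline void atomicAddFloat(std::atomic<float>& target, float value)
{
    float expected = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(expected, expected + value, std::memory_order_relaxed))
    {
        // On failure, `expected` now holds the current value; try again.
    }
}

// Option 1: bonded forces are added atomically into the shared output
// that already holds the nonbonded forces.
std::vector<float> accumulateAtomic(const std::vector<float>&     nonbonded,
                                    const std::vector<PairForce>& bonded)
{
    std::vector<std::atomic<float>> shared(nonbonded.size());
    for (std::size_t k = 0; k < nonbonded.size(); ++k)
    {
        shared[k].store(nonbonded[k], std::memory_order_relaxed);
    }
    for (const PairForce& p : bonded)
    {
        assert(p.i < shared.size() && p.j < shared.size());
        atomicAddFloat(shared[p.i], p.f);
        atomicAddFloat(shared[p.j], -p.f);
    }
    std::vector<float> total(nonbonded.size());
    for (std::size_t k = 0; k < total.size(); ++k)
    {
        total[k] = shared[k].load(std::memory_order_relaxed);
    }
    return total;
}

// Option 2: bonded forces go to their own buffer; a separate reduction
// pass adds that buffer to the nonbonded output afterwards.
std::vector<float> accumulateThenReduce(const std::vector<float>&     nonbonded,
                                        const std::vector<PairForce>& bonded)
{
    std::vector<float> bondedBuffer(nonbonded.size(), 0.0f);
    for (const PairForce& p : bonded)
    {
        assert(p.i < bondedBuffer.size() && p.j < bondedBuffer.size());
        bondedBuffer[p.i] += p.f;
        bondedBuffer[p.j] -= p.f;
    }
    std::vector<float> total(nonbonded);
    for (std::size_t k = 0; k < total.size(); ++k)
    {
        total[k] += bondedBuffer[k]; // the "reduction kernel" step
    }
    return total;
}
```

Both functions should give the same totals for the same inputs, up to floating-point rounding when the atomic version runs concurrently, since atomics do not fix the summation order. The difference is whether the bonded contributions contend for the shared buffer or cost an extra pass over it.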

Note: the current draft contrib code (https://gerrit.gromacs.org/#/c/8460/4/src/gromacs/mdlib/nbnxn_cuda/gpuBondedCUDA.cu) uses a reduction stage after each and every bonded kernel, which would be best reconsidered, but if time is short we could leave it as is.

Associated revisions

Revision b9e713b6 (diff)
Added by Jonathan Vincent about 2 years ago

Add CUDA bonded kernels

CUDA bonded kernels are added for the most common bonded and LJ-14
interactions.
The default auto settings of mdrun offload these interactions
to the GPU when possible.
Currently these interactions are computed in the local or non-local
nbnxn non-bonded streams. We should consider using a separate stream.
This change uses synchronous transfers. A child change will change
these to asynchronous.

Updated release notes and performance guide.

Fixes #2678
Refs #2675

Change-Id: Ifc6d97854cc7afa8526602942ec3b1712ba45bac

History

#1 Updated by Szilárd Páll over 2 years ago

  • Description updated (diff)

#2 Updated by Szilárd Páll over 2 years ago

  • Category set to mdrun
  • Target version set to 2019-beta1

We've looked at the atomics-based reduction and it seems to be a lot faster than the current naive one, so we'll go with that. This means that the remaining task is to decide how we will do the final force reduction. There are two important, distinct cases:
- Without domain decomposition (DD), with NB and PME also running on the same device: could the force output be shared with PME?
- With DD, when non-local forces are needed early for communication: a separate buffer is needed, possibly reduced further on the CPU side.
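
To make the DD case concrete, here is a hedged CPU-side C++ sketch of keeping non-local forces in a separate buffer and then reducing that buffer into a CPU-side array. Everything in it is an assumption made for illustration, not the actual mdrun data layout: the names `AtomForce`, `SplitForces`, `splitByLocality` and `reduceOnCpu` are hypothetical, forces are one scalar per atom, and local atoms are assumed to come first in the index range, followed by non-local ones.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single force contribution on one atom.
struct AtomForce
{
    std::size_t atom;
    float       f;
};

// Separate outputs, so the non-local part can be handed to communication early.
struct SplitForces
{
    std::vector<float> local;    // atoms owned by this rank
    std::vector<float> nonLocal; // halo atoms, indexed relative to numLocal
};

// Route each contribution to the local or non-local buffer.
// Assumes atoms [0, numLocal) are local and [numLocal, numTotal) are non-local.
SplitForces splitByLocality(const std::vector<AtomForce>& contributions,
                            std::size_t                   numLocal,
                            std::size_t                   numTotal)
{
    assert(numLocal <= numTotal);
    SplitForces out{ std::vector<float>(numLocal, 0.0f),
                     std::vector<float>(numTotal - numLocal, 0.0f) };
    for (const AtomForce& c : contributions)
    {
        assert(c.atom < numTotal);
        if (c.atom < numLocal)
        {
            out.local[c.atom] += c.f;
        }
        else
        {
            out.nonLocal[c.atom - numLocal] += c.f;
        }
    }
    return out;
}

// CPU-side reduction: add the separate (GPU-produced) non-local buffer
// into the force array the CPU already holds for the same atoms.
void reduceOnCpu(const std::vector<float>& gpuNonLocal, std::vector<float>& cpuNonLocal)
{
    assert(gpuNonLocal.size() == cpuNonLocal.size());
    for (std::size_t k = 0; k < cpuNonLocal.size(); ++k)
    {
        cpuNonLocal[k] += gpuNonLocal[k];
    }
}
```

The point of the split is only that the non-local buffer can be completed and transferred without waiting for the local work; whether the final sum happens on the device or on the CPU is the open question raised above.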

#3 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '10' for Issue #2678.
Uploader: Szilárd Páll
Change-Id: gromacs~master~Ia19452df74407186aaff350d8df27dfc3d7d359f
Gerrit URL: https://gerrit.gromacs.org/8538

#4 Updated by Szilárd Páll over 2 years ago

  • Status changed from New to Resolved

#5 Updated by Mark Abraham over 2 years ago

  • Status changed from Resolved to Closed

#6 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '1' for Issue #2678.
Uploader: Berk Hess
Change-Id: gromacs~release-2019~Ifc6d97854cc7afa8526602942ec3b1712ba45bac
Gerrit URL: https://gerrit.gromacs.org/8597
