Task #3114

Feature #2816: GPU offload / optimization for update & constraints, buffer ops and multi-GPU communication

Feature #2888: CUDA Update and Constraints module

Possible improvements to update-constraints

Added by Artem Zhmurov about 1 month ago. Updated 1 day ago.

Status: New
Priority: Low
Assignee:
Category: -
Target version: -
Difficulty: simple

Description

Performance.

  • Merge Leap-Frog, LINCS and SETTLE kernels. The kernel launch overhead can be significant. Merging in SETTLE is trivial; merging in LINCS will require reorganizing the kernel.
  • Do not save intermediate coordinates if there are no constraints. The coordinates before and after the update are needed simultaneously only when constraints are evaluated. Hence, the coordinates before the update should be kept only when there are constraints in the system.
  • LINCS kernel: Warp-level synchronization for coupled constraints. Consider inserting gaps at the end of each warp rather than at the end of each block. This way no constraint will connect atoms from different warps, so warp-level synchronization can be used.
  • LINCS kernel: Move more data to local/shared memory. Some intermediate values (e.g. matrix A) are currently stored in global memory. Keeping them in shared memory should improve performance.
  • LINCS kernel: Use an analytical solution for the inversion of matrix A? This can be beneficial with H-bonds-only constraints, where matrix A is small.
  • SETTLE kernel: Read only one index per water molecule. If the atoms of a molecule are consecutive in memory, only one index is needed.
  • SETTLE kernel: Use a different ordering for matrices. Some matrices can be transposed to make better use of vector operations.
  • Virial reduction: use shuffle reduction.
  • Virial reduction: reduce the number of atomic operations.
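
The warp-level padding proposed for the LINCS kernel can be sketched on the host side as follows. This is only an illustration, not the existing GROMACS code: the names PaddedLayout, padGroupsToWarps and c_warpSize are hypothetical, and the sketch assumes that every group of coupled constraints fits into a single warp of 32 threads.

```cpp
#include <cassert>
#include <vector>

// Assumed warp width (32 on current NVIDIA hardware).
constexpr int c_warpSize = 32;

// Hypothetical result type: first thread index of each group and the
// total number of threads, including the padding gaps.
struct PaddedLayout
{
    std::vector<int> groupStart;
    int              numThreads;
};

// Assigns one thread per constraint. Whenever a group of coupled constraints
// would straddle a warp boundary, a gap is inserted so that the group starts
// at the beginning of the next warp.
PaddedLayout padGroupsToWarps(const std::vector<int>& groupSizes)
{
    PaddedLayout layout{ {}, 0 };
    layout.groupStart.reserve(groupSizes.size());
    int next = 0;
    for (int size : groupSizes)
    {
        // Assumption of this sketch: a group never exceeds one warp.
        assert(size > 0 && size <= c_warpSize);
        const int lane = next % c_warpSize;
        if (lane + size > c_warpSize)
        {
            next += c_warpSize - lane; // gap up to the next warp boundary
        }
        layout.groupStart.push_back(next);
        next += size;
    }
    layout.numThreads = next;
    return layout;
}
```

With such a layout, all constraints coupled to each other are handled by threads of the same warp, which is the precondition for replacing block-level synchronization with warp-level synchronization. Groups larger than a warp would still need a separate path, which this sketch does not cover.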

Code organization.

  • Reconsider the naming of the coordinate buffers in constraints. Currently, the variable naming inside the constraints is identical to the CPU code: x is used for the coordinates before the update, and xp is used to read the intermediate coordinates and to save the final ones. This can be confusing when constructing the function call. Consider re-evaluating the naming or using different names to avoid confusion with the CPU code.
  • Use the same parameters and parameter initialization in the GPU and CPU versions of SETTLE. Currently, the CPU and GPU versions use very similar data structures, and their initialization procedures are essentially duplicated. The common parts should be separated into a new data structure. (https://gerrit.gromacs.org/#/c/gromacs/+/10744/, https://gerrit.gromacs.org/#/c/gromacs/+/10758/).
  • Reconsider naming. Use the GPU suffix instead of CUDA, since there are plans to generalize the implementations to other platforms.
  • Virial reduction. Make a proper unified virial reduction. Although it is not called on every step, this can improve performance if implemented properly (shuffle reduction, fewer atomic operations).
  • Unify the PBC management. LINCS, SETTLE and Leap-Frog all use similar routines to manage the PBC data. These should be unified.
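
The data flow of the shuffle reduction mentioned for the virial can be illustrated with a host-side emulation. This is a sketch only: warpReduceSumEmulated and c_warpSize are hypothetical names, and the loop merely mimics what a sequence of CUDA __shfl_down_sync calls would do within one warp of 32 lanes.

```cpp
#include <array>

// Assumed warp width (32 on current NVIDIA hardware).
constexpr int c_warpSize = 32;

// Host-side emulation of a shuffle-down tree reduction over one warp.
// In each round every lane adds the value held by the lane `offset` positions
// above it; after log2(c_warpSize) rounds lane 0 holds the full sum.
float warpReduceSumEmulated(std::array<float, c_warpSize> lanes)
{
    for (int offset = c_warpSize / 2; offset > 0; offset /= 2)
    {
        // A shuffle reads all source values before any lane is updated,
        // so the emulation works on a snapshot of the previous round.
        const std::array<float, c_warpSize> previous = lanes;
        for (int lane = 0; lane < c_warpSize; lane++)
        {
            const int source = lane + offset;
            // When the source lane is out of range, the shuffle returns the
            // caller's own value. Those lanes do not contribute to lane 0.
            const float received = (source < c_warpSize) ? previous[source] : previous[lane];
            lanes[lane] = previous[lane] + received;
        }
    }
    return lanes[0];
}
```

In a kernel, only lane 0 of each warp would then need to issue an atomic add per virial component, instead of every thread doing so. That is one possible way to address both the shuffle reduction and the reduced number of atomics listed above.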

New features.

  • Multi-GPU support in LINCS.
  • FEP in LINCS. Easy to implement, but the tests will need further improvement.

Associated revisions

Revision 79aab161 (diff)
Added by Artem Zhmurov about 1 month ago

Eliminate D2D copy in update constraints

The intermediate coordinates (x' or xp) are only needed inside
the update-constraints module (for the constraints algorithms)
and never used outside. Hence, the xp variable can be used to
save the coordinates before update, while x stores the final
coordinates. This way, there is no need to make a D2D xp->x
copy after applying the constraints, since x will have the
correct data.

Refs. #2888, #3114.

Change-Id: I363b633976a236a8e2bf2137c21d3bf0a765cb06
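
The buffer roles described in this commit message can be sketched as follows. This is a simplified host-side illustration under stated assumptions, not the actual change: Float3 and updateKeepOld are hypothetical names, and the update is reduced to x + v*dt without forces, masses or PBC handling.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal 3-vector type for this sketch.
struct Float3
{
    float x, y, z;
};

// Simplified position update. The coordinates before the update are saved
// into xp, and the new coordinates are written directly into x. A subsequent
// constraint step can then use xp as the reference geometry and correct x in
// place, so no xp->x copy is needed afterwards.
void updateKeepOld(std::vector<Float3>& x, std::vector<Float3>& xp, const std::vector<Float3>& v, float dt)
{
    assert(x.size() == xp.size() && x.size() == v.size());
    for (std::size_t i = 0; i < x.size(); i++)
    {
        const Float3 old = x[i];
        xp[i]            = old; // keep the pre-update coordinates
        x[i]             = { old.x + v[i].x * dt, old.y + v[i].y * dt, old.z + v[i].z * dt };
    }
}
```

Saving the old coordinates is only useful when a constraint step follows. If there are no constraints in the system, the write to xp could be skipped, in line with the corresponding item in the description.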

History

#1 Updated by Szilárd Páll about 1 month ago

I suggest splitting things: code quality improvements and performance improvements would best be gathered separately (unless the former results in the latter). The performance improvements should be a list with order/priority markings, because we want to know which low-hanging-fruit optimizations could have an impact and may be doable before RC.

#2 Updated by Artem Zhmurov about 1 month ago

  • Description updated (diff)

#3 Updated by Artem Zhmurov about 1 month ago

  • Description updated (diff)

#4 Updated by Szilárd Páll 1 day ago

Overlapping H2D of CPU force contributions with GPU force compute seems related, but perhaps doesn't directly belong here, so I'll file a separate redmine.

#5 Updated by Szilárd Páll 1 day ago

Szilárd Páll wrote:

Overlapping H2D of CPU force contributions with GPU force compute seems related, but perhaps doesn't directly belong here, so I'll file a separate redmine.

Actually, it is rather related to the F buffer ops.
