Feature #2885

Feature #2816: GPU offload / optimization for update & constraints, buffer ops and multi-GPU communication

CUDA version of LINCS

Added by Artem Zhmurov about 1 year ago. Updated 2 months ago.

core library


Adapt the LINCS constraints to work efficiently on CUDA-enabled GPUs.


  • A separate class that contains the logic.
  • Reduction for the virial using shuffle.
  • Many-GPU version.
  • Free energy.
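
For the shuffle-based virial reduction item, a minimal host-side sketch of the pattern may help. It emulates in plain C++ what a warp-level sum with `__shfl_down_sync` does on the device: at each step every lane adds the value held by the lane `offset` positions above it, and the offset is halved until lane 0 holds the total. The names `c_warpSize` and `warpReduceSumEmulated` are made up for this illustration and are not taken from the GROMACS sources. One virial component is shown; the assumption is that the other components would be reduced the same way.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical constant for this sketch: number of lanes in a CUDA warp.
constexpr std::size_t c_warpSize = 32;

// Host-side emulation of a warp sum using the shuffle-down pattern.
// On the GPU each lane keeps its value in a register and reads the value
// of lane (lane + offset); here the lanes are simply array slots.
inline float warpReduceSumEmulated(std::array<float, c_warpSize> lanes)
{
    static_assert((c_warpSize & (c_warpSize - 1)) == 0, "warp size must be a power of 2");
    for (std::size_t offset = c_warpSize / 2; offset > 0; offset /= 2)
    {
        // Snapshot of what every lane would receive from the shuffle.
        std::array<float, c_warpSize> received = lanes;
        for (std::size_t lane = 0; lane < c_warpSize; ++lane)
        {
            const std::size_t source = lane + offset;
            // A lane reading past the end of the warp gets its own value back.
            received[lane] = (source < c_warpSize) ? lanes[source] : lanes[lane];
        }
        for (std::size_t lane = 0; lane < c_warpSize; ++lane)
        {
            lanes[lane] += received[lane];
        }
    }
    // Only lane 0 is guaranteed to hold the full sum.
    assert(c_warpSize > 0);
    return lanes[0];
}
```

Only lane 0 ends up with the complete sum, so in a real kernel that lane would be the one to write the result out (for example with a single atomic add per warp instead of one per thread).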

Ideas for kernel improvement:

  • Use an analytical solution for the inversion of matrix A (for the small matrices of H-bond constraints). The inverted matrix itself can be reused rather than recomputed.
  • Move more data to local/shared memory and try to get rid of atomics (at least on the device level).
  • Use locality of coupled constraints better (maybe go from block-sync to warp-sync).
  • Introduce a mapping of thread id to both a single constraint and a single atom, thus designating Nth threads to deal with Nat <= Nth coupled atoms and Nc <= Nth coupled constraints.
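
To illustrate the first idea above, here is a rough sketch of a closed-form inverse for the smallest coupled case, two constraints sharing an atom. It assumes the usual LINCS formulation, in which the matrix to invert is I - A with A symmetric and zero on the diagonal, so a single coupling coefficient `a` describes the whole 2x2 system. `Matrix2` and `invertCoupling2x2` are hypothetical names for this example only.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical helper type for this sketch: a dense 2x2 matrix.
using Matrix2 = std::array<std::array<double, 2>, 2>;

// Closed-form inverse of (I - A) for two coupled constraints, where
//   I - A = [  1  -a ]
//           [ -a   1 ]
// The inverse is 1 / (1 - a^2) * [ 1  a ; a  1 ].
inline Matrix2 invertCoupling2x2(double a)
{
    const double det = 1.0 - a * a;
    // The closed form breaks down when the matrix is (nearly) singular.
    assert(std::abs(det) > 1e-12);
    const double invDet = 1.0 / det;

    Matrix2 inverse{};
    inverse[0][0] = invDet;
    inverse[0][1] = a * invDet;
    inverse[1][0] = a * invDet;
    inverse[1][1] = invDet;
    return inverse;
}
```

If the coupling coefficients do not change between steps, the result could be computed once and stored, which is the reuse the bullet refers to.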


  • Initial integration to the constraints test.
  • Add bigger systems to test virial reduction and overall redistribution of constraints among threads.
  • Generalization of tests for different platforms.

Related issues

Related to GROMACS - Feature #2886: CUDA version of SETTLE (Closed)
Related to GROMACS - Feature #2887: CUDA version of Leap Frog algorithm (Closed)
Related to GROMACS - Feature #2888: CUDA Update and Constraints module (Closed)

Associated revisions

Revision 0a1aae78 (diff)
Added by Artem Zhmurov 11 months ago

CUDA version of LINCS constraints.

Implementation of the LINCS constraints for NVIDIA GPUs.
Currently works isolated from the other parts of the code:
coordinates and velocities are copied to and from GPU on
every integration timestep. Part of the GPU-only loop.
Loosely based on change 9162 by Alan Gray. To enable,
set the environment variable GMX_LINCS_GPU.
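
As an illustration of an environment-variable switch like the one mentioned above, a minimal sketch using only the C++ standard library follows. It assumes that the mere presence of the variable enables the feature, whatever its value. The commit message does not say how GROMACS actually reads it, and `isEnabledByEnvironment` is a hypothetical helper, not a GROMACS function.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: returns true when the named environment variable
// is set at all. Assumption: the value itself is ignored.
inline bool isEnabledByEnvironment(const char* variableName)
{
    assert(variableName != nullptr);
    return std::getenv(variableName) != nullptr;
}
```

A caller would then use something like `isEnabledByEnvironment("GMX_LINCS_GPU")` to choose between the CPU and GPU code paths.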

1. Works only if the constraints can be split in short
uncoupled groups (currently < 256, designed for H-bonds
2. Does not change the matrix inversion order for constraints
3. Does not support free energy computations.
4. Assumes no communication between domains (i.e. assumes that
there are no constraints connecting atoms from two different
5. Number of threads per block should be a power of 2 for
reduction of virial to work.
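
The power-of-2 requirement in the last item can be seen in a small host-side sketch. It assumes the reduction is a halving-stride tree of the kind commonly used with shared memory, which the commit message does not spell out. With a size that is not a power of 2, some slots are never added in. `isPowerOfTwo` and `treeReduceSum` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True when n has exactly one bit set.
inline bool isPowerOfTwo(std::size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

// Host-side emulation of a halving-stride tree reduction: at each step
// slot i accumulates slot (i + stride), then the stride is halved.
// The result is the full sum only if values.size() is a power of 2.
inline float treeReduceSum(std::vector<float> values)
{
    assert(!values.empty());
    for (std::size_t stride = values.size() / 2; stride > 0; stride /= 2)
    {
        for (std::size_t i = 0; i < stride; ++i)
        {
            values[i] += values[i + stride];
        }
    }
    return values[0];
}
```

With eight values of 1.0 the function returns 8, as expected. With six values it returns 4: after the stride-3 step the stride becomes 1, so slot 2 is never folded into slot 0.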

1. Move more data from the global memory to local.
2. Change .at() to []
3. Add sorting by the number of coupled constraints to decrease
warp divergence.
4. numAtoms should be changeable (for multi-GPU case).
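
Item 3 could look roughly like the sketch below, which builds a permutation that places constraints with similar numbers of coupled neighbours next to each other. The idea is that threads in the same warp then run loops of similar length. This is only one possible approach under that assumption; `orderByCoupledCount` is a hypothetical name, and how the real data layout would be permuted is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns constraint indices ordered by decreasing number of coupled
// constraints. Ties keep their original relative order (stable sort).
inline std::vector<std::size_t> orderByCoupledCount(const std::vector<int>& coupledCounts)
{
    std::vector<std::size_t> order(coupledCounts.size());
    std::iota(order.begin(), order.end(), std::size_t{ 0 });
    std::stable_sort(order.begin(), order.end(),
                     [&coupledCounts](std::size_t lhs, std::size_t rhs) {
                         return coupledCounts[lhs] > coupledCounts[rhs];
                     });
    assert(order.size() == coupledCounts.size());
    return order;
}
```

The returned indices would then be used to reorder the per-constraint arrays before they are copied to the device.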

Refs #2816, #2885

Change-Id: I3c975cf898053b7467bcd30459e60ce2c8852be6

Revision 747c371c (diff)
Added by Artem Zhmurov 9 months ago

Memory management fixes in CUDA version of LINCS

This fix is to prepare LINCS to run with DD.

1. The masses array size depends on the current number of atoms
rather than on the number of constraints.
2. The size of other arrays should be based on the number of
threads launched on the GPU, which include padding added to
align coupled constraints with the thread blocks. Also
renamed variable according to conventions.
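
The padding described in item 2 can be sketched as follows, assuming each group of coupled constraints must fit entirely inside one thread block. Whenever the next group would straddle a block boundary, the remaining threads of the current block are left idle and the group starts in the next block. The total number of threads, rounded up to whole blocks, is then the size the per-thread arrays need. `paddedThreadCount` is a hypothetical name; the actual GROMACS code may organize this differently.

```cpp
#include <cassert>
#include <vector>

// Number of threads to launch when no group of coupled constraints may
// straddle a block boundary. groupSizes holds the number of constraints
// in each coupled group; the result is a multiple of blockSize.
inline int paddedThreadCount(const std::vector<int>& groupSizes, int blockSize)
{
    assert(blockSize > 0);
    int nextThread = 0;
    for (const int size : groupSizes)
    {
        // A group larger than a block cannot be placed under this scheme.
        assert(size > 0 && size <= blockSize);
        const int usedInBlock = nextThread % blockSize;
        if (usedInBlock + size > blockSize)
        {
            // Pad: skip the idle threads left in the current block.
            nextThread += blockSize - usedInBlock;
        }
        nextThread += size;
    }
    // Round up to a whole number of blocks.
    return ((nextThread + blockSize - 1) / blockSize) * blockSize;
}
```

For example, groups of sizes 3, 2, 2 and 1 with a block size of 4 need 12 threads for 8 constraints, which shows why sizing the arrays by the number of constraints alone would be too small.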

Refs #2885 and #2888

Change-Id: I20cb53ebc6da6a1ff2ee1e385613b27c4a01d11f

Revision af1e0e7e (diff)
Added by Artem Zhmurov 2 months ago

Rename LincsCuda into LincsGpu

This is to follow general naming conventions across the code.

Refs #2885, #2888.

Change-Id: Ifa7e3febeff1d958155ed02daa97d26e828e8381


#1 Updated by Artem Zhmurov about 1 year ago

#2 Updated by Artem Zhmurov about 1 year ago

  • Related to Feature #2887: CUDA version of Leap Frog algorithm added

#3 Updated by Artem Zhmurov about 1 year ago

  • Related to Feature #2888: CUDA Update and Constraints module added

#4 Updated by Artem Zhmurov about 1 year ago

  • Description updated (diff)

#5 Updated by Artem Zhmurov about 1 year ago

  • Description updated (diff)

#6 Updated by Artem Zhmurov 4 months ago

  • Description updated (diff)
  • Target version changed from 2020 to 2021-infrastructure-stable

Most of the features are done for 2020; the rest is bumped to 2021.

#7 Updated by Artem Zhmurov 2 months ago

  • Status changed from New to Closed

Initial implementation and integration are done. Possible improvements moved to #3114.
