Feature #2816: Device-side update & constraints, buffer ops and multi-GPU comms
CUDA version of LINCS
Adapt the LINCS constraints to work efficiently on CUDA-enabled GPUs.
A separate class that contains the logic.
- Reduction for the virial using shuffle.
- Many-GPU version.
- Free energy.
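The shuffle-based reduction mentioned above can be sketched on the host as follows. This is only an emulation of the pattern in plain C++, not the code from the change: the names warpReduceSumEmulated and c_warpSize are hypothetical, a warp size of 32 is assumed, and a single scalar stands in for one virial component. On the device, the inner loop would collapse to one register exchange per step, for example value += __shfl_down_sync(0xffffffff, value, offset).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Assumption: 32 lanes per warp, as on current NVIDIA GPUs.
constexpr std::size_t c_warpSize = 32;

// Host-side emulation of a warp-level tree reduction (hypothetical helper).
// Each lane starts with its own partial contribution to one virial component.
// After log2(32) = 5 steps, lane 0 holds the sum over all lanes.
inline float warpReduceSumEmulated(std::array<float, c_warpSize> lanes)
{
    static_assert((c_warpSize & (c_warpSize - 1)) == 0, "warp size must be a power of two");

    for (std::size_t offset = c_warpSize / 2; offset > 0; offset /= 2)
    {
        // Snapshot of the previous step: on the device all lanes exchange
        // registers simultaneously, so reads must not see this step's writes.
        const std::array<float, c_warpSize> previous = lanes;
        for (std::size_t lane = 0; lane + offset < c_warpSize; ++lane)
        {
            // Lane i adds the value held by lane i + offset.
            lanes[lane] = previous[lane] + previous[lane + offset];
        }
        // Lanes with no in-range source are left alone here; their contents
        // are never used, because only lane 0 is read at the end.
    }
    return lanes[0];
}
```

With this pattern only one lane per warp needs to issue an atomic add (or a write to shared memory) for each virial component, instead of every thread doing so.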
Ideas for kernel improvement:
- Use an analytical solution for the inversion of matrix A (for the small matrices that arise from H-bond constraints); the inverted matrix can then be reused rather than recomputed.
- Move more data to local/shared memory and try to get rid of atomics (at least on the device level).
- Make better use of the locality of coupled constraints (maybe go from block-level to warp-level synchronization).
- Introduce a mapping of each thread ID to both a single constraint and a single atom, so that a group of Nth threads handles Nat <= Nth coupled atoms and Nc <= Nth coupled constraints.
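As an illustration of the analytical-inversion idea, here is a minimal host-side sketch for three mutually coupled constraints (for example, the three H-bonds of a methyl group). It assumes that the matrix to invert has the form I - A, with zeros on the diagonal of A and coupling coefficients off the diagonal, and it uses the textbook adjugate formula. The names Matrix3 and invertOneMinusA are hypothetical and do not come from the change.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Matrix3 = std::array<std::array<double, 3>, 3>;

// Closed-form inverse of M = I - A for a 3x3 coupling matrix A
// (hypothetical helper; assumes M is non-singular).
inline Matrix3 invertOneMinusA(const Matrix3& a)
{
    // Build M = I - A.
    Matrix3 m{};
    for (int i = 0; i < 3; ++i)
    {
        for (int j = 0; j < 3; ++j)
        {
            m[i][j] = (i == j ? 1.0 : 0.0) - a[i][j];
        }
    }

    // Cofactors along the first row, used for the determinant.
    const double c00 = m[1][1] * m[2][2] - m[1][2] * m[2][1];
    const double c01 = m[1][2] * m[2][0] - m[1][0] * m[2][2];
    const double c02 = m[1][0] * m[2][1] - m[1][1] * m[2][0];
    const double det = m[0][0] * c00 + m[0][1] * c01 + m[0][2] * c02;
    assert(std::abs(det) > 1e-12);
    const double invDet = 1.0 / det;

    // Inverse = adjugate / determinant.
    Matrix3 r{};
    r[0][0] = c00 * invDet;
    r[0][1] = (m[0][2] * m[2][1] - m[0][1] * m[2][2]) * invDet;
    r[0][2] = (m[0][1] * m[1][2] - m[0][2] * m[1][1]) * invDet;
    r[1][0] = c01 * invDet;
    r[1][1] = (m[0][0] * m[2][2] - m[0][2] * m[2][0]) * invDet;
    r[1][2] = (m[0][2] * m[1][0] - m[0][0] * m[1][2]) * invDet;
    r[2][0] = c02 * invDet;
    r[2][1] = (m[0][1] * m[2][0] - m[0][0] * m[2][1]) * invDet;
    r[2][2] = (m[0][0] * m[1][1] - m[0][1] * m[1][0]) * invDet;
    return r;
}
```

The returned matrix could be kept and applied wherever the same inverse is needed, which is the reuse the first item above refers to. Whether a closed form beats the existing approach on the device would need to be measured.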
Initial integration into the constraints test.
- Add bigger systems to test virial reduction and overall redistribution of constraints among threads.
- Generalization of tests for different platforms.
The current version of the code is in Gerrit change 9193 (https://gerrit.gromacs.org/#/c/9193/).