Improve GPU update-constraints module
- Merge Leap-Frog, LINCS and SETTLE kernels. The kernel launch overhead can be significant. The merge with SETTLE is trivial; LINCS will require kernel reorganization.
- Do not save intermediate coordinates if there are no constraints. The coordinates before and after the update are needed simultaneously only when constraints are evaluated. Hence, the coordinates before the update should be kept only when there are constraints in the system.
- LINCS kernel: Warp-level synchronization for coupled constraints. Consider inserting gaps at the end of the warp, not at the end of the block. This way no constraint will connect atoms from different warps, and hence warp-level synchronization can be used.
- LINCS kernel: Move more data to local/shared memory. Some intermediate values (e.g. matrix A) are saved to global memory. Using shared memory would improve performance.
- LINCS kernel: Use analytical solution for the matrix A inversion? This can be beneficial with H-Bonds only constraints (matrix A is small).
- SETTLE kernel: Read only one index per water molecule. If the atoms are consecutive in memory, only one index is needed.
- SETTLE kernel: Use different ordering for matrices. Some matrices can be transposed to make better use of vector operations.
- Virial reduction: shuffle reduction.
- Virial reduction: reduce the number of atomic operations.
- Reconsider the naming of the coordinate buffers in constraints. Currently, the variable naming inside the constraints is identical to the CPU code: x is used for the coordinates before the update, while xp is used to read the intermediate coordinates and to save the final ones. This can be confusing when constructing the function call. Consider re-evaluating the naming or using different names to avoid confusion with the CPU code.
- Use the same parameters and parameter initialization in the GPU and CPU versions of SETTLE. Currently, the CPU and GPU versions use very similar data structures and their initialization procedures are essentially duplicated. These should be separated out into a new, shared data structure. (https://gerrit.gromacs.org/#/c/gromacs/+/10744/, https://gerrit.gromacs.org/#/c/gromacs/+/10758/).
- Reconsider naming. Use GPU suffix instead of CUDA, since there are plans to generalize the implementations to other platforms.
- Virial reduction. Make a proper unified virial reduction. Although it is not called on each step, this can improve performance if implemented properly (shuffle reduction, reduce number of atomics).
- Unify the PBC management. LINCS, SETTLE and Leap-Frog all use similar routines to manage the PBC data. These should be unified.
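
The warp-level padding suggested for the LINCS kernel above could be prepared on the host roughly as follows. This is only a sketch under stated assumptions: `padGroupsToWarps` and `c_warpSize` are hypothetical names, not existing GROMACS code, and every group of coupled constraints is assumed to fit into a single warp of 32 threads. Each output slot corresponds to one thread; -1 marks a gap.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Assumed warp size; hypothetical constant, not taken from GROMACS.
constexpr int c_warpSize = 32;

// Hypothetical helper: lays out constraint indices one per thread slot so that
// no group of coupled constraints straddles a warp boundary. Gaps are -1.
std::vector<int> padGroupsToWarps(const std::vector<std::vector<int>>& coupledGroups)
{
    std::vector<int> slots;
    for (const auto& group : coupledGroups)
    {
        const int groupSize = static_cast<int>(group.size());
        if (groupSize > c_warpSize)
        {
            throw std::invalid_argument("coupled group does not fit into one warp");
        }
        // Number of slots already occupied in the current warp.
        const int usedInWarp = static_cast<int>(slots.size() % c_warpSize);
        if (usedInWarp + groupSize > c_warpSize)
        {
            // The group would cross the warp boundary: fill the rest of the warp with gaps.
            slots.insert(slots.end(), static_cast<std::size_t>(c_warpSize - usedInWarp), -1);
        }
        slots.insert(slots.end(), group.begin(), group.end());
    }
    // Pad the last warp so that the total is a multiple of the warp size.
    const int tail = static_cast<int>(slots.size() % c_warpSize);
    if (tail != 0)
    {
        slots.insert(slots.end(), static_cast<std::size_t>(c_warpSize - tail), -1);
    }
    assert(slots.size() % c_warpSize == 0);
    return slots;
}
```

With such a layout, threads that hold coupled constraints always share a warp, which is the precondition for replacing block-level synchronization with warp-level synchronization. The cost is more idle threads than with padding at the block end only.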
Multi-GPU support in LINCS.
- FEP in LINCS. Easy to implement, but tests should be improved yet again.
Eliminate D2D copy in update constraints
The intermediate coordinates (x' or xp) are only needed inside
the update-constraints module (for the constraints algorithms)
and never used outside. Hence, the xp variable can be used to
save the coordinates before update, while x stores the final
coordinates. This way, there is no need to make a D2D xp->x
copy after applying the constraints, since x will have the
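
The role swap described above (xp keeps the coordinates before the update, x receives the new ones) can be illustrated with a host-side sketch. This is not the GROMACS implementation: `CoordinateBuffers` and `leapFrogInPlace` are hypothetical names, plain `std::vector<float>` stands in for the device buffers, and the integrator is reduced to x += v*dt with no forces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the device coordinate buffers.
struct CoordinateBuffers
{
    std::vector<float> x;  // current coordinates; holds the updated values after the step
    std::vector<float> xp; // scratch buffer; holds the coordinates before the update
};

// Simplified update (assumption: positions advance by v*dt only).
// The old value is stashed in xp and the new value is written straight into x,
// so no xp->x copy is needed afterwards.
void leapFrogInPlace(CoordinateBuffers* buffers, const std::vector<float>& v, float dt)
{
    assert(buffers->x.size() == v.size());
    buffers->xp.resize(buffers->x.size());
    for (std::size_t i = 0; i < buffers->x.size(); ++i)
    {
        buffers->xp[i] = buffers->x[i]; // keep the pre-update coordinate as reference
        buffers->x[i] += v[i] * dt;     // write the updated coordinate into x
    }
}
```

A constraint routine would then read xp as the reference and correct x in place. When there are no constraints, the write to xp could presumably be skipped altogether.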
#1 Updated by Szilárd Páll 4 months ago
Suggest splitting things: code quality improvements and performance improvements (unless the former results in the latter) would be best gathered separately. The latter should be a list with order/priority markings, because we want to know which low-hanging-fruit optimizations could have an impact and may be doable before RC.