Task #2818

bonded GPU kernel fusion

Added by Szilárd Páll 3 months ago. Updated 17 days ago.

Status: New
Priority: Normal
Assignee: -
Category: mdrun
Target version: -
Difficulty: uncategorized

Description

The launch overhead of the bonded kernels often becomes significant enough to outweigh any benefit of GPU offload. This could be mitigated with a few optimizations, most importantly kernel fusion.

Multiple approaches are possible:
  • a simple decomposition of different types of bonded interactions over different blocks (see the sketch after this list)
  • a more locality-aware decomposition would, however, be beneficial: interactions that share coordinates are computed in the same block.
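
A minimal CUDA sketch of the simpler per-block decomposition (all names here, e.g. BondedWorkUnit and the compute_*_block routines, are hypothetical and not GROMACS code); a single fused launch replaces one launch per interaction type:

    // Hypothetical sketch: one fused kernel, each block computes one interaction type.
    #include <cuda_runtime.h>

    enum InteractionType { Bonds = 0, Angles, Dihedrals, NumTypes };

    struct BondedWorkUnit
    {
        InteractionType type;   // interaction type this block computes
        int             first;  // first interaction index for this block
        int             count;  // number of interactions in this block
    };

    // Per-type device routines; bodies elided, placeholders only.
    __device__ void compute_bonds_block(const BondedWorkUnit&, const float4* xq, float3* f) {}
    __device__ void compute_angles_block(const BondedWorkUnit&, const float4* xq, float3* f) {}
    __device__ void compute_dihedrals_block(const BondedWorkUnit&, const float4* xq, float3* f) {}

    // A single launch replaces one launch per interaction type, so the kernel
    // launch overhead is paid once instead of once per type.
    __global__ void fusedBondedKernel(const BondedWorkUnit* workUnits,
                                      const float4*         xq, // coordinates (+charge)
                                      float3*               f)  // forces
    {
        const BondedWorkUnit wu = workUnits[blockIdx.x];
        switch (wu.type)
        {
            case Bonds:     compute_bonds_block(wu, xq, f); break;
            case Angles:    compute_angles_block(wu, xq, f); break;
            case Dihedrals: compute_dihedrals_block(wu, xq, f); break;
            default: break;
        }
    }

The locality-aware variant would instead build the work units from coordinate ranges, grouping whatever interaction types touch the same atoms into one block.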

Additionally, update groups could also be incorporated into the decomposition (e.g. through block sorting, which should not be difficult if coordinate ranges are already the unit of work) to prioritize work on the critical path, both for staggered update and for DD runs; a sketch of such sorting follows below.
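
A hedged host-side sketch of that sorting idea, reusing the hypothetical BondedWorkUnit above; criticalityOf() is an assumed classifier (not GROMACS API), and the point is only that blocks are dispatched roughly in index order, so ordering the work list biases scheduling toward critical-path work:

    // Hypothetical host-side helper: order work units so that blocks on the
    // critical path (e.g. non-local/halo work in DD runs, or atoms needed
    // first by a staggered update) are launched first.
    #include <algorithm>
    #include <vector>

    // Assumed classifier: lower value = more critical.
    int criticalityOf(const BondedWorkUnit& wu);

    void sortWorkUnitsForCriticalPath(std::vector<BondedWorkUnit>* workUnits)
    {
        std::stable_sort(workUnits->begin(), workUnits->end(),
                         [](const BondedWorkUnit& a, const BondedWorkUnit& b) {
                             return criticalityOf(a) < criticalityOf(b);
                         });
    }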


Related issues

Related to GROMACS - Task #2694: bonded CUDA kernels (Closed)
Related to GROMACS - Task #2675: bonded CUDA offload task (In Progress)

History

#1 Updated by Szilárd Páll 3 months ago

  • Related to Task #2694: bonded CUDA kernels added

#2 Updated by Szilárd Páll 3 months ago

  • Description updated (diff)

#3 Updated by Szilárd Páll 17 days ago

  • Related to Task #2675: bonded CUDA offload task added

#4 Updated by Szilárd Páll 17 days ago

Based on past and recent analysis of the current kernels, the conclusions are:
  • the kernels are severely memory/instruction limited, mostly due to scattered coordinate loads and, to a lesser extent, non-vectorizable parameter loads
  • due to alignment, parameter loads can't always be vectorized (and even when they can, it did not help for angles in the ADH case on a GTX 1070)
  • the best approach would be to load coordinates efficiently, but if time doesn't allow that, at least simple fusion within or across blocks should be implemented -- both would greatly improve cache reuse;
    - fusion within blocks may be more efficient if there is other work to overlap with, but it is not clear how to map different interaction types to warps without leaving many threads idle (see the sketch below).
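
As a rough illustration of the within-block variant (again reusing the hypothetical BondedWorkUnit from the description, not actual GROMACS code), each warp of a block could be given its own work unit and interaction type so that coordinate loads shared between types are reused from cache or shared memory; how to fill the warp assignments without idle threads is exactly the open question above:

    // Hypothetical sketch: fusion within a block, one work unit per warp.
    __global__ void fusedWithinBlockKernel(const BondedWorkUnit* warpWork, // one entry per warp
                                           const float4*         xq,
                                           float3*               f)
    {
        const int warpsPerBlock = blockDim.x / warpSize;
        const int globalWarp    = blockIdx.x * warpsPerBlock + threadIdx.x / warpSize;
        const BondedWorkUnit wu = warpWork[globalWarp];

        // Optionally stage the block's coordinate range in shared memory first,
        // so all warps in the block reuse the same loads (the cache reuse point above).
        switch (wu.type)
        {
            case Bonds:     /* warp-parallel bond loop over [wu.first, wu.first + wu.count) */ break;
            case Angles:    /* warp-parallel angle loop */ break;
            case Dihedrals: /* warp-parallel dihedral loop */ break;
            default: break;
        }
    }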
