bonded GPU kernel fusion
The launch overhead of the bonded kernels often becomes so significant that it outweighs any benefit of GPU offload. This could be mitigated with a few optimizations, most importantly kernel fusion. Multiple approaches are possible:
- a simple decomposition of different types of bonded interactions over different blocks
- a more locality-aware decomposition would however be beneficial: interactions that share coordinates are computed in the same block.
Additionally, update groups could be taken into account in the decomposition (e.g. through block sorting, which should not be difficult if coordinate ranges are already the unit of work) to prioritize work on the critical path for staggered update as well as for DD runs.
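To make the first approach concrete, here is a minimal host-side sketch in C++ of how a single fused launch could assign contiguous block ranges to interaction types. All names (`kBlockSize`, `fusedBlockStarts`, `typeOfBlock`) and the choice of 256 threads per block with one thread per interaction are assumptions for illustration, not existing code. Inside a fused kernel, each block would do the equivalent of `typeOfBlock` on its block index and then branch to the matching interaction code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: names and sizes are illustrative assumptions.
constexpr int kBlockSize = 256; // assumed threads per block, one thread per interaction

// Exclusive prefix sum of the number of blocks needed by each interaction type.
// Type t owns blocks [start[t], start[t + 1]); start.back() is the total grid size.
std::vector<int> fusedBlockStarts(const std::vector<int>& numInteractionsPerType)
{
    std::vector<int> start(numInteractionsPerType.size() + 1, 0);
    for (std::size_t t = 0; t < numInteractionsPerType.size(); ++t)
    {
        // Round up so a partially filled last block is still launched.
        const int numBlocks = (numInteractionsPerType[t] + kBlockSize - 1) / kBlockSize;
        start[t + 1]        = start[t] + numBlocks;
    }
    return start;
}

// Lookup a block would perform at kernel entry: which interaction type owns it.
// Types with zero interactions own an empty range and are skipped automatically.
int typeOfBlock(const std::vector<int>& start, int blockIndex)
{
    // First entry strictly greater than blockIndex; the owning type is just before it.
    const auto it = std::upper_bound(start.begin(), start.end(), blockIndex);
    return static_cast<int>(it - start.begin()) - 1;
}
```

With only a handful of interaction types the lookup could equally be a short linear scan. The locality-aware variant would keep the same launch structure but use coordinate ranges rather than interaction types as the unit assigned to a block.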
#4 Updated by Szilárd Páll 17 days ago
- kernels are severely memory/instruction limited, mostly due to scattered coordinate loads and, to a lesser extent, non-vectorizable parameter loads
- due to alignment, parameter loads can't always be vectorized (and even when they can, it did not help for angles with adh on a 1070)
- the best approach would be to load coordinates efficiently, but if time doesn't allow, at least simple fusion within or across blocks should be implemented; both would greatly improve cache reuse
- fusion within blocks may be more efficient if there is other work to overlap with, but it is not clear how to map different interaction types to warps without leaving lots of threads idle.
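The idle-thread concern can be put in numbers with a small host-side C++ sketch. It assumes one thread per interaction and that each interaction type's work is padded to a multiple of some granularity: 32 for a warp, or the block size when blocks are the unit. The function name and the example counts are hypothetical.

```cpp
#include <vector>

// Hypothetical helper: threads left idle when each interaction type's count is
// padded up to a multiple of `granularity` (assumes one thread per interaction).
int idleThreads(const std::vector<int>& numInteractionsPerType, int granularity)
{
    int idle = 0;
    for (const int n : numInteractionsPerType)
    {
        // Round n up to the next multiple of granularity.
        const int padded = (n + granularity - 1) / granularity * granularity;
        idle += padded - n;
    }
    return idle;
}
```

For made-up counts of 600, 0, 100 and 257 interactions, warp granularity leaves 67 threads idle while 256-thread block granularity leaves 579. That is the appeal of mixing types within a block, but this count ignores any cost of having different warps in a block run different interaction code.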