bonded GPU kernel fusion
The launch overhead of the bonded kernels often becomes so significant that it outweighs any benefit of GPU offload. This could be mitigated with a few optimizations, most importantly kernel fusion. Multiple approaches are possible:
- a simple decomposition of different types of bonded interactions over different blocks
- a more locality-aware decomposition would, however, be beneficial: interactions that share coordinates would be computed in the same block.
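
As a rough illustration of the first (simple) approach, the host-side sketch below assigns each bonded interaction type a contiguous range of blocks in a single fused launch, and shows the lookup a block would do on entry to find its type and its first interaction. All names here (`BondedType`, `FusedLaunchPlan`, `makeFusedLaunchPlan`, `lookupBlockWork`) are hypothetical and not taken from the existing code, and the one-thread-per-interaction mapping is an assumption.

```cpp
#include <array>
#include <cassert>

// Hypothetical set of bonded interaction types covered by one fused kernel.
enum class BondedType : int { Bonds = 0, Angles = 1, Dihedrals = 2, Count = 3 };
constexpr int c_numTypes = static_cast<int>(BondedType::Count);

struct FusedLaunchPlan
{
    // blockStart[t] is the first block assigned to type t;
    // blockStart[c_numTypes] is the total number of blocks to launch.
    std::array<int, c_numTypes + 1> blockStart{};
};

// Build the plan from per-type interaction counts.
// Assumption: one thread handles one interaction.
inline FusedLaunchPlan makeFusedLaunchPlan(const std::array<int, c_numTypes>& numInteractions,
                                           int                                threadsPerBlock)
{
    assert(threadsPerBlock > 0);
    FusedLaunchPlan plan;
    for (int t = 0; t < c_numTypes; t++)
    {
        // Round up so a partially filled last block is still launched.
        const int numBlocks = (numInteractions[t] + threadsPerBlock - 1) / threadsPerBlock;
        plan.blockStart[t + 1] = plan.blockStart[t] + numBlocks;
    }
    return plan;
}

struct BlockWork
{
    int type;             // index into BondedType
    int firstInteraction; // offset of the block's first interaction within that type
};

// The lookup each block would perform at kernel entry (written as host code here).
inline BlockWork lookupBlockWork(const FusedLaunchPlan& plan, int blockIndex, int threadsPerBlock)
{
    assert(blockIndex >= 0 && blockIndex < plan.blockStart[c_numTypes]);
    int type = 0;
    // Skip types whose block range ends at or before this block (also skips empty types).
    while (blockIndex >= plan.blockStart[type + 1])
    {
        type++;
    }
    return { type, (blockIndex - plan.blockStart[type]) * threadsPerBlock };
}
```

With only a handful of types a linear scan over `blockStart` is cheap. The locality-aware variant would replace the per-type ranges with per-coordinate-range work units, which this sketch does not attempt.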
Additionally, update groups could be taken into account in the decomposition (e.g. through block sorting, which should not be difficult if coordinate ranges are already the unit of work) to prioritize work on the critical path for staggered update as well as for domain decomposition (DD) runs.
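
The block sorting idea could look roughly like the sketch below, assuming each unit of work is a coordinate range carrying a priority derived from its update group. The `CoordRangeWork` struct, its `priority` field, and `sortBlocksByPriority` are hypothetical names for illustration. Note that the GPU programming model does not guarantee block execution order, so placing urgent work at low block indices is a heuristic rather than a guarantee.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical unit of work: a contiguous coordinate range handled by one block.
struct CoordRangeWork
{
    int firstAtom;
    int numAtoms;
    int priority; // assumed convention: 0 = on the critical path, larger = less urgent
};

// Reorder work units so that critical-path ranges get the lowest block indices.
// stable_sort keeps the original (locality-friendly) order within each priority.
inline void sortBlocksByPriority(std::vector<CoordRangeWork>& work)
{
    std::stable_sort(work.begin(), work.end(),
                     [](const CoordRangeWork& a, const CoordRangeWork& b) {
                         return a.priority < b.priority;
                     });
}
```

The sort would only need to be redone when the decomposition changes, not on every step.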