Feature #2816: Device-side update & constraints, buffer ops and multi-GPU comms
GPU X/F buffer ops
Implement the native to nbat/nbnxn layout transforms/"buffer ops" in the nbnxn_gpu module (
- Implementing the coordinate (X) transform on the GPU will allow transferring only the native layout. This may not make the code faster, considering the CUDA API overheads and that the "extra" H2D transfer is typically overlapped, but it does remove the CPU from the critical path in nonbonded communication, which benefits scaling and direct GPU-GPU communication.
- Similarly, a force (F) layout transform kernel will allow direct force communication. Multiple flavors and implementation strategies are to be considered:
  - transform only (e.g. if there is no other force compute on the GPU)
  - transform + accumulate (accumulate with other force compute outputs)
  - an inline transform function for on-the-fly transform within the nonbonded kernel; in particular at high parallelization, the performance hit in the nonbonded kernel may be less than the cost of launching an extra kernel.
- need to improve/resolve ownership of GPU inputs/outputs
- pinning for search data that is currently not pinned/pinnable
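
To make the X/F buffer ops above concrete, here is a minimal host-side reference sketch in C++ of what such transforms could compute; a GPU kernel would presumably run the loop body with one thread per slot. Everything named here is an assumption for illustration and not the nbnxn_gpu API: the `Vec3`/`Vec4` types, the `slotToAtom` index array (native atom index per nbat slot, negative for padding slots), the `fillerCoord` value, and the `ForceOp` switch covering the "transform only" and "transform + accumulate" flavors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only; not the GROMACS data structures.
struct Vec3 { float x, y, z; };
struct Vec4 { float x, y, z, q; }; // padded 4-float slot; q is left untouched here

// X buffer op: gather native-layout coordinates into the sorted, padded layout.
// slotToAtom[slot] is the native atom index stored in that slot, or a negative
// value for a padding slot (assumed convention).
inline void transformXToNbat(const std::vector<Vec3>& nativeX,
                             const std::vector<int>&  slotToAtom,
                             float                    fillerCoord,
                             std::vector<Vec4>&       nbatX)
{
    assert(nbatX.size() == slotToAtom.size());
    // On a GPU, each iteration would map to one thread.
    for (std::size_t slot = 0; slot < slotToAtom.size(); ++slot)
    {
        const int atom = slotToAtom[slot];
        if (atom >= 0)
        {
            assert(static_cast<std::size_t>(atom) < nativeX.size());
            nbatX[slot].x = nativeX[atom].x;
            nbatX[slot].y = nativeX[atom].y;
            nbatX[slot].z = nativeX[atom].z;
        }
        else
        {
            // Padding slot: place the filler particle at a fixed coordinate.
            nbatX[slot].x = nbatX[slot].y = nbatX[slot].z = fillerCoord;
        }
    }
}

// The two standalone flavors of the F buffer op listed above.
enum class ForceOp { Overwrite, Accumulate };

// F buffer op: scatter forces from the padded layout back to the native layout.
// Assumes each native atom occupies at most one slot, so writes do not collide.
inline void transformFToNative(const std::vector<Vec3>& nbatF,
                               const std::vector<int>&  slotToAtom,
                               ForceOp                  op,
                               std::vector<Vec3>&       nativeF)
{
    assert(nbatF.size() == slotToAtom.size());
    for (std::size_t slot = 0; slot < slotToAtom.size(); ++slot)
    {
        const int atom = slotToAtom[slot];
        if (atom < 0)
        {
            continue; // padding slots carry no force
        }
        assert(static_cast<std::size_t>(atom) < nativeF.size());
        Vec3& f = nativeF[atom];
        if (op == ForceOp::Accumulate)
        {
            // Add to forces already present from other force computations.
            f.x += nbatF[slot].x;
            f.y += nbatF[slot].y;
            f.z += nbatF[slot].z;
        }
        else
        {
            f = nbatF[slot];
        }
    }
}
```

With this shape, only the native-layout X buffer and the `slotToAtom` array would need to be resident on the device, and the accumulate flavor differs from the transform-only flavor by a single read-modify-write per atom. The inline (in-kernel) flavor is not sketched here.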