Feature #2817

Updated by Szilárd Páll about 2 years ago

Implement the native to nbat/nbnxn layout transforms/"buffer ops" in the nbnxn_gpu module (@nbnxn_atomdata_add_nbat_f_to_f@ and @nbnxn_atomdata_copy_x_to_nbat_x@).

Role and scope:
* Implementing the coordinate (X) transform on the GPU will allow transferring only the native layout. While this may not make the code faster (considering the CUDA API overheads and that the "extra" H2D transfer is typically overlapped), it does remove the CPU from the critical path in nonbonded communication, which is beneficial to scaling and to direct GPU-GPU communication.
* Similarly, a force layout transform kernel will allow direct force communication. Multiple flavors and implementation strategies are to be considered:
** transform only (e.g. if there is no other force compute on the GPU)
** transform + accumulate (accumulate with other force compute outputs)
** an inline transform function for on-the-fly transform within the nonbonded kernel; in particular at high parallelization, the performance hit in the nonbonded kernel may be less than the cost of launching an extra kernel.
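
For discussion, the two transforms can be sketched as a plain C++ reference in which each loop body corresponds to what a single GPU thread would do. This is only an illustration under simplifying assumptions, not the actual nbnxn data structures: the index maps @slotToAtom@ and @atomToSlot@, the strides @c_xStride@ and @c_fStride@, zero-filled padding slots, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a native-layout coordinate or force.
struct Vec3
{
    float x, y, z;
};

// Assumed packed strides: 4 floats per coordinate slot (x, y, z and one
// spare float, e.g. for a charge) and 3 floats per force slot.
constexpr std::size_t c_xStride = 4;
constexpr std::size_t c_fStride = 3;

// Native -> packed coordinate transform for one slot (one GPU thread per slot).
// Padding slots (index < 0) are zero-filled here as a placeholder choice;
// the spare fourth float of each slot is left untouched.
inline void copyXSlot(std::size_t slot, const std::vector<int>& slotToAtom,
                      const std::vector<Vec3>& x, std::vector<float>& packedX)
{
    const int  atom = slotToAtom[slot];
    const Vec3 v    = (atom >= 0) ? x[static_cast<std::size_t>(atom)] : Vec3{ 0.0f, 0.0f, 0.0f };
    packedX[slot * c_xStride + 0] = v.x;
    packedX[slot * c_xStride + 1] = v.y;
    packedX[slot * c_xStride + 2] = v.z;
}

// Packed -> native force transform for one atom (one GPU thread per atom).
// Each atom writes only its own output element, so no atomics are needed.
// accumulate == false is the "transform only" flavor;
// accumulate == true is the "transform + accumulate" flavor.
inline void addFAtom(std::size_t atom, const std::vector<int>& atomToSlot,
                     const std::vector<float>& packedF, std::vector<Vec3>& f, bool accumulate)
{
    const std::size_t slot = static_cast<std::size_t>(atomToSlot[atom]);
    const Vec3        v{ packedF[slot * c_fStride + 0], packedF[slot * c_fStride + 1],
                  packedF[slot * c_fStride + 2] };
    if (accumulate)
    {
        f[atom].x += v.x;
        f[atom].y += v.y;
        f[atom].z += v.z;
    }
    else
    {
        f[atom] = v;
    }
}

// Serial drivers; on a GPU each loop would become a kernel launch.
void copyXToPacked(const std::vector<int>& slotToAtom, const std::vector<Vec3>& x,
                   std::vector<float>& packedX)
{
    assert(packedX.size() >= slotToAtom.size() * c_xStride);
    for (std::size_t slot = 0; slot < slotToAtom.size(); slot++)
    {
        copyXSlot(slot, slotToAtom, x, packedX);
    }
}

void addPackedFToF(const std::vector<int>& atomToSlot, const std::vector<float>& packedF,
                   std::vector<Vec3>& f, bool accumulate)
{
    assert(atomToSlot.size() == f.size());
    for (std::size_t atom = 0; atom < f.size(); atom++)
    {
        addFAtom(atom, atomToSlot, packedF, f, accumulate);
    }
}
```

In this sketch the "transform only" and "transform + accumulate" flavors differ only in the @accumulate@ flag. The inline flavor would amount to calling something like @addFAtom@ at the end of the nonbonded kernel instead of launching a separate pass.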

Related TODOs:

* need to improve/resolve ownership of GPU inputs/outputs
* pinning for search data that is currently not pinned or not pinnable