Feature #2817

Updated by Szilárd Páll 2 months ago

Implement the native to nbat/nbnxn layout transforms/"buffer ops" in the nbnxn_gpu module (@nbnxn_atomdata_add_nbat_f_to_f@ and @nbnxn_atomdata_copy_x_to_nbat_x@).
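
To make the coordinate direction concrete, here is a minimal CPU reference sketch of the kind of gather a GPU kernel would perform, with one thread per nbat slot in place of the loop. It is not the GROMACS implementation: the function signature, the @Vec3@ type, the 4-float stride (x, y, z plus an untouched fourth component), and the convention that a negative index marks a padding slot are all assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a native-layout coordinate (not the GROMACS type).
using Vec3 = std::array<float, 3>;

// Assumed packed nbat layout: 4 floats per slot (x, y, z, and a 4th
// component, e.g. charge, that this transform leaves untouched).
constexpr std::size_t c_nbatStride = 4;

// Gather native coordinates into nbat order.
// atomIndex[i] is the native atom stored in nbat slot i; a negative value
// marks a padding slot (assumption), which receives the filler position.
// nbatX must hold at least atomIndex.size() * c_nbatStride floats.
void copyXToNbatX(const std::vector<Vec3>& x,
                  const std::vector<int>&  atomIndex,
                  float                    filler,
                  std::vector<float>&      nbatX)
{
    for (std::size_t i = 0; i < atomIndex.size(); ++i)
    {
        float*    dst = &nbatX[i * c_nbatStride];
        const int a   = atomIndex[i];
        if (a >= 0)
        {
            dst[0] = x[a][0];
            dst[1] = x[a][1];
            dst[2] = x[a][2];
        }
        else
        {
            dst[0] = filler;
            dst[1] = filler;
            dst[2] = filler;
        }
        // dst[3] is intentionally not written.
    }
}
```

Because every slot writes only its own output location, the loop body maps directly onto independent GPU threads; only the native-layout @x@ and the index array need to be resident on the device.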

Role and scope:
* Implementing the coordinate (X) transform on the GPU will allow transferring only the native layout. This may not make the code faster, given the CUDA API overheads and that the "extra" H2D transfer is typically overlapped, but it does remove the CPU from the critical path in nonbonded communication, which benefits scaling and direct GPU-GPU communication.
* Similarly, the force layout transform kernel will allow direct force communication. This transform can be combined with reduction across force buffers. Multiple flavors and implementation strategies to be considered:
- only transform (e.g. if no other force compute on the GPU)
- transform + reduce by accumulating into the f buffer output on the GPU (e.g. reduce the NB and the RVec f PME GPU output);
- transform + reduce multiple force buffers: reduce the result of force calculation outputs from different memory spaces (special forces on CPU, PME on separate GPU, etc.)
- the transform+reduce kernels can use simple or atomic accumulation into a reduced f output buffer; the former will require exclusive access to the target force buffer (need to wait for the completion of any kernel that produces forces into it) while the latter would only require a wait on the source force buffer(s) to be reduced into the target (e.g. GPU NB and/or CPU force buffer).
- consider an inline transform function for on-the-fly transform within the nonbonded kernel; in particular at high parallelization, the performance hit in the nonbonded kernel may be smaller than the cost of launching an extra kernel.
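
The force-side flavors listed above can be sketched with a single CPU reference function. This is only an illustration under stated assumptions, not the GROMACS code: the signature, the @Vec3@ type, the negative-index padding convention, and the requirement that each native atom appears in at most one nbat slot are all hypothetical. The @accumulate@ flag switches between "only transform" and "transform + reduce into the existing f buffer", and @extraNative@ stands in for additional native-layout force buffers (for example PME or CPU forces) that are reduced in the same pass.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a native-layout force (not the GROMACS type).
using Vec3 = std::array<float, 3>;

// Scatter nbat-ordered forces back to native order, optionally reducing.
//  accumulate == false: transform only; f is overwritten for mapped atoms.
//  accumulate == true : transform + reduce into what f already holds.
//  extraNative        : further native-layout buffers summed in as well.
// Assumes each native atom occurs in at most one nbat slot, so plain
// (non-atomic) updates are safe when slots are processed independently.
void addNbatFToF(const std::vector<Vec3>&                     nbatF,
                 const std::vector<int>&                      atomIndex,
                 const std::vector<const std::vector<Vec3>*>& extraNative,
                 bool                                         accumulate,
                 std::vector<Vec3>&                           f)
{
    for (std::size_t i = 0; i < atomIndex.size(); ++i)
    {
        const int a = atomIndex[i];
        if (a < 0)
        {
            continue; // padding slot carries no real force
        }
        Vec3 sum = nbatF[i];
        for (const std::vector<Vec3>* buf : extraNative)
        {
            for (int d = 0; d < 3; ++d)
            {
                sum[d] += (*buf)[a][d];
            }
        }
        for (int d = 0; d < 3; ++d)
        {
            f[a][d] = accumulate ? f[a][d] + sum[d] : sum[d];
        }
    }
}
```

In this sketch the plain read-modify-write on @f@ corresponds to the "simple" accumulation case, which is why the target buffer must not be written concurrently by another producer; an atomic variant would replace that final update with an atomic add.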

Related TODOs:
* need to resolve ownership of GPU inputs/outputs [WIP]
* -pinning for currently not pinned/pinnable search data-