Feature #2816: Device-side update&constraits, buffer ops and multi-gpu comms
GPU Halo Exchange
When utilizing multiple GPUs with PP domain decomposition, halo data must be exchanged between PP tasks. Currently, this is routed through the host CPUs: D2H transfer; buffer packing on CPU; host-side MPI; and same in reverse. This is true for both the position and force buffers, the process is analogous but position and force versions can be treated as separate patches.
Instead, we can transfer data directly between GPU memory spaces using GPU peer-to-peer communication. Modern MPI implementations are CUDA-aware and support this. The complication is that we need to pack buffers directly on the GPU. This is done by first preparing (on the CPU) a mapping of array indices involved in the exchange, which can then be used in GPU buffer packing/unpacking kernels. This mapping array only needs to populated and transferred during neighbor search steps.