This is a rough outline of what one would have to do to make PME work on multiple GPUs (using domain decomposition).
# As a minimal first step, the PME spread/gather GPU kernels (CUDA, and in the future OpenCL with mostly the same code) should be made to work with decomposition:
* the realGridSizePadded vector in the PME GPU structure should be updated to allow for a (PME order - 1) overlap in the decomposed dimensions;
* the hardcoded wrapX/wrapY spread/gather kernel parameters should be set properly to reflect 1D/2D decomposition; the kernel code should already account for them;
* kernels should work only with the given (local) range/indices of atoms, preferably passed as an ArrayRef;
* for correctness purposes only, the grid-overlap communication MPI calls from gmx_pme_do(), such as sum_fftgrid_dd(), could be reintroduced into the PME GPU host code after spreading/before gathering;
* obviously, one would have to modify/disable the corresponding single-PME-rank GPU assignment safeguards/asserts.
With this, PME mixed mode (-nb gpu -pme gpu -pmefft cpu) should function correctly with multiple ranks (e.g. -npme x). PmeTest in mdrun-mpi-test can be extended to cover the correctness.
# As the next step, to make PME on GPUs reasonably fast with multiple ranks, one approach I can think of is to:
* implement a full-grid all-to-all broadcast across all PME ranks after spreading, only once per step;
* run GPU FFT/solve redundantly on this full grid on all PME ranks: since our grid sizes are so small anyway (10^5 - 10^6 cells), we would not extract much more parallelism from those stages in any case. Also, this needs next to no code modification.
CUDA-aware MPI might come into this, of course, and there might be better ways. It might not be beneficial at all compared to the mixed mode from step 1; that remains to be seen.