GPU performance goals overview
This is an overview of, and a place to discuss, potential goals for GPU performance improvements in GROMACS.
We have 3 distinct cases, which we can prioritize differently.
- shared inputs to PME/NB (coordinate layout transformation kernel or indexing functions; avoiding a redundant H2D copy);
- shared outputs from PME/NB (force buffer layout transformation kernel or indexing functions; atomic reduction of output forces into the same device-side buffer);
- potential incremental PME kernel improvements from #2402 would be most relevant to the single-GPU case.
- consider the best way of communicating coordinates from the PP ranks to the PME rank.
- CUDA-aware MPI is an option, but it needs a clean, viable fallback code path.
- Can we use multiple contexts/GPUs within a single rank instead? That is trivial to implement for testing purposes. Would there be a benefit?
- evaluate pipelining the H2D coordinate copy with multiple spread launches on the PME rank. This would require one (hopefully small) change, which is also required for the PME GPU decomposition (#2463): teaching spread to work on separate chunks of atom data. Here the chunks would also be processed in multiple streams, while accumulating onto the same grid.
- the PME GPU decomposition is briefly described in #2463;
- the short-term goal is mixed mode (spread/gather decomposition only);
- a possible way to stay on the GPU is to run a redundant cuFFT of the whole grid (after spread) on multiple GPUs, possibly using GPUDirect to gather the grid onto all GPUs.
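
To make the "spread on chunks of atom data, accumulating onto the same grid" idea more concrete, here is a minimal CPU-side sketch in C++. It is not GROMACS code and makes several simplifying assumptions: the `Atom` struct, `spreadChunk` and `spreadInChunks` are hypothetical names, the grid is one-dimensional, charge assignment is nearest-grid-point instead of B-spline spreading, and `std::thread` workers stand in for CUDA streams. The only point it tries to show is that a spread routine taking an atom range can be launched several times concurrently, as long as the grid updates are atomic.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical atom record: 1D position in grid units (assumed non-negative) and charge.
struct Atom
{
    float x;
    float charge;
};

// Atomic float addition via compare-exchange; plays the role of atomicAdd() on a GPU.
inline void atomicAddFloat(std::atomic<float>& target, float value)
{
    float expected = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(expected, expected + value, std::memory_order_relaxed))
    {
        // expected was refreshed with the current value; retry
    }
}

// Spread only atoms in [begin, end) onto the shared grid.
// Assumption: nearest-grid-point assignment with periodic wrapping, not B-splines.
void spreadChunk(const std::vector<Atom>& atoms,
                 std::size_t begin,
                 std::size_t end,
                 std::vector<std::atomic<float>>& grid)
{
    const std::size_t gridSize = grid.size();
    for (std::size_t i = begin; i < end; ++i)
    {
        const std::size_t cell = static_cast<std::size_t>(atoms[i].x + 0.5f) % gridSize;
        atomicAddFloat(grid[cell], atoms[i].charge);
    }
}

// Split the atoms into numChunks ranges and spread each range in its own thread
// (standing in for one stream per chunk), all accumulating onto the same grid.
std::vector<float> spreadInChunks(const std::vector<Atom>& atoms,
                                  std::size_t gridSize,
                                  std::size_t numChunks)
{
    assert(gridSize > 0 && numChunks > 0);
    std::vector<std::atomic<float>> grid(gridSize);
    for (auto& cell : grid)
    {
        cell.store(0.0f);
    }

    const std::size_t chunkSize = (atoms.size() + numChunks - 1) / numChunks;
    std::vector<std::thread> workers;
    for (std::size_t c = 0; c < numChunks; ++c)
    {
        const std::size_t begin = std::min(c * chunkSize, atoms.size());
        const std::size_t end   = std::min(begin + chunkSize, atoms.size());
        workers.emplace_back(spreadChunk, std::cref(atoms), begin, end, std::ref(grid));
    }
    for (auto& worker : workers)
    {
        worker.join();
    }

    // Copy out plain floats once all chunks have finished.
    std::vector<float> result(gridSize);
    for (std::size_t i = 0; i < gridSize; ++i)
    {
        result[i] = grid[i].load();
    }
    return result;
}
```

Calling `spreadInChunks(atoms, gridSize, 1)` and `spreadInChunks(atoms, gridSize, 3)` on the same input should give the same grid up to floating-point summation order. In a real GPU version each chunk's spread launch could only start once that chunk's H2D copy has completed in the same stream, which is where the pipelining benefit would come from.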