Feature #2054

Updated by Aleksei Iupinov over 2 years ago

This is a general issue to discuss and keep track of the PME GPU implementation progress.

PME for CUDA is in Gromacs 2018.

The current task is to implement the current PME for OpenCL 1.2, and to unify a bunch of the easily unifiable PME/NB CUDA/OpenCL GPU code on the Gromacs codebase side.

The current PME GPU implementation has the following restrictions, which will also apply to the first OpenCL implementation:

1) PME order of 4 only - mostly a programming convenience and a sane default (even though it would be fun to change spread/gather kernel assumptions and logic, to try out order of 8).
2) No PME decomposition (only a single process can run the whole PME GPU task, either with or without NB) - can be changed in a separate project.
3) Single precision only (pretty much a given with GPUs).
4) No free energy (~no multiple grids - not a difficult thing to implement).
5) No Lennard-Jones PME (~no multiple grids + no LJ solver).
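For context on the order-4 restriction above: PME order 4 means each charge is spread onto 4 grid points per dimension, with weights from the standard cardinal B-spline recursion used in smooth PME. A small Python sketch of that recursion (illustrative only; not the Gromacs kernel code, and the function name is made up):

```python
def bspline_weights(dr, order=4):
    """Cardinal B-spline interpolation weights for a charge at fractional
    grid offset dr in [0, 1).  order=4 matches the current restriction.
    Built by the usual recursion: start from the order-2 (linear) weights
    and raise the order one step at a time.  Requires order >= 2."""
    w = [0.0] * order
    w[1] = dr          # order-2 (linear interpolation) start
    w[0] = 1.0 - dr
    for k in range(3, order + 1):
        div = 1.0 / (k - 1)
        w[k - 1] = div * dr * w[k - 2]
        for l in range(1, k - 1):
            w[k - l - 1] = div * ((dr + l) * w[k - l - 2]
                                  + (k - l - dr) * w[k - l - 1])
        w[0] = div * (1.0 - dr) * w[0]
    return w
```

The weights always sum to 1, and for dr = 0.5 the order-4 weights are symmetric (1/48, 23/48, 23/48, 1/48), which makes the function easy to sanity-check.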

The rough plan of the patch sequence is:

1) PME GPU/CUDA framework patch. That includes the PME CUDA timings, PME cuFFT wrappers, PME GPU/CUDA data structures, various initialization and memory management functions.
2 - 4) Separate patches for the PME GPU stages (the approximate computation stages of the PME method). Each patch is to include the GPU kernel(s), their host launch code, and Google unit tests for comparing the GPU and CPU counterparts:
Additionally, the OpenCL-specific implementation will at first have the warp size fixed at 32 where possible (though I imagine it should be trivial to relax it to 16 or multiples of 32).

- GPU counterpart of spread_on_grid(); spline, spread and wrap kernels.
- GPU counterpart of solve_pme_yzx(); solve kernel.
- GPU counterpart of gather_f_bsplines(); unwrap and gather kernels.
5) The main PME GPU function launching all

the stages, and the PME GPU launch calls where needed; some regression tests; handling the command line ("-pme=cpu/gpu/auto").

There should also be a checklist, on a per-file basis, of the CUDA/OpenCL wrapper changes that will facilitate porting the code without too much duplication.
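As background for the spread/wrap and unwrap/gather kernel pairs named above: with order-4 interpolation, a charge near the grid edge writes into up to 3 cells past the end, so the spreading grid is padded, the overflow is folded back onto the periodic grid ("wrap") before the FFT, and the edge cells are duplicated outward again ("unwrap") before force gathering. A 1-D Python sketch of the idea (the real kernels are 3-D CUDA/OpenCL code; names here are illustrative):

```python
def wrap(padded, n, order=4):
    """Fold the (order - 1) overflow cells of a padded 1-D spreading grid
    back onto the periodic grid of size n (after spreading, before the FFT).
    Assumes order - 1 <= n, so cell i maps periodically to i - n."""
    grid = list(padded[:n])
    for i in range(n, n + order - 1):
        grid[i - n] += padded[i]
    return grid

def unwrap(grid, order=4):
    """Duplicate the first (order - 1) cells past the end of the grid,
    so gathering near the edge can read without modular indexing."""
    return list(grid) + list(grid[:order - 1])
```

Wrapping conserves the total spread charge (it only relocates overflow contributions), which is a cheap invariant for the unit tests mentioned above to check.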

Other broad issues that are relevant and connected to PME, to keep in mind:
1) Reworking the GPU/device assignment.
2) Rethinking the GPU task scheduling.
3) The input/output data formats and providers, both on CPU and GPU (common GPU data framework, conversion kernel for NB, page-locked allocator for host pointers...).
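Finally, the per-stage Google unit tests mentioned in the plan boil down to comparing single-precision GPU output against a CPU reference within a tolerance, since GPU kernels reorder floating-point sums and bitwise equality cannot be expected. A minimal language-agnostic sketch of such a comparison (names and the tolerance choice are illustrative, not the actual Gromacs test code):

```python
import math

def grids_match(cpu_grid, gpu_grid, ulp_tol=64.0):
    """Compare a CPU reference grid against a GPU result elementwise,
    with a tolerance scaled to single-precision machine epsilon and
    the magnitude of each reference value.  Illustrative only."""
    eps32 = 2.0 ** -23  # single-precision machine epsilon
    if len(cpu_grid) != len(gpu_grid):
        return False
    for ref, got in zip(cpu_grid, gpu_grid):
        tol = ulp_tol * eps32 * max(1.0, abs(ref))
        if not math.isclose(ref, got, abs_tol=tol):
            return False
    return True
```

Scaling the tolerance by the reference magnitude keeps the check meaningful for both small and large grid values; a fixed absolute tolerance would be too loose for one and too strict for the other.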