PME on GPU
This is a general issue to discuss and keep track of the PME GPU implementation progress.
PME for CUDA is in Gromacs 2018.
The current task is to implement PME for OpenCL 1.2 and, on the side, to unify a bunch of easily unifiable PME/NB CUDA/OpenCL code.
The same original PME CUDA restrictions will apply to the first OpenCL implementation:
1) PME order of 4 only - mostly a programming convenience and a sane default (even though it would be fun to change spread/gather kernel assumptions and logic, to try out order of 8).
2) No PME decomposition (only a single process can run the whole PME GPU task, either with or without NB) - can be changed in a separate project.
3) No free energy (~no multiple grids - not a difficult thing to implement).
4) No Lennard-Jones PME (~no multiple grids + no LJ solver).
5) Single precision only (pretty much a given with GPUs and the approximate computation that is the PME method).
Additionally, the OpenCL implementation will at first have the warp size fixed at 32
(while it should be trivial to relax this to 16 or multiples of 32).
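The restrictions above could be collected into a single support check. The following is only a minimal sketch: the struct, its field names and the function are hypothetical and are not taken from the Gromacs sources.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of a PME run; all names here are illustrative
// assumptions, not Gromacs data structures.
struct PmeRunSettings
{
    int  pmeOrder           = 4;     // interpolation order
    int  numPmeRanks        = 1;     // more than 1 would mean PME decomposition
    bool useFreeEnergy      = false; // perturbation would need multiple grids
    bool useLennardJonesPme = false; // would need multiple grids and an LJ solver
    bool useDoublePrecision = false; // precision of the build
    bool isOpenCl           = false; // true for the OpenCL code path
    int  warpSize           = 32;    // execution width assumed by the kernels
};

// Returns the reasons why the settings fall outside the listed restrictions.
// An empty result means the run fits all of them.
std::vector<std::string> findPmeGpuRestrictionViolations(const PmeRunSettings& s)
{
    assert(s.numPmeRanks >= 1);
    std::vector<std::string> reasons;
    if (s.pmeOrder != 4)
    {
        reasons.push_back("only PME order 4 is supported");
    }
    if (s.numPmeRanks > 1)
    {
        reasons.push_back("PME decomposition is not supported");
    }
    if (s.useFreeEnergy)
    {
        reasons.push_back("free energy is not supported");
    }
    if (s.useLennardJonesPme)
    {
        reasons.push_back("Lennard-Jones PME is not supported");
    }
    if (s.useDoublePrecision)
    {
        reasons.push_back("only single precision is supported");
    }
    if (s.isOpenCl && s.warpSize != 32)
    {
        reasons.push_back("the OpenCL path assumes a warp size of 32");
    }
    return reasons;
}
```

Returning the list of reasons rather than a plain bool would make it easier to report to the user why a run falls back to the CPU.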
There should also be a checklist of CUDA/OpenCL wrapper changes here that will facilitate porting the code without too much duplication.
Other broad issues that are relevant and connected to PME, to keep in mind:
1) Reworking the GPU/device assignment.
2) Rethinking the GPU task scheduling.
3) The input/output data formats and providers both on CPU and GPU (common GPU data framework, conversion kernel for NB, pagelocked allocator for host pointers...).
PME spline+spread CUDA kernel and unit tests
The CUDA implementation of PME spline computation and charge spreading
for PME order 4 is added in pme-spread.cu.
The unit tests for PME CPU spline/spread stages
(e8cf7c0) are also extended to work with
the PME CUDA kernel, using the same reference data.
The tests iterate over all CUDA GPUs which are compatible with Gromacs.
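For readers unfamiliar with the spline/spread stage, here is a CPU-side sketch of what it computes, reduced to one dimension. It is not the code from pme-spread.cu: the function names are hypothetical, the weights are the textbook closed form of the cubic (order 4) B-spline, and the choice of centering the four grid points on the particle is an assumption of this sketch.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Cubic (order 4) B-spline weights for a fractional offset u in [0, 1].
// The four weights always sum to 1.
std::array<double, 4> splineWeightsOrder4(double u)
{
    assert(u >= 0.0 && u <= 1.0);
    const double v = 1.0 - u;
    return { v * v * v / 6.0,
             (3.0 * u * u * u - 6.0 * u * u + 4.0) / 6.0,
             (-3.0 * u * u * u + 3.0 * u * u + 3.0 * u + 1.0) / 6.0,
             u * u * u / 6.0 };
}

// Spreads one charge onto a periodic 1D grid; position is given in grid units.
// The charge goes to the cells base-1 .. base+2 around the particle.
void spreadCharge1D(double position, double charge, std::vector<double>& grid)
{
    const int n = static_cast<int>(grid.size());
    assert(n >= 4);
    const double floored = std::floor(position);
    const std::array<double, 4> w = splineWeightsOrder4(position - floored);
    const int base = static_cast<int>(floored);
    for (int i = 0; i < 4; ++i)
    {
        // Periodic wrap that also handles negative indices.
        const int cell = ((base - 1 + i) % n + n) % n;
        grid[cell] += charge * w[i];
    }
}
```

Because the weights sum to 1, the total charge on the grid equals the sum of the spread charges, which is a convenient sanity check for any implementation of this stage.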
PME force gathering - CUDA kernel + unit tests
The CUDA implementation of PME force gathering for PME order 4 is added
in pme-gather.cu. The unit tests for PME CPU force gathering
(d20a5d36) are extended to work with the CUDA kernel, using
the same reference data. The tests iterate over all Gromacs-compatible CUDA GPUs.
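As a rough illustration of the gathering stage, the sketch below interpolates a force from a 1D periodic potential grid using the derivatives of the cubic (order 4) B-spline weights. It is not the code from pme-gather.cu: the function names, the 1D reduction, the grid-unit convention and the centering of the four grid points on the particle are all assumptions of this sketch.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Derivatives (with respect to the fractional offset u) of the cubic
// B-spline weights; the four values always sum to 0.
std::array<double, 4> splineDerivativesOrder4(double u)
{
    assert(u >= 0.0 && u <= 1.0);
    const double v = 1.0 - u;
    return { -v * v / 2.0,
             (3.0 * u * u - 4.0 * u) / 2.0,
             (-3.0 * u * u + 2.0 * u + 1.0) / 2.0,
             u * u / 2.0 };
}

// Gathers the force on one charge from a periodic 1D potential grid.
// position is in grid units; the result is -charge * d(potential)/d(position),
// with the potential interpolated from cells base-1 .. base+2.
double gatherForce1D(double position, double charge, const std::vector<double>& potentialGrid)
{
    const int n = static_cast<int>(potentialGrid.size());
    assert(n >= 4);
    const double floored = std::floor(position);
    const std::array<double, 4> dw = splineDerivativesOrder4(position - floored);
    const int base = static_cast<int>(floored);
    double gradient = 0.0;
    for (int i = 0; i < 4; ++i)
    {
        // Periodic wrap that also handles negative indices.
        const int cell = ((base - 1 + i) % n + n) % n;
        gradient += dw[i] * potentialGrid[cell];
    }
    return -charge * gradient;
}
```

Two properties are easy to check with this sketch: a constant potential gives zero force, and a potential that rises linearly with the cell index (away from the periodic boundary) gives a force of minus the charge times the slope.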
PME solving - CUDA kernel + unit tests
The CUDA implementation of PME solving is added in pme-solve.cu.
The unit tests for PME CPU solving are extended to work with the CUDA kernel,
using the same reference data.
The CUDA solver supports 2 grid dimension orders: YZX and XYZ
(unlike the CPU one which only supports YZX). This is also tested.
Lennard-Jones solving is not implemented.
The tests iterate over all Gromacs-compatible CUDA GPUs.
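To make the two grid dimension orders concrete, here is a small indexing sketch. It assumes that an order name lists the dimensions from slowest-varying (major) to fastest-varying (minor), so YZX means major Y, middle Z, minor X. The enum and function below are hypothetical and ignore any padding a real grid may have.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical enum naming the two supported orderings.
enum class GridOrdering
{
    YZX,
    XYZ
};

// Returns the linear index of cell {x, y, z} in an unpadded grid of size
// {nx, ny, nz}, for the given dimension ordering.
std::size_t linearIndex(GridOrdering ordering, const std::array<int, 3>& cell, const std::array<int, 3>& size)
{
    // dims = {major, middle, minor}, as indices into cell/size (0 = X, 1 = Y, 2 = Z).
    const std::array<int, 3> dims = (ordering == GridOrdering::YZX) ? std::array<int, 3>{ 1, 2, 0 }
                                                                    : std::array<int, 3>{ 0, 1, 2 };
    for (int d = 0; d < 3; ++d)
    {
        assert(cell[d] >= 0 && cell[d] < size[d]);
    }
    const std::size_t major      = static_cast<std::size_t>(cell[dims[0]]);
    const std::size_t middle     = static_cast<std::size_t>(cell[dims[1]]);
    const std::size_t minor      = static_cast<std::size_t>(cell[dims[2]]);
    const std::size_t middleSize = static_cast<std::size_t>(size[dims[1]]);
    const std::size_t minorSize  = static_cast<std::size_t>(size[dims[2]]);
    return (major * middleSize + middle) * minorSize + minor;
}
```

For example, in a 2x3x4 grid the cell {1, 0, 0} sits at index 1 under YZX (X is the fastest-varying dimension) and at index 12 under XYZ (X is the slowest-varying dimension).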
Add calls to the PME GPU stages
This adds the inactive calls to PME GPU stages both for PP+PME
and PME-only ranks.
#2 Updated by Aleksei Iupinov almost 3 years ago
There are a couple of PME CPU unit test patches sitting idle in Gerrit.
These are https://gerrit.gromacs.org/6251/ and https://gerrit.gromacs.org/6337.
I would like to get these in sooner rather than later,
as the GPU spline computation/spreading patch involves a unit test that builds both on those and on the main PME GPU patch https://gerrit.gromacs.org/6212.
Don't be discouraged by their sizes - most of that is just generated reference data, the actual code is ~500 lines in each.
#5 Updated by Szilárd Páll almost 3 years ago
- Difficulty hard added
- Difficulty deleted (
- PME-GPU user-interface (command line, manual device assignment, log reporting, etc.)
- user documentation + examples
- testing on multiple generations of devices (CC 2.0?)
- testing with multiple CUDA releases
- performance evaluation (at least to determine the range of use-cases where it makes sense to use a GPU for PME)
#12 Updated by Aleksei Iupinov over 2 years ago
I would like to urge everyone to review the low-level PME GPU building blocks:
https://gerrit.gromacs.org/#/c/6357/ (spreading kernel)
https://gerrit.gromacs.org/#/c/6459/ (solving kernel)
https://gerrit.gromacs.org/#/c/6437/ (gathering kernel)
https://gerrit.gromacs.org/#/c/6212/ (the data structures, their management, cuFFT calls) - this one is large and already has some renaming/cleanup TODOs, which would be much easier to resolve when all these 4 changes are merged in.
There is more work to do if you look at https://gerrit.gromacs.org/#/q/topic:pme, so I suggest that these components are reviewed first - they have been sitting there for a while, and I think they would do more good being tested by users of the master branch with included unit-tests.
Note that there is a GPU-task assignment change https://gerrit.gromacs.org/#/c/6205/ which sits at the bottom of the PME GPU branch (and in hindsight probably should have started higher), so I would appreciate more reviews on that as well. Otherwise, rebasing the core PME GPU changes once they're reviewed to skip it should be trivial.
#18 Updated by Szilárd Páll about 2 years ago
Aleksei Iupinov wrote:
OK, I'm just not familiar with issue tracking logic - if we implement e.g. coordinate conversion kernel, or PME GPU decomposition, and make an issue for that, is it alright for it to have a closed parent?
We targeted this feature for the next release. While some of the subtasks did not materialize (but overall the feature was implemented), it might be cleaner to close this issue and continue with the few smaller remaining tasks. #2208 and #2240 should be possible to resolve one way or another.