Feature #2054

PME on GPU

Added by Aleksei Iupinov almost 3 years ago. Updated about 2 months ago.

Status: Accepted
Priority: High
Category: mdrun
Target version:
Difficulty: hard

Description

This is a general issue to discuss and keep track of the PME GPU implementation progress.

PME for CUDA is in Gromacs 2018.
The current task is to implement PME for OpenCL 1.2, and to unify a bunch of easily unifiable PME/NB CUDA/OpenCL code on the side.
The same original PME CUDA restrictions will apply to the first OpenCL implementation:

1) PME order of 4 only - mostly a programming convenience and a sane default (even though it would be fun to change spread/gather kernel assumptions and logic, to try out order of 8).
2) No PME decomposition (only a single process can run the whole PME GPU task, either with or without NB) - can be changed in a separate project.
3) No free energy (~no multiple grids - not a difficult thing to implement).
4) No Lennard-Jones PME (~no multiple grids + no LJ solver).
5) Single precision only (pretty much a given with GPUs and the approximate computation that is the PME method).
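
As an illustration only, the five restrictions above could be collected into a single eligibility check. The sketch below is not GROMACS code: the `PmeRunSetup` struct, its fields, and `findPmeGpuRestrictionViolations` are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of a run setup; none of these names come from GROMACS.
struct PmeRunSetup
{
    int  pmeOrder            = 4;     // interpolation order
    int  numPmeRanks         = 1;     // number of processes sharing the PME work
    bool usesFreeEnergy      = false; // would need multiple grids
    bool usesLennardJonesPme = false; // would need multiple grids and an LJ solver
    bool doublePrecision     = false; // precision of the build
};

// Returns one message per violated restriction; an empty result means the
// setup passes all five checks from the list above.
std::vector<std::string> findPmeGpuRestrictionViolations(const PmeRunSetup& setup)
{
    assert(setup.pmeOrder > 0 && setup.numPmeRanks > 0);
    std::vector<std::string> violations;
    if (setup.pmeOrder != 4)
    {
        violations.emplace_back("PME order must be 4");
    }
    if (setup.numPmeRanks != 1)
    {
        violations.emplace_back("PME decomposition is not supported");
    }
    if (setup.usesFreeEnergy)
    {
        violations.emplace_back("free energy is not supported");
    }
    if (setup.usesLennardJonesPme)
    {
        violations.emplace_back("Lennard-Jones PME is not supported");
    }
    if (setup.doublePrecision)
    {
        violations.emplace_back("only single precision is supported");
    }
    return violations;
}
```

Returning the list of reasons rather than a single boolean makes it easy to report every blocking setting at once. The sketch models only the original list; revision 4c2fe1e6, referenced in this issue, later allows free-energy runs when charges are not perturbed.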

Additionally, the OpenCL-specific implementation will at first have the warp size fixed at 32
(though it should be trivial to relax it to 16 or multiples of 32).
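
A minimal sketch of the two width rules mentioned above, assuming the execution width is available as a plain integer. Both function names are hypothetical and do not come from the GROMACS sources.

```cpp
// Hypothetical helpers, not GROMACS functions.

// Initial OpenCL restriction: the execution width (warp size) must be exactly 32.
constexpr bool isWidthSupportedInitially(int width)
{
    return width == 32;
}

// Relaxed rule suggested above: 16, or any positive multiple of 32.
constexpr bool isWidthSupportedRelaxed(int width)
{
    return width == 16 || (width > 0 && width % 32 == 0);
}

// Compile-time checks of both rules.
static_assert(isWidthSupportedInitially(32), "32 is the initial fixed width");
static_assert(!isWidthSupportedInitially(64), "64 is rejected at first");
static_assert(isWidthSupportedRelaxed(64), "64 is a multiple of 32");
```

Under the relaxed rule a width of 8 would still be rejected, which is the case that subtask #2516 (execution width < 16) tracks separately.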

There should also be a checklist of CUDA/OpenCL wrapper changes here that will facilitate porting the code without too much duplication.

Other broad issues that are relevant and connected to PME, to keep in mind:
1) Reworking the GPU/device assignment.
2) Rethinking the GPU task scheduling.
3) The input/output data formats and providers both on CPU and GPU (common GPU data framework, conversion kernel for NB, page-locked allocator for host pointers...).


Subtasks

Bug #2208: cuFFT linking | New
Task #2240: GPU emulation mode support for PME missing | Accepted
Bug #2303: PME GPU opt-in mixed mode is broken with PME tuning | Closed | Aleksei Iupinov
Task #2453: PME OpenCL porting effort | Resolved | Aleksei Iupinov
Task #2498: OpenCL memory pinning/mapping | New
Task #2500: detect and allow linking external clFFT, or no clFFT | Closed | Mark Abraham
Task #2514: PME OpenCL reductions with intrinsics | New
Task #2515: clFFT RocM compatibility problem | Closed | Szilárd Páll
Task #2516: Support PME OpenCL execution width < 16 | New | Aleksei Iupinov
Task #2519: Improve/remove PME OpenCL kernel barriers | New
Task #2520: Treat OpenCL kernel width more diligently | New
Task #2521: Implement alternating PME/NB wait for OpenCL | New
Task #2522: OpenCL context duplication | New
Task #2527: Rename GpuEventSynchronizer to something more fitting (after merging PME OpenCL) | New
Task #2529: Improve test timeouts handling | Closed | Szilárd Páll
Task #2531: Consider optimizing tabulated data access on GPU | New
Task #2532: enable queue priorities in OpenCL | New
Task #2535: consider compiling opencl fft kernels once | New
Bug #2536: clFFT execution not timed in PME | Closed | Szilárd Páll
Task #2537: Simplify PME solve reduction | New | Aleksei Iupinov
Task #2538: organize more of the PME GPU code along task-specific lines | New | Mark Abraham
Task #2696: ensure PME queue is flushed | In Progress | Szilárd Páll

Related issues

Related to GROMACS - Task #2092: Tests running on GPU, and hardware assignment | New
Related to GROMACS - Task #2124: PME GPU user interface suggestions | Closed
Related to GROMACS - Task #2053: refine notation in GPU code | New
Related to GROMACS - Task #2524: struct alignment/packing for OpenCL host & device code | New
Related to GROMACS - Task #3031: evaluate the impact of particle order on PME | New

Associated revisions

Revision 76c7a1a4 (diff)
Added by Aleksei Iupinov almost 2 years ago

PME spline+spread CUDA kernel and unit tests

The CUDA implementation of PME spline computation and charge spreading
for PME order 4 is added in pme-spread.cu.
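
For readers unfamiliar with the stage described above, the following is a rough one-dimensional CPU sketch of order-4 B-spline weights and charge spreading. It is not the code in pme-spread.cu: the function names, the periodic 1D grid, and the choice of which four grid points a particle touches are assumptions made for illustration. In three dimensions the weight of a grid point is the product of three such per-dimension weights.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Order-4 (cubic) cardinal B-spline weights for a fractional offset t in [0, 1).
// The four weights are non-negative and always sum to 1.
std::array<float, 4> splineWeightsOrder4(float t)
{
    assert(t >= 0.0f && t < 1.0f);
    const float t2 = t * t;
    const float t3 = t2 * t;
    const float u  = 1.0f - t;
    return {{ u * u * u / 6.0f,
              (3.0f * t3 - 6.0f * t2 + 4.0f) / 6.0f,
              (-3.0f * t3 + 3.0f * t2 + 3.0f * t + 1.0f) / 6.0f,
              t3 / 6.0f }};
}

// Spreads one charge onto a periodic 1D grid. The position is given in grid
// units, so its integer part selects the first of the four grid points and
// its fractional part selects the weights (an assumed indexing convention).
void spreadCharge1D(std::vector<float>& grid, float position, float charge)
{
    const int n = static_cast<int>(grid.size());
    assert(n >= 4 && position >= 0.0f);
    const int  base    = static_cast<int>(std::floor(position));
    const auto weights = splineWeightsOrder4(position - static_cast<float>(base));
    for (int k = 0; k < 4; ++k)
    {
        grid[(base + k) % n] += charge * weights[k];
    }
}
```

Because the weights sum to 1, the total charge on the grid equals the charge that was spread, which is a convenient invariant to check in a unit test.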

The unit tests for PME CPU spline/spread stages
(e8cf7c0) are also extended to work with
the PME CUDA kernel, using the same reference data.
The tests iterate over all CUDA GPUs which are compatible with Gromacs.

Refs #2054, #2092.

Change-Id: If5ec49f030b9b94395db28fa454ea25c3efb05d1

Revision 4231cc37 (diff)
Added by Aleksei Iupinov almost 2 years ago

PME force gathering - CUDA kernel + unit tests

The CUDA implementation of PME force gathering for PME order 4 is added
in pme-gather.cu. The unit tests for PME CPU force gathering
(d20a5d36) are extended to work with the CUDA kernel, using
the same reference data. The tests iterate over all Gromacs-compatible
CUDA GPUs.
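
As a rough companion to the description above, force gathering can be sketched in one dimension on the CPU. This is not the code in pme-gather.cu: the names, the periodic 1D potential grid, the `gridPointsPerLength` scaling parameter, and the indexing convention are assumptions for illustration. The force is the negative charge times the gradient of the spline-interpolated potential.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Derivatives with respect to t of the order-4 B-spline weights, t in [0, 1).
// They sum to 0 because the weights themselves always sum to 1.
std::array<float, 4> splineDerivativesOrder4(float t)
{
    assert(t >= 0.0f && t < 1.0f);
    const float t2 = t * t;
    const float u  = 1.0f - t;
    return {{ -u * u / 2.0f,
              (3.0f * t2 - 4.0f * t) / 2.0f,
              (-3.0f * t2 + 2.0f * t + 1.0f) / 2.0f,
              t2 / 2.0f }};
}

// Force on one charge from a periodic 1D potential grid. The position is in
// grid units; gridPointsPerLength converts the derivative to real units.
float gatherForce1D(const std::vector<float>& potential,
                    float                     position,
                    float                     charge,
                    float                     gridPointsPerLength)
{
    const int n = static_cast<int>(potential.size());
    assert(n >= 4 && position >= 0.0f);
    const int  base = static_cast<int>(std::floor(position));
    const auto dw   = splineDerivativesOrder4(position - static_cast<float>(base));
    float      dPhi = 0.0f; // derivative of the interpolated potential, per grid unit
    for (int k = 0; k < 4; ++k)
    {
        dPhi += dw[k] * potential[(base + k) % n];
    }
    return -charge * gridPointsPerLength * dPhi;
}
```

Two easy sanity checks follow from the math: a constant potential gives zero force, and a potential that grows by 1 per grid point gives a force of `-charge * gridPointsPerLength` as long as the four touched points do not wrap around.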

Ref #2054

Change-Id: I162e3a14cb9aa8ddeac17c5ad1ca709df72b8986

Revision 7bec7e1f (diff)
Added by Aleksei Iupinov almost 2 years ago

PME solving - CUDA kernel + unit tests

The CUDA implementation of PME solving is added in pme-solve.cu.
The unit tests for PME CPU solving are extended to work with the CUDA kernel,
using the same reference data.
The CUDA solver supports 2 grid dimension orders: YZX and XYZ
(unlike the CPU one which only supports YZX). This is also tested.
Lennard-Jones solving is not implemented.
The tests iterate over all Gromacs-compatible CUDA GPUs.
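
To make the two grid dimension orders mentioned above concrete, here is a small sketch of how a cell index could map to a linear memory offset under each order. It assumes that the order name lists the dimensions from slowest-varying to fastest-varying; that reading, as well as the enum and function names, are chosen for this example and are not taken from pme-solve.cu.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Named here for illustration only.
enum class GridOrdering
{
    YZX, // assumed: y is the slowest-varying index, x the fastest
    XYZ  // assumed: x is the slowest-varying index, z the fastest
};

// Linear offset of cell {x, y, z} in a grid of size {nx, ny, nz}.
std::size_t linearIndex(GridOrdering              ordering,
                        const std::array<int, 3>& cell,
                        const std::array<int, 3>& size)
{
    // Positions of the major (slowest), middle and minor (fastest) dimensions
    // within the {x, y, z} arrays.
    int major  = 0;
    int middle = 1;
    int minor  = 2;
    if (ordering == GridOrdering::YZX)
    {
        major  = 1;
        middle = 2;
        minor  = 0;
    }
    for (int d = 0; d < 3; ++d)
    {
        assert(cell[d] >= 0 && cell[d] < size[d]);
    }
    const std::size_t majorIndex  = static_cast<std::size_t>(cell[major]);
    const std::size_t middleIndex = static_cast<std::size_t>(cell[middle]);
    const std::size_t minorIndex  = static_cast<std::size_t>(cell[minor]);
    return (majorIndex * static_cast<std::size_t>(size[middle]) + middleIndex)
                   * static_cast<std::size_t>(size[minor])
           + minorIndex;
}
```

For a 2 x 3 x 4 grid, the cell {1, 0, 2} lands at offset 14 under XYZ and at offset 5 under YZX, so a solver that supports both orders only needs to swap which dimension it treats as major, middle and minor.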

Refs #2054

Change-Id: Ic610e7f077f39a64089dd9b80df9905094b10459

Revision 2747fc48 (diff)
Added by Aleksei Iupinov almost 2 years ago

Add calls to the PME GPU stages

This adds the inactive calls to PME GPU stages both for PP+PME
and PME-only ranks.

Ref #2054

Change-Id: I5af2ab95cedff422c39592255f01205d42fc7eb7

Revision 4c2fe1e6 (diff)
Added by Magnus Lundborg 11 months ago

Check q perturbation when PME on GPU is tested

If charges are not perturbed, allow running PME on the GPU in
FE simulations.

Refs #2054.

Change-Id: Ibc610cb63afaadf4aa97608b8e03b6906fe2d026

History

#1 Updated by Szilárd Páll over 2 years ago

I'd suggest creating subtasks to track progress of what needs to be done.

#2 Updated by Aleksei Iupinov over 2 years ago

There are a couple of PME CPU unit test patches sitting idle in Gerrit.
These are https://gerrit.gromacs.org/6251/ and https://gerrit.gromacs.org/6337.
I would like to get these in sooner rather than later,
as the GPU spline computation/spreading patch involves the unit test which builds both on those and on the main PME GPU patch https://gerrit.gromacs.org/6212 as well.
Don't be discouraged by their sizes - most of that is just generated reference data, the actual code is ~500 lines in each.

#3 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '10' for Issue #2054.
Uploader: Aleksei Iupinov
Change-Id: gromacs~master~If5ec49f030b9b94395db28fa454ea25c3efb05d1
Gerrit URL: https://gerrit.gromacs.org/6357

#4 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related DRAFT patchset '2' for Issue #2054.
Uploader: Aleksei Iupinov
Change-Id: gromacs~master~I162e3a14cb9aa8ddeac17c5ad1ca709df72b8986
Gerrit URL: https://gerrit.gromacs.org/6437

#5 Updated by Szilárd Páll over 2 years ago

  • Difficulty hard added
  • Difficulty deleted (uncategorized)

Additional tasks that are blockers for the release:
  • PME-GPU user-interface (command line, manual device assignment, log reporting, etc.)
  • user documentation + examples
  • testing on multiple generations of devices (CC 2.0?)
  • testing with multiple CUDA releases
  • performance evaluation (at least to determine the range of use-cases where it makes sense to use a GPU for PME)

#6 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '4' for Issue #2054.
Uploader: Aleksei Iupinov
Change-Id: gromacs~master~Ic610e7f077f39a64089dd9b80df9905094b10459
Gerrit URL: https://gerrit.gromacs.org/6459

#7 Updated by Aleksei Iupinov over 2 years ago

  • Related to Task #2092: Tests running on GPU, and hardware assignment added

#8 Updated by Aleksei Iupinov over 2 years ago

  • Related to Task #2124: PME GPU user interface suggestions added

#9 Updated by Aleksei Iupinov over 2 years ago

  • Related to Task #2053: refine notation in GPU code added

#10 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related DRAFT patchset '16' for Issue #2054.
Uploader: Aleksei Iupinov
Change-Id: gromacs~master~I9e705b86d5aa07d59544de68234cdd6242ad1194
Gerrit URL: https://gerrit.gromacs.org/6472

#11 Updated by Aleksei Iupinov about 2 years ago

  • Blocked by Task #2183: GPU-accessed memory page-locking and page sizes added

#12 Updated by Aleksei Iupinov about 2 years ago

I would like to urge everyone to review the low-level PME GPU building blocks:

https://gerrit.gromacs.org/#/c/6357/ (spreading kernel)
https://gerrit.gromacs.org/#/c/6459/ (solving kernel)
https://gerrit.gromacs.org/#/c/6437/ (gathering kernel)
https://gerrit.gromacs.org/#/c/6212/ (the data structures, their management, cuFFT calls) - this one is large and already has some renaming/cleanup TODOs, which would be much easier to resolve when all these 4 changes are merged in.

There is more work to do if you look at https://gerrit.gromacs.org/#/q/topic:pme, so I suggest that these components are reviewed first - they have been sitting there for a while, and I think they would do more good being tested by users of the master branch with included unit-tests.

Note that there is a GPU-task assignment change https://gerrit.gromacs.org/#/c/6205/ which sits at the bottom of the PME GPU branch (and in hindsight probably should have started higher), so I would appreciate more reviews on that as well. Otherwise, rebasing the core PME GPU changes once they're reviewed to skip it should be trivial.

#13 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related DRAFT patchset '2' for Issue #2054.
Uploader: Aleksei Iupinov
Change-Id: gromacs~master~I5af2ab95cedff422c39592255f01205d42fc7eb7
Gerrit URL: https://gerrit.gromacs.org/6670

#14 Updated by Mark Abraham over 1 year ago

  • Status changed from New to Resolved

This is now implemented as intended

#15 Updated by Mark Abraham over 1 year ago

  • Blocked by deleted (Task #2183: GPU-accessed memory page-locking and page sizes)

#16 Updated by Szilárd Páll over 1 year ago

Mark Abraham wrote:

This is now implemented as intended

Yeah, suggest closing.

#17 Updated by Aleksei Iupinov over 1 year ago

OK, I'm just not familiar with issue tracking logic - if we implement e.g. coordinate conversion kernel, or PME GPU decomposition, and make an issue for that, is it alright for it to have a closed parent?

#18 Updated by Szilárd Páll over 1 year ago

Aleksei Iupinov wrote:

OK, I'm just not familiar with issue tracking logic - if we implement e.g. coordinate conversion kernel, or PME GPU decomposition, and make an issue for that, is it alright for it to have a closed parent?

We targeted this feature for the next release. While some of the subtasks did not materialize (but overall the feature was implemented), it might be cleaner to close this issue and continue with the few smaller remaining tasks. #2208 and #2240 should be possible to resolve one way or another.

#19 Updated by Mark Abraham over 1 year ago

Yeah, we'll make new tasks (that can refer to this one) when we decide to do future work to add more functionality.

#20 Updated by Mark Abraham over 1 year ago

  • Status changed from Resolved to Accepted
  • Target version changed from 2018 to 2019

can't close this while sub tasks remain open, so retargeting

#21 Updated by Aleksei Iupinov over 1 year ago

  • Description updated (diff)

#22 Updated by Szilárd Páll about 1 year ago

  • Related to Task #2524: struct alignment/packing for OpenCL host & device code added

#23 Updated by Magnus Lundborg 11 months ago

Does anyone have any rough suggestions where to start implementing PME with free energy on GPU?

#24 Updated by Gerrit Code Review Bot 11 months ago

Gerrit received a related patchset '1' for Issue #2054.
Uploader: Magnus Lundborg
Change-Id: gromacs~master~Ibc610cb63afaadf4aa97608b8e03b6906fe2d026
Gerrit URL: https://gerrit.gromacs.org/8305

#25 Updated by Mark Abraham 8 months ago

  • Target version changed from 2019 to 2020

#26 Updated by Szilárd Páll 13 days ago

  • Related to Task #3031: evaluate the impact of particle order on PME added
