Task #2453

Feature #2054: PME on GPU

PME OpenCL porting effort

Added by Aleksei Iupinov over 1 year ago. Updated 6 months ago.

Status: Resolved
Priority: High
Category: -
Target version: 2020
Difficulty: uncategorized

Description

In porting PME from CUDA to OpenCL, I'm first going with dirty code with lots of duplication, to see how to strike a balance between neatness and extensibility. Most of the host-side logic is quite easy to wrap so that it looks the same in CUDA and OpenCL, since there are no C++ limitations there.

Functionality achieved (https://github.com/yupinov/gromacs/tree/pme_opencl_gerrit):
- PME OpenCL kernels passing unit tests on NVIDIA and AMD GPUs;
- clFFT still broken (fixed upstream in ROCm 2.0).

TODO:
- check correctness on Intel;
- document and clean up FIXMEs;
- subtasks.


Checklist

  • HOST-SIDE STUFF:
  • device buffer wrappers (allocate, deallocate, reallocate, copyToDevice, copyToHost, clear) - partially on gerrit, mostly done
  • 4 kernel launch parameters encapsulated in a struct (grid size, block size, shared memory size, GPU stream) - mostly done
  • GPU kernel launch unrolling wrapper with variadic arguments - mostly done
  • host allocation/deallocation wrappers
  • investigate OpenCL host C++ allocator possibilities
  • Synchronization event wrapper class - mostly done
  • split PME kernels and their launch functions - the latter can look the same, and OpenCL kernels are easier to treat in separate files
  • conclude whether most of the common GPU host code should be in headers or in cpp_also_compiled_as_cu
  • move over some relevant OpenCL helpers from NB into gpu_utils (or whatever it will be called by 2019)
  • teach the OpenCL kernel compiler a folder path different from "mdlib/nbnxn_ocl"
  • enhance all the PME google unit/overall tests to treat OpenCL gracefully
  • consider retaining PME kernels in unit tests (60 MB VRAM per single test run = bad...)
  • import bulk of clFFT
  • make Gromacs link to system clFFT by default
  • DEVICE-SIDE STUFF:
  • deal with OpenCL (1.x only?) not handling DeviceBuffers in structs - currently done with a macro which either passes all the buffers separately into each device function, or assigns the relevant pointers from the struct (as in CUDA)
  • deal with kernel argument structure definition to exclude DeviceBuffers - for now in a device-side duplicate I just replaced those with char[] dummies of equivalent size, also added attribute((aligned(8))) on host/device
  • deal with OpenCL 1.x not permitting templates - the template parameters can be wrapped in ugly macros, and replacement defines can be passed into the compiler
  • consider naming rules for defines which replace template parameters/constexprs
  • wrap fully similar things like threadIdx/get_local_id into helpers
  • consider duplicating CUDA/OpenCL kernels fully - would keep the code comparable with a diff tool and extensible, and avoids macro workarounds
  • consider transforming a single kernel source for CUDA/OpenCL before JIT, e.g. cutting template parts out. Tempting, but a bit of black magic - would allow encoding many of the porting workarounds, though
  • consider also keeping common inlineable device code in separate files, and most of the wrappers/ workarounds/macro bombs in different files?
  • consider implementing OpenCL counterparts to CUDA shuffle reductions in solve/gather

Subtasks

Task #2498: OpenCL memory pinning/mapping (New)
Task #2500: detect and allow linking external clFFT, or no clFFT (Closed, Mark Abraham)
Task #2514: PME OpenCL reductions with intrinsics (New)
Task #2515: clFFT ROCm compatibility problem (Closed, Szilárd Páll)
Task #2516: Support PME OpenCL execution width < 16 (New, Aleksei Iupinov)
Task #2519: Improve/remove PME OpenCL kernel barriers (New)
Task #2520: Treat OpenCL kernel width more diligently (New)
Task #2521: Implement alternating PME/NB wait for OpenCL (New)
Task #2522: OpenCL context duplication (New)
Task #2527: Rename GpuEventSynchronizer to something more fitting (after merging PME OpenCL) (New)
Task #2529: Improve test timeouts handling (Closed, Szilárd Páll)
Task #2531: Consider optimizing tabulated data access on GPU (New)
Task #2532: enable queue priorities in OpenCL (New)
Task #2535: consider compiling OpenCL FFT kernels once (New)
Bug #2536: clFFT execution not timed in PME (Closed, Szilárd Páll)
Task #2537: Simplify PME solve reduction (New, Aleksei Iupinov)

Related issues

Related to GROMACS - Task #2402: PME kernels general performance improvements (New)
Related to GROMACS - Task #2524: struct alignment/packing for OpenCL host & device code (New)

Associated revisions

Revision b2a95c76 (diff)
Added by Aleksei Iupinov over 1 year ago

Support persistent device context-derived data in PME tests

PME OpenCL will need to not recompile kernels for running each
unit test. With this in mind, a persistent PmeGpuProgram class
is tasked with GPU kernel setup, and passed around, using a typedef.
The purpose of the class is to hold the PME program data that should
only be set up once and live forever for the given device context.
PmeGpuProgramImpl structure is now tasked with managing
function pointers to the CUDA kernels' instances, and will later
be tasked with compiling OpenCL kernels.

Refs #2453, #2522

Change-Id: I85a01bfc92ec3a108825414b14e2be3731433c9a

Revision 442845fd (diff)
Added by Aleksei Iupinov over 1 year ago

Gather the PME GPU constants/macros in a single header

Those will be translated to defines by the OpenCL kernel compiler.

Refs #2453, #2528

Change-Id: I03062c908db6dad9bbc8c62accdc4707b03ff527

Revision a19dd7d5 (diff)
Added by Szilárd Páll 4 months ago

Fix OpenCL gather reduction

On >=16-wide execution it is correct (narrower is checked and excluded
during compilation).

TODO: Consider changing the default on NVIDIA & Intel where offloading
PME is generally not advantageous to performance.

Addresses part of #2519
Refs #2453 #2516

Change-Id: I24beaaeea096954ba32b3a80251945a9d82a3c05

History

#1 Updated by Aleksei Iupinov over 1 year ago

  • Description updated (diff)

#2 Updated by Aleksei Iupinov over 1 year ago

  • Description updated (diff)

#3 Updated by Aleksei Iupinov over 1 year ago

  • Subject changed from OpenCL PME porting effort to PME OpenCL porting effort
  • Description updated (diff)

#4 Updated by Aleksei Iupinov over 1 year ago

  • Parent task set to #2054

#5 Updated by Aleksei Iupinov over 1 year ago

  • Private changed from Yes to No

#6 Updated by Aleksei Iupinov over 1 year ago

  • Assignee set to Aleksei Iupinov
  • Target version set to 2019

#7 Updated by Aleksei Iupinov over 1 year ago

  • Description updated (diff)

Update: the gather kernel now works, as does spread, on AMD/Intel/NVIDIA

https://github.com/yupinov/gromacs/tree/pme_opencl_dirty (the code is a mess and likely won't even compile without OpenCL)

#8 Updated by Aleksei Iupinov over 1 year ago

Haven't updated this in a while. Some preliminary changes have already made their way into master branch, more to come.

Conclusion on whether we want most of the common GPU (host) code in a shared implementation header, or in a source file made special (compiled by nvcc as CUDA) by CMake rules:
Whatever works. I have a working draft of such a CMake rule. If I finish it and it makes it through code review - good. Otherwise we will have to resort to two small CUDA/OpenCL source files including a large shared implementation header. Note that NB still has to face this de-duplication problem as well.

The answer to the naming rules, worked out with Szilard: given that OpenCL 1.2 doesn't allow true constexprs, we agreed to keep the same c_usefulConstant syntax for defines (which will have to be defined when compiling the OpenCL kernels). For run-time constants, which actually live in constant GPU memory (together with the kernel arguments), we agreed to use a cm_ prefix. I have to check whether I follow that everywhere in the PME GPU device-side code.
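The convention above can be illustrated with a hypothetical helper: since OpenCL 1.2 has no constexpr or templates, a c_-prefixed compile-time constant becomes a -D define in the build-options string handed to the kernel compiler. The function name and the specific defines here are illustrative, not GROMACS's actual API.

```cpp
#include <string>

// Hypothetical sketch: turn template parameters / compile-time constants
// into OpenCL JIT defines, keeping the c_ prefix agreed on above.
std::string makePmeKernelDefines(int pmeOrder, bool wrapX)
{
    std::string defines;
    defines += " -Dc_pmeOrder=" + std::to_string(pmeOrder);
    defines += wrapX ? " -Dc_wrapX=1" : " -Dc_wrapX=0";
    return defines; // eventually passed to clBuildProgram() as build options
}
```

In the kernel source the same identifiers can then be used as if they were constexprs, so the CUDA and OpenCL device code stay textually close.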

The answer to the last 3 checkboxes, about the possibilities of CUDA/OpenCL device code duplication: I will duplicate all the kernel code. All the macro decoration is not worth it for now with OpenCL 1.2.
The exceptions still possibly worth considering would be wrapping a few one-liners with exactly the same meaning (e.g. threadIdx.y and get_local_id(1) to get the thread index, __syncthreads() and barrier(CLK_LOCAL_MEM_FENCE) to synchronize block-local memory), as the checkbox above says. Even that won't do much to reduce the number of differing lines, though.
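For reference, the one-liner wrappers mentioned above could have looked something like this in a shared device-code header (a hypothetical sketch, not actual GROMACS code; as the comment concludes, the kernels were duplicated instead):

```cpp
// Hypothetical shared device-code header selecting per-backend one-liners.
#ifdef __OPENCL_C_VERSION__
/* OpenCL C device code */
#    define gmxDeviceLocalId(dim) get_local_id(dim)
#    define gmxDeviceSyncLocal() barrier(CLK_LOCAL_MEM_FENCE)
#else
/* CUDA device code */
#    define gmxDeviceLocalId(dim) \
        ((dim) == 0 ? threadIdx.x : ((dim) == 1 ? threadIdx.y : threadIdx.z))
#    define gmxDeviceSyncLocal() __syncthreads()
#endif
```

Note the mismatch this papers over: CUDA exposes the thread index as struct members while OpenCL uses a dimension-indexed function, which is exactly why such wrappers save few lines overall.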

#9 Updated by Aleksei Iupinov over 1 year ago

  • Description updated (diff)

#10 Updated by Aleksei Iupinov over 1 year ago

DeviceBuffer functions are in master.
As noted in #2498, changePinningPolicy will have to become more permissive for OpenCL allocations, one way or another, as there is no way to do one-sided host pinning in the CUDA sense.
Still figuring out how to handle a persistent PME GPU context data struct.

#11 Updated by Aleksei Iupinov over 1 year ago

  • Description updated (diff)

#12 Updated by Aleksei Iupinov over 1 year ago

  • Related to Task #2402: PME kernels general performance improvements added

#13 Updated by Szilárd Páll over 1 year ago

  • Related to Task #2524: struct alignment/packing for OpenCL host & device code added

#14 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related patchset '19' for Issue #2453.
Uploader: Mark Abraham
Change-Id: gromacs~master~I85a01bfc92ec3a108825414b14e2be3731433c9a
Gerrit URL: https://gerrit.gromacs.org/7818

#15 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related patchset '6' for Issue #2453.
Uploader: Mark Abraham
Change-Id: gromacs~master~I03062c908db6dad9bbc8c62accdc4707b03ff527
Gerrit URL: https://gerrit.gromacs.org/7861

#16 Updated by Mark Abraham over 1 year ago

There were some open questions about naming some constants at https://gerrit.gromacs.org/#/c/7924/5 before we submitted it. We should reconsider these once some more stuff is in.

#17 Updated by Mark Abraham about 1 year ago

Currently our task assignment will run PME tasks on an Intel OpenCL device, which segfaults on my laptop. Not sure where the issue is (it worked once), but we need to at least revisit our defaults for task assignment.

#18 Updated by Paul Bauer 11 months ago

  • Status changed from New to Resolved

I resolved this for now because all the checkboxes have been ticked off, but please tell me if this should be retargeted instead.

#19 Updated by Paul Bauer 11 months ago

  • Target version changed from 2019 to 2020

retargeted because of the remaining subtasks

#20 Updated by Szilárd Páll 8 months ago

  • Description updated (diff)

#21 Updated by Szilárd Páll 8 months ago

  • Description updated (diff)
