Feature #2054: PME on GPU
PME OpenCL porting effort
With porting PME from CUDA to OpenCL I'm first going with a dirty code with lots of duplication to see how to strike a balance between neatness and extensibility. Most of the host-side logic is quite easy to wrap to look the same in CUDA/OpenCL since there is no C++ limitations.
Functionality achieved (https://github.com/yupinov/gromacs/tree/pme_opencl_gerrit):
- PME OpenCL kernels passing unit tests on NVIDIA and AMD GPUs;
still broken fixed upstream in with RoCM 2.0.
check correctness on Intel;
- document and cleanup FIXMEs;
- HOST-SIDE STUFF:
- device buffer wrappers (allocate, deallocate, reallocate, copyToDevice, copyToHost, clear) - partially on gerrit, mostly done
- 4 kernel launch parameters encapsulated in a struct (grid size, block size, shared memory size, GPU stream) - mostly done
- GPU kernel launch unrolling wrapper with variadic arguments - mostly done
- host allocation/deallocation wrappers
- investigate OpenCL host C++ allocator possibilities
- Synchronization event wrapper class - mostly done
- split PME kernels and their launch functions - the latter can look the same, and OpenCL kernels are easier to treat in separate files
- conclude whether most of the common GPU host code should be in headers or in cpp_also_compiled_as_cu
- move over some relevant OpenCL helpers from NB into gpu_utils (or whatever it will be called by 2019)
- teach the OpenCL kernel compiler a folder path different from "mdlib/nbnxn_ocl"
- enhance all the PME google unit/overall tests to treat OpenCL gracefully
- consider retaining PME kernels in unit tests (60 MB VRAM per single test run = bad...)
- import bulk of clFFT
- make Gromacs link to system clFFT by default
- DEVICE-SIDE STUFF:
- deal with OpenCL (1.x only?) not handling DeviceBuffers in structs - currently done with a macro which either passes all the buffer separately into each device function, or assigns relevant pointers from the struct otherwise (like in CUDA)
- deal with kernel argument structure definition to exclude DeviceBuffers - for now in a device-side duplicate I just replaced those with char dummies of equivalent size, also added attribute((aligned(8))) on host/device
- deal with OpenCL 1.x not permitting templates - the template parameters can be wrapped in ugly macros, and replacement defines can be passed into the compiler
- consider naming rules for defines which replace template parameters/constexprs
- wrap fully similar things like threadIdx/get_local_id into helpers
- consider duplicating CUDA/OpenCL kernels fully - would keep the code comparable with a diff tool, extensbile, and allows to not use macro workarounds
- consider transforming single kernel source for CUDA/OpenCL before JIT, e.g. cut template parts out. Tempting, but a bit of a black magic - would allow encoding many of the porting workarounds though
- consider also keeping common inlineable device code in separate files, and most of the wrappers/ workarounds/macro bombs in different files?
- consider implementing OpenCL counterparts to CUDA shuffle reductions in solve/gather
Support persistent device context-derived data in PME tests
PME OpenCL will need to not recompile kernels for running each
unit test. With this in mind, a persistent PmeGpuProgram class
is tasked with GPU kernel setup, and passed around, using a typedef.
The purpose of the class is to hold the PME program data that should
only be set up once and live forever for the given device context.
PmeGpuProgramImpl structure is now tasked with managing
function pointers to the CUDA kernels' instances, and will later
be tasked with compiling OpenCL kernels.
#7 Updated by Aleksei Iupinov about 1 year ago
- Description updated (diff)
Update: gather kernel working as well as spread on AMD/Intel/NVidia
https://github.com/yupinov/gromacs/tree/pme_opencl_dirty (the code is a mess and likely won't even compile without OpenCL)
#8 Updated by Aleksei Iupinov about 1 year ago
Haven't updated this in a while. Some preliminary changes have already made their way into master branch, more to come.
Conclusion to whether we want most common GPU (host) code in a shared implementation header, or a source file that is made special (compiled by nvcc with CUDA) by cmake rules:
Whatever works. I have a working draft of such a cmake rule. If I finish it and it makes through code review - good. Otherwise we will have to resolve to 2 small CUDA/OpenCL source files including large shared implmenetation header. Note that NB still has to face this de-duplication problem as well.
The answer to the naming rules we have worked out with Szilard - given that OpenCL 1.2 doesn't allow for true constexprs, we agreed to allow using same c_usefulConstant syntax for defines (which will have to defined when compiling the OpenCL kernels). For run-time constants, which actually live in a constant GPU memory (together with the kernel arguments), we agreed to use cm_ prefix. I have to see whether I follow that everywhere in the PME GPU device-side code.
The answer to the last 3 checkboxes ticked, about possibilities of CUDA/OpenCL device duplication - I will duplicate all the kernel code. All the macro decoration is not worth it for now with OpenCL 1.2.
The exceptions still possibly to consider would be wrapping a few one-liners with 100% same meaning (e.g. threadIdx.y and get_local_id(1) to get index of the thread, __syncthreads() and barrier(CLK_LOCAL_MEM_FENCE) to synchronize block-local memory), as the checkbox above says. That still won't do much for reducing number of different lines though.
#10 Updated by Aleksei Iupinov about 1 year ago
DeviceBuffer functions are in master.
As noted in #2498, changingPinningPolicy will have to become more permissive for OpenCL allocations, one way or another, as there are no way to do one-sided host pinning in CUDA sense.
Still figuring out the way to handle a persistent PME GPU context data struct.