Task #2453

Updated by Aleksei Iupinov over 2 years ago

While porting PME from CUDA to OpenCL, I'm first going with dirty code with lots of duplication, to see how to strike a balance between neatness and extensibility. Most of the host-side logic is quite easy to wrap to look the same in CUDA/OpenCL, since there are no C++ limitations there.

Functionality already achieved:
- PME Spline/spread and gather OpenCL kernels passing unit tests on NVIDIA, Intel and AMD GPUs (also, mixed mode PME tests passing there as well, except for Intel, where NBs are incorrect).
- PME fully working on AMDGPU-PRO OpenCL, but broken with the ROCm stack, only due to clFFT being broken (

TODO for correctness of the development branch:
- take a glance at performance - the first glance revealed not just FFT/solve, but also spread being 2.5x slower on Vega with AMDGPU-PRO - have to get to performance counters eventually. Implement solve kernel as well, using unit tests;
- test more on AMD/Intel, change warp_size==32 assumptions (the tests for spread/gather already pass with preferred widths 16 and 64, width of 8 is gonna need some small modifications in kernels);
- try importing and using clFFT, verifying correctness of the full PME OpenCL with PmeTest/regression tests;
- take a glance at performance.
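
To illustrate the warp_size==32 item above, here is a minimal host-side sketch of deriving launch parameters from a runtime execution width instead of a hard-coded 32. It is not the actual PME code: LaunchConfig, makeLaunchConfig, threadsPerAtom and warpsPerBlock are hypothetical names, and the assumption that each atom is handled by a fixed group of threads living inside a single warp/wavefront is made purely for illustration. In OpenCL the width itself could come from clGetKernelWorkGroupInfo with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical launch parameters for a spread/gather-style kernel.
struct LaunchConfig
{
    bool        valid;           // false if the width cannot hold whole atoms
    std::size_t atomsPerWarp;    // atoms processed by one warp/wavefront
    std::size_t atomsPerBlock;   // atoms processed by one block/work-group
    std::size_t threadsPerBlock; // block/work-group size
    std::size_t numBlocks;       // grid size in blocks/work-groups
};

// Integer division rounding up; the divisor must be non-zero.
inline std::size_t divideRoundUp(std::size_t numerator, std::size_t denominator)
{
    assert(denominator != 0);
    return (numerator + denominator - 1) / denominator;
}

// Derives the launch configuration from the execution width queried at run time.
// Assumption (illustrative only): every atom needs threadsPerAtom threads that
// all sit in the same warp/wavefront, so the width must be a multiple of it.
inline LaunchConfig makeLaunchConfig(std::size_t numAtoms,
                                     std::size_t executionWidth,
                                     std::size_t threadsPerAtom,
                                     std::size_t warpsPerBlock)
{
    LaunchConfig config = { false, 0, 0, 0, 0 };
    if (executionWidth == 0 || threadsPerAtom == 0 || warpsPerBlock == 0
        || executionWidth % threadsPerAtom != 0)
    {
        // E.g. width 8 with 16 threads per atom: the kernel itself would
        // have to change, so report the configuration as unsupported.
        return config;
    }
    config.valid           = true;
    config.atomsPerWarp    = executionWidth / threadsPerAtom;
    config.atomsPerBlock   = config.atomsPerWarp * warpsPerBlock;
    config.threadsPerBlock = executionWidth * warpsPerBlock;
    config.numBlocks       = divideRoundUp(numAtoms, config.atomsPerBlock);
    return config;
}
```

With 1000 atoms, 16 threads per atom and 4 warps per block (all made-up numbers), widths 16, 32 and 64 give 250, 125 and 63 blocks respectively, while a width of 8 is rejected, which is the case that would need kernel changes rather than just different launch parameters.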

TODO for clean submission into master branch: checklist
(which should eventually have gerrit links for everything)
(There is also probably much more stuff that I've forgotten)