Task #2453

Updated by Aleksei Iupinov over 2 years ago

In porting PME from CUDA to OpenCL, I'm first going with dirty code with lots of duplication, to see how to strike a balance between neatness and extensibility. Most of the host-side logic is quite easy to wrap to look the same in CUDA/OpenCL, since there are no C++ limitations.

Functionality already achieved:
- PME OpenCL kernels passing unit tests on NVIDIA, Intel and AMD GPUs (also, mixed mode PME tests passing there as well, except for Intel, where NBs are incorrect).
- PME fully working on AMDGPU-PRO OpenCL, but broken with the ROCm stack, only due to clFFT still being broken with ROCm.

TODO for correctness of the development branch:
- take a glance at performance - the first glance revealed not just FFT/solve, but also spread being 2.5x slower on Vega with AMDGPU-PRO - have to get to performance counters eventually.
- test more on AMD/Intel, change warp_size==32 assumptions (the tests for spread/gather already pass with preferred widths 16 and 64; a width of 8 is gonna need some small modifications in kernels);
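
To make the warp_size==32 point concrete, here is a minimal sketch of passing the execution width as a parameter instead of hard-coding it. It assumes a layout where order * order threads cooperate on each atom; that layout, and the names `atomsPerExecutionUnit`, `widthFitsLayout` and `checkExampleWidths`, are illustrative assumptions and not taken from the actual kernels.

```cpp
#include <cassert>

// Hypothetical helper (not an existing API): number of whole atoms that one
// execution unit (warp / wavefront / sub-group) can handle, ASSUMING that
// pmeOrder * pmeOrder threads cooperate on each atom.
constexpr int atomsPerExecutionUnit(int executionWidth, int pmeOrder)
{
    return executionWidth / (pmeOrder * pmeOrder);
}

// True when the width holds at least one whole atom and no threads are left
// over, i.e. the assumed layout fits without changing the kernel.
constexpr bool widthFitsLayout(int executionWidth, int pmeOrder)
{
    return atomsPerExecutionUnit(executionWidth, pmeOrder) >= 1
           && executionWidth % (pmeOrder * pmeOrder) == 0;
}

// The hard-coded case: width 32 with interpolation order 4.
static_assert(atomsPerExecutionUnit(32, 4) == 2, "32-wide unit holds 2 atoms");

// Runtime checks for the widths mentioned above, still with order 4.
inline void checkExampleWidths()
{
    assert(widthFitsLayout(16, 4));  // 1 atom per unit
    assert(widthFitsLayout(64, 4));  // 4 atoms per unit
    assert(!widthFitsLayout(8, 4));  // fewer threads than one atom needs
}
```

Under this assumed layout, widths 16 and 64 hold a whole number of atoms, while a width of 8 holds none. That would be one possible reason for a width of 8 to need kernel changes, though the real cause may differ.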

TODO for clean submission into master branch:
- checklist in subtasks (which should eventually have gerrit links for everything)
(There is also probably much more stuff that I've forgotten)