Task #2453

Updated by Aleksei Iupinov over 2 years ago

While porting PME from CUDA to OpenCL, I'm first going with dirty code with lots of duplication, to see how to strike a balance between neatness and extensibility. Most of the host-side logic is quite easy to wrap so that it looks the same in CUDA/OpenCL, since there are no C++ limitations there.
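
As a rough illustration of wrapping host-side logic so it looks the same for both APIs, here is a minimal C++ sketch. Everything in it is hypothetical and not taken from the branch: `HostStubBackend`, `roundTrip` and the method names are invented for illustration, and the stub uses plain host memory so that the example builds without the CUDA or OpenCL SDKs. A real backend would presumably wrap calls such as `cudaMalloc`/`cudaMemcpy` or `clCreateBuffer`/`clEnqueueWriteBuffer` behind the same interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in backend that uses host memory, so the sketch builds
// without the CUDA or OpenCL SDKs. A real backend would wrap
// cudaMalloc/cudaMemcpy or clCreateBuffer/clEnqueueWriteBuffer instead.
struct HostStubBackend
{
    using Buffer = std::vector<unsigned char>;

    static Buffer allocate(std::size_t bytes) { return Buffer(bytes); }

    static void copyToDevice(Buffer& dst, const void* src, std::size_t bytes)
    {
        assert(bytes <= dst.size());
        std::memcpy(dst.data(), src, bytes);
    }

    static void copyFromDevice(void* dst, const Buffer& src, std::size_t bytes)
    {
        assert(bytes <= src.size());
        std::memcpy(dst, src.data(), bytes);
    }
};

// Backend-agnostic host logic: the same source serves any backend that
// provides Buffer, allocate, copyToDevice and copyFromDevice.
template<typename Backend>
std::vector<float> roundTrip(const std::vector<float>& input)
{
    const std::size_t bytes = input.size() * sizeof(float);
    typename Backend::Buffer deviceBuffer = Backend::allocate(bytes);
    Backend::copyToDevice(deviceBuffer, input.data(), bytes);

    std::vector<float> output(input.size());
    Backend::copyFromDevice(output.data(), deviceBuffer, bytes);
    return output;
}
```

With this shape, the calling code (`roundTrip<SomeBackend>(data)`) stays identical whichever backend is selected at compile time; only the backend struct differs.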

Functionality already achieved:
- PME spline/spread and gather OpenCL kernels pass unit tests on NVIDIA, Intel and AMD GPUs (mixed-mode PME tests pass there as well, except on Intel, where NBs are incorrect).
https://github.com/yupinov/gromacs/tree/pme_opencl_dirty
- PME is fully working on AMDGPU-PRO OpenCL, but broken with the ROCm stack, solely because clFFT is broken there (https://github.com/clMathLibraries/clFFT/issues/218)


TODO for correctness of the development branch:
- take a glance at performance - the first glance revealed that not just FFT/solve, but also spread is 2.5x slower on Vega with AMDGPU-PRO - have to get to performance counters eventually. Implement the solve kernel as well, using unit tests;
- test more on AMD/Intel, change warp_size==32 assumptions (the tests for spread/gather already pass with preferred widths 16 and 64; a width of 8 is going to need some small modifications in kernels);
- try importing and using clFFT, verifying correctness of the full PME OpenCL with PmeTest/regression tests;
- take a glance at performance.
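
To make the warp_size==32 item above concrete, here is a small C++ sketch of deriving per-warp quantities from the execution width instead of hardcoding 32. The helper names are hypothetical and not from the branch. It also assumes, purely for illustration, that `order * order` threads cooperate on one atom (16 threads at PME order 4). Under that assumption, widths 16, 32 and 64 hold whole atoms, while a width of 8 would split one atom's threads across several warps, which is one plausible reason that case would need kernel changes.

```cpp
#include <cassert>

// Hypothetical helper, not taken from the branch: number of threads that
// cooperate on one atom, ASSUMING one thread per pair of spline indices,
// i.e. order * order threads.
inline int threadsPerAtom(int pmeOrder)
{
    assert(pmeOrder > 0);
    return pmeOrder * pmeOrder;
}

// Whole atoms that fit into one warp/wavefront/sub-group of the given width.
// Returns 0 when a single atom does not fit, i.e. its threads span several warps.
inline int atomsPerWarp(int warpSize, int pmeOrder)
{
    assert(warpSize > 0);
    return warpSize / threadsPerAtom(pmeOrder);
}

// True when all threads working on one atom sit in the same warp, so that
// intra-warp communication alone is enough for that atom.
inline bool atomFitsInWarp(int warpSize, int pmeOrder)
{
    return atomsPerWarp(warpSize, pmeOrder) >= 1;
}
```

For example, `atomsPerWarp(32, 4)` gives 2 and `atomsPerWarp(64, 4)` gives 4, while `atomFitsInWarp(8, 4)` is false.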


TODO for clean submission into master branch: checklist
(which should eventually have gerrit links for everything)
(There is also probably much more stuff that I've forgotten)
