PME has two reduction stages: a mandatory per-atom force reduction in the gather kernel, and an optional reduction of the 7 global energy/virial components in the solve kernel.
The PME CUDA kernels implement these reductions either through shared memory or with the faster CUDA warp shuffle intrinsics.
The PME OpenCL kernels implement only the shared-memory (__local in OpenCL terms) reductions.
It should be beneficial to also implement reduction variants that use the AMD/Intel/... vendor intrinsics.
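For reference, the two CUDA reduction styles can be sketched as below. This is a minimal illustration and not the actual PME kernel code: it sums a plain float array instead of forces or energy/virial components, and all names (reduceShared, reduceShuffle, reduceOnGpu, c_blockSize) are hypothetical. It assumes a block size that is a power of two and a multiple of the warp size.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical block size: a power of two and a multiple of the warp size.
constexpr int      c_blockSize    = 128;
constexpr unsigned c_fullWarpMask = 0xffffffffu;

// Variant 1: tree reduction through shared memory
// (the style the OpenCL kernels use with __local memory).
__global__ void reduceShared(const float* in, float* out, int n)
{
    __shared__ float sm[c_blockSize];
    const int tid = threadIdx.x;
    const int i   = blockIdx.x * blockDim.x + tid;
    sm[tid]       = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
        {
            sm[tid] += sm[tid + stride];
        }
        __syncthreads();
    }
    if (tid == 0)
    {
        atomicAdd(out, sm[0]); // one atomic per block
    }
}

// Variant 2: warp-level reduction with shuffle intrinsics,
// no shared memory and no block-wide barriers.
__global__ void reduceShuffle(const float* in, float* out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    float     v = (i < n) ? in[i] : 0.0f; // out-of-range lanes still take part
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
    {
        v += __shfl_down_sync(c_fullWarpMask, v, offset);
    }
    if (threadIdx.x % warpSize == 0)
    {
        atomicAdd(out, v); // one atomic per warp
    }
}

// Host helper (hypothetical): runs one of the two variants and returns the sum.
float reduceOnGpu(const std::vector<float>& h, bool useShuffle)
{
    const int n = static_cast<int>(h.size());
    if (n == 0)
    {
        return 0.0f;
    }
    float* dIn  = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, sizeof(float));
    const int blocks = (n + c_blockSize - 1) / c_blockSize;
    if (useShuffle)
    {
        reduceShuffle<<<blocks, c_blockSize>>>(dIn, dOut, n);
    }
    else
    {
        reduceShared<<<blocks, c_blockSize>>>(dIn, dOut, n);
    }
    float             result = 0.0f;
    const cudaError_t status = cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    assert(status == cudaSuccess);
    (void)status;
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Calling reduceOnGpu(values, true) and reduceOnGpu(values, false) on the same input should give matching sums up to floating-point summation order. An OpenCL counterpart of the second variant would need vendor or sub-group reduction intrinsics in place of __shfl_down_sync.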
Szilard asked about OpenCL use: