PME OpenCL kernels currently have more synchronisation points than their CUDA counterparts.
Some of those barriers should probably depend on minimal execution width (e.g. subgroup size?).
It might also be that some are not needed at all. The purpose of this issue is to track all of them.
Relaxing any barrier requires rerunning Ewald unit tests on all supported and relevant platforms.
Hence it is probably beneficial to achieve correctness on Intel GPUs first and only then start changing the barriers.
Relax OpenCL gather kernel barrier on AMD
Not needed on architectures with an execution width >32.
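
A minimal host-side sketch of how such a width-dependent barrier could be selected, assuming the execution width is known before the kernel is built. The names gather_barrier_required, format_gather_build_option and GATHER_NEEDS_BARRIER are hypothetical and only illustrate the idea; they are not taken from the actual code.

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumption from the note above: the barrier can be dropped only when the
 * execution width is strictly greater than 32 (e.g. 64-wide wavefronts). */
#define MAX_WIDTH_NEEDING_BARRIER 32

/* Returns true when the gather kernel should keep its barrier. */
static bool gather_barrier_required(int execution_width)
{
    return execution_width <= MAX_WIDTH_NEEDING_BARRIER;
}

/* Formats a hypothetical OpenCL build option into buf and returns the
 * result of snprintf. The kernel source could then guard the barrier with
 *   #if GATHER_NEEDS_BARRIER
 *       barrier(CLK_LOCAL_MEM_FENCE);
 *   #endif
 */
static int format_gather_build_option(char *buf, size_t size, int execution_width)
{
    return snprintf(buf, size, "-DGATHER_NEEDS_BARRIER=%d",
                    gather_barrier_required(execution_width) ? 1 : 0);
}
```

With this sketch a 64-wide device would get -DGATHER_NEEDS_BARRIER=0, while 16- and 32-wide devices keep the barrier.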
Fix OpenCL gather reduction
On >=16-wide execution it is correct (narrower is checked and excluded).
TODO: Consider changing the default on NVIDIA & Intel where offloading
PME is generally not advantageous to performance.
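
A rough sketch of the kind of check described in the gather reduction fix, written as plain C so it can be compiled on the host. It assumes the reduction combines 16 values per group in a tree pattern and that the execution width arrives as a preprocessor define. EXECUTION_WIDTH, REDUCTION_WIDTH and reduce_group are hypothetical names, and the tree pattern is an assumption about the kernel, not a description of it.

```c
/* Hypothetical define, normally passed at build time as -DEXECUTION_WIDTH=... */
#ifndef EXECUTION_WIDTH
#define EXECUTION_WIDTH 32
#endif

/* Narrower execution is rejected at compile time instead of giving wrong results. */
#if EXECUTION_WIDTH < 16
#error "gather reduction requires an execution width of at least 16"
#endif

#define REDUCTION_WIDTH 16

/* Serial model of a 16-lane tree reduction. On a GPU the inner loop would be
 * one lane per iteration, and skipping barriers between passes is only safe
 * when all 16 lanes execute in lockstep. */
static float reduce_group(const float *values)
{
    float scratch[REDUCTION_WIDTH];
    for (int i = 0; i < REDUCTION_WIDTH; i++)
    {
        scratch[i] = values[i];
    }
    /* Each pass halves the number of active lanes: 8, 4, 2, 1. */
    for (int stride = REDUCTION_WIDTH / 2; stride > 0; stride /= 2)
    {
        for (int lane = 0; lane < stride; lane++)
        {
            scratch[lane] += scratch[lane + stride];
        }
    }
    return scratch[0];
}
```

Building with -DEXECUTION_WIDTH=8 would stop at the #error line, which mirrors the "checked and excluded" behaviour mentioned above.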