PME OpenCL kernels currently have additional synchronisation points compared to the CUDA ones.
Some of those barriers should probably depend on minimal execution width (e.g. subgroup size?).
It might also be that some are not needed at all. The purpose of this issue is to track all of them.
Relaxing any barrier requires rerunning Ewald unit tests on all supported and relevant platforms.
Hence it is probably beneficial to achieve correctness on Intel GPUs first and only then start changing the barriers.
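
To make the point about minimal execution width concrete, here is a rough host-side sketch in plain C of how one might decide whether a given barrier is a candidate for relaxation. The helper `spansMultipleSubGroups` and its parameters are hypothetical and are not part of GROMACS or the OpenCL API. It also assumes that consecutive linear thread indices are packed contiguously into subgroups, which would have to be verified per platform.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, for illustration only (not GROMACS or OpenCL API).
 *
 * Threads [firstThread, firstThread + threadCount) of a work-group exchange
 * data through local memory. Returns true when those threads span more than
 * one subgroup, in which case a work-group-wide barrier cannot be relaxed to
 * subgroup-level synchronisation.
 *
 * Assumption: linear thread indices map contiguously onto subgroups of
 * subGroupSize threads each.
 */
static bool spansMultipleSubGroups(int firstThread, int threadCount, int subGroupSize)
{
    assert(firstThread >= 0 && threadCount > 0 && subGroupSize > 0);
    const int firstSubGroup = firstThread / subGroupSize;
    const int lastSubGroup  = (firstThread + threadCount - 1) / subGroupSize;
    return firstSubGroup != lastSubGroup;
}
```

For example, 32 cooperating threads starting at index 0 fit into one subgroup when the execution width is 32, but not when it is 16 or 8, so the same barrier could be relaxable on one platform and mandatory on another. A `false` result only marks the barrier as a candidate: some subgroup-level synchronisation or memory fence may still be required, and the Ewald unit tests remain the check.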