The PME CUDA/OpenCL code is implemented with a hardcoded assumption of 16 threads per atom (PME_SPREADGATHER_THREADS_PER_ATOM).
This corresponds to spreading/gathering in 2 dimensions; to find the relevant code, search for the assignments of ithy and ithz in the spread and gather kernel files.
This logic has to be changed to only use 1 dimension to support execution widths < 16, e.g. on Intel.
Changing the assignments and the loop code themselves should be easy, but expect more pitfalls :-)
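
To illustrate the difference between the two layouts, here is a minimal host-side C++ sketch, not the actual kernel code. It assumes an interpolation order of 4 (so 4 x 4 = 16 threads per atom in the 2D case) and simulates the threads of one atom with a plain loop. All names except ithx/ithy/ithz are hypothetical, and the choice to keep ithz as the per-thread index in the 1D case is just one possible option.

```cpp
#include <array>
#include <cassert>

// Assumption: interpolation order 4, i.e. a 4x4x4 stencil of grid points per atom.
constexpr int c_order = 4;

// 2D decomposition: one thread per (ithy, ithz) pair, 16 threads per atom.
constexpr int c_threadsPerAtom2D = c_order * c_order;
// 1D decomposition: one thread per ithz value, 4 threads per atom.
constexpr int c_threadsPerAtom1D = c_order;

// Visit counter for every point of the stencil.
using Counts = std::array<int, c_order * c_order * c_order>;

int flatIndex(int ithx, int ithy, int ithz)
{
    return (ithz * c_order + ithy) * c_order + ithx;
}

// 2D layout: ithy and ithz come from the thread index, only ithx is looped over.
Counts visit2D()
{
    Counts counts{};
    for (int threadInAtom = 0; threadInAtom < c_threadsPerAtom2D; threadInAtom++)
    {
        const int ithy = threadInAtom % c_order;
        const int ithz = threadInAtom / c_order;
        for (int ithx = 0; ithx < c_order; ithx++)
        {
            counts[flatIndex(ithx, ithy, ithz)]++;
        }
    }
    return counts;
}

// 1D layout: only ithz comes from the thread index, ithy and ithx are both looped over.
Counts visit1D()
{
    Counts counts{};
    for (int threadInAtom = 0; threadInAtom < c_threadsPerAtom1D; threadInAtom++)
    {
        const int ithz = threadInAtom;
        for (int ithy = 0; ithy < c_order; ithy++)
        {
            for (int ithx = 0; ithx < c_order; ithx++)
            {
                counts[flatIndex(ithx, ithy, ithz)]++;
            }
        }
    }
    return counts;
}

// Returns true if every stencil point was visited exactly once.
bool coversStencilOnce(const Counts& counts)
{
    for (int c : counts)
    {
        if (c != 1)
        {
            return false;
        }
    }
    return true;
}
```

Both layouts touch each of the 64 stencil points exactly once; the 1D variant just moves the ithy loop into each thread, so an atom needs only 4 lanes instead of 16 and fits into an 8-wide execution unit.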
Ensure minimum exec width of the PME OpenCL kernels
This change adds checks to make sure that we don't execute incorrect
kernels in the rare event that the Intel OpenCL compiler decides to
generate spread or gather kernels for 8-wide execution.
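
A check of this kind could look roughly like the C++ sketch below. This is an assumption-laden illustration, not the code from the change: checkKernelExecWidth and c_threadsPerAtom are hypothetical names, and the execution width is passed in as a plain integer. In a real OpenCL host program the value could, for instance, be queried per kernel with clGetKernelWorkGroupInfo and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

```cpp
#include <stdexcept>
#include <string>

// Assumption: mirrors the value of PME_SPREADGATHER_THREADS_PER_ATOM (16).
constexpr int c_threadsPerAtom = 16;

// Hypothetical helper: rejects a compiled kernel whose execution width is
// narrower than the number of threads that cooperate on one atom.
void checkKernelExecWidth(const std::string& kernelName, int kernelExecWidth)
{
    if (kernelExecWidth < c_threadsPerAtom)
    {
        throw std::runtime_error("PME kernel " + kernelName + " was compiled for "
                                 + std::to_string(kernelExecWidth)
                                 + "-wide execution, but at least "
                                 + std::to_string(c_threadsPerAtom) + " is required");
    }
}
```

Called once per spread and gather kernel right after compilation, such a check turns a silently wrong result into an early, explicit error.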
Fix OpenCL gather reduction
On >=16-wide execution it is correct (narrower is checked and excluded
TODO: Consider changing the default on NVIDIA & Intel where offloading
PME is generally not advantageous to performance.