As discussed on https://gerrit.gromacs.org/#/c/7924/, we currently treat OpenCL kernel execution width from a CUDA-centric point of view.
The function getWarpSize() queries CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE for a test kernel and accordingly returns 64 on AMD and 32 on NVIDIA.
We should likely be using vendor extensions (CL_DEVICE_WAVEFRONT_WIDTH_AMD from cl_amd_device_attribute_query, CL_DEVICE_WARP_SIZE_NV from cl_nv_device_attribute_query, CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE from cl_intel_subgroups) and only use CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE as a fallback. We should also be querying this "warp" size for each kernel separately and storing it.
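The selection order proposed above could be sketched roughly as follows. This is only an illustration of the logic, not GROMACS code: ExecWidthQueries, selectExecWidth and KernelExecWidths are hypothetical names, and the sketch assumes the actual OpenCL queries (clGetDeviceInfo() / clGetKernelWorkGroupInfo()) have already been made elsewhere, with an unsupported query represented as std::nullopt.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>

// Hypothetical container for the width queries made for one compiled kernel.
// A real implementation would fill these from the OpenCL runtime;
// std::nullopt means the query is not supported on this device.
struct ExecWidthQueries
{
    // Result of a vendor extension query, e.g. CL_DEVICE_WAVEFRONT_WIDTH_AMD
    // or CL_DEVICE_WARP_SIZE_NV.
    std::optional<std::size_t> vendorWidth;
    // Result of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE for this kernel.
    std::optional<std::size_t> preferredMultiple;
};

// Prefer the vendor-reported width and only fall back to the preferred
// work-group size multiple. Returns 0 when neither value is available.
std::size_t selectExecWidth(const ExecWidthQueries& queries)
{
    if (queries.vendorWidth && *queries.vendorWidth > 0)
    {
        return *queries.vendorWidth;
    }
    return queries.preferredMultiple.value_or(0);
}

// Hypothetical per-kernel store, so that each kernel keeps its own width
// instead of sharing one value obtained from a test kernel.
class KernelExecWidths
{
public:
    void store(const std::string& kernelName, const ExecWidthQueries& queries)
    {
        widths_[kernelName] = selectExecWidth(queries);
    }

    // Returns 0 for a kernel that has not been stored.
    std::size_t get(const std::string& kernelName) const
    {
        const auto it = widths_.find(kernelName);
        return it != widths_.end() ? it->second : 0;
    }

private:
    std::map<std::string, std::size_t> widths_;
};
```

Keeping the width per kernel matters because a compiler may pick different execution widths for different kernels on the same device.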
Ensure minimum exec width of the PME OpenCL kernels
This change adds checks to make sure that we don't execute incorrect
kernels in the rare event that the Intel OpenCL compiler decides to
generate spread or gather kernels for 8-wide execution.
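
A check of this kind might look roughly like the sketch below. It is not the code from this change: checkMinExecWidth is a hypothetical helper, and the minimum width is passed in as a parameter because the exact value required by the spread and gather kernels is an assumption here.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: refuse to use a kernel that was compiled for a
// narrower execution width than the kernel's algorithm requires.
void checkMinExecWidth(const std::string& kernelName,
                       std::size_t        actualWidth,
                       std::size_t        minWidth)
{
    if (actualWidth < minWidth)
    {
        throw std::runtime_error("PME OpenCL kernel '" + kernelName
                                 + "' was compiled for execution width "
                                 + std::to_string(actualWidth)
                                 + ", but at least "
                                 + std::to_string(minWidth) + " is required");
    }
}
```

Assuming, for illustration only, a required minimum of 16, a call such as checkMinExecWidth("spread", 8, 16) would throw, while checkMinExecWidth("spread", 32, 16) would return normally.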