Task #2792: Improvement of PME gather and spread CUDA kernels
apply maintainability updates across all GPU kernels
To keep the GPU device code maintainable, and before the CUDA and OpenCL versions diverge greatly, we need to introduce in the OpenCL kernels the simplest form of the changes already made in the CUDA PME kernels:
- add threadsPerAtom=16 constant
- move spline calculation to separate clh file
- introduce the c_recalculateSplines paths in spread/gather
- introduce c_useAtomDataPrefetch code-paths
The items above are the low-effort maintenance changes needed to avoid major code divergence. Implementing threadsPerAtom=4 can be considered separately: in contrast with the above, which is mostly code-path (re)organization, it is a new feature to be implemented.
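
As a rough illustration of what compile-time code-path selection of this kind looks like, here is a minimal host-side C++ sketch. It is not the GROMACS implementation. Only the names threadsPerAtom, c_recalculateSplines and c_useAtomDataPrefetch come from this task. The block size, the helper functions and the path descriptions are hypothetical, and the thread-to-atom mapping assumes consecutive threads are grouped per atom.

```cpp
#include <string>

// Switches named in the task; the values here are illustrative assumptions.
constexpr int  c_threadsPerAtom      = 16;
constexpr bool c_recalculateSplines  = false;
constexpr bool c_useAtomDataPrefetch = true;

// Hypothetical block size, not taken from the task.
constexpr int c_blockSize = 128;
static_assert(c_blockSize % c_threadsPerAtom == 0,
              "block size must be a multiple of threadsPerAtom");

// Hypothetical helper: number of atoms one block of threads handles.
constexpr int atomsPerBlock(int blockSize, int threadsPerAtom)
{
    return blockSize / threadsPerAtom;
}

// Hypothetical helper: which atom within the block a thread works on,
// assuming consecutive threads are assigned to the same atom.
constexpr int atomIndexInBlock(int threadIndex, int threadsPerAtom)
{
    return threadIndex / threadsPerAtom;
}

// Hypothetical stand-in for a kernel body. The template parameters are
// compile-time constants, so each instantiation keeps only one branch
// of each condition after optimization.
template<bool recalculateSplines, bool useAtomDataPrefetch>
std::string describeKernelPath()
{
    std::string path;
    if (useAtomDataPrefetch)
    {
        path += "prefetch atom data; ";
    }
    else
    {
        path += "read atom data directly; ";
    }
    if (recalculateSplines)
    {
        path += "recalculate splines";
    }
    else
    {
        path += "load stored splines";
    }
    return path;
}
```

With these assumed values, `atomsPerBlock(c_blockSize, c_threadsPerAtom)` gives 8 atoms per block, while passing 4 as threadsPerAtom would give 32. In an OpenCL kernel the same switches would more likely be supplied as preprocessor defines at kernel build time than as template parameters, since OpenCL C has no templates.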