Task #2792: Improvement of PME gather and spread CUDA kernels
Update Constant/Variable naming in the PME GPU kernels.
Additional variables and constants were required to support the 4 threads per atom implementation, and a new naming scheme was created for them. However, the existing variable names were not updated to match the new scheme: many of them are also used by the OpenCL code, so they were left unchanged to minimise the work.
For clarity we should have consistent naming between the 4 threads per atom variables and the 16 threads per atom variables. This will require changing the OpenCL code as well.
Update PME CUDA spread/gather
Adds additional templated variants of the CUDA spread and
gather kernels. These allow the use of 4 threads per atom instead of
16, and allow the spline data to be recalculated in the spread
instead of saved to global memory and reloaded.
The combinations mean we have 4 different kernels that can be called
depending on which is most appropriate for the problem size and
hardware (to be decided heuristically). By default the existing method
is used (16 threads per atom, saving and reloading of spline data).
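As a rough illustration of how two compile-time switches yield the four
variants described above, here is a minimal host-side C++ sketch. All
names (gatherKernelSketch, selectGatherKernel, KernelInfo, c_pmeOrder,
c_warpSize) are hypothetical and not taken from the actual code. It
assumes an interpolation order of 4, so that 4 threads per atom
corresponds to order and 16 to order squared, and a warp size of 32.
The stand-in function only reports its configuration; a real kernel
would be a __global__ template launched on the device.

```cpp
// Assumed values for illustration only.
constexpr int c_pmeOrder = 4;  // assumed interpolation order
constexpr int c_warpSize = 32; // assumed warp size

// What a given variant would work with.
struct KernelInfo
{
    int  threadsPerAtom;
    int  atomsPerWarp;
    bool recalculateSplines;
};

// Stand-in for a templated kernel: both switches are template
// parameters, so each combination is compiled as a separate variant.
template<bool useOrderThreads, bool recalculateSplines>
KernelInfo gatherKernelSketch()
{
    constexpr int threadsPerAtom =
            useOrderThreads ? c_pmeOrder : c_pmeOrder * c_pmeOrder;
    return { threadsPerAtom, c_warpSize / threadsPerAtom, recalculateSplines };
}

using KernelPtr = KernelInfo (*)();

// Maps the two run-time choices onto one of the four instantiations.
KernelPtr selectGatherKernel(bool useOrderThreads, bool recalculateSplines)
{
    if (useOrderThreads)
    {
        return recalculateSplines ? &gatherKernelSketch<true, true>
                                  : &gatherKernelSketch<true, false>;
    }
    return recalculateSplines ? &gatherKernelSketch<false, true>
                              : &gatherKernelSketch<false, false>;
}
```

Under these assumptions, selectGatherKernel(false, false) corresponds to
the default path (16 threads per atom, spline data saved and reloaded),
while selectGatherKernel(true, true) corresponds to 4 threads per atom
with the spline data recalculated.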
Adds an option to disable the preloading of charges and
coordinates into shared memory; each thread then deals with
a single atom instead.
Removes the (currently disabled) PME_GPU_PARALLEL_SPLINE=1 code.