PME GPU tuning
Currently the PME kernels prefer block sizes that were set long ago (c_[spread/gather/solve]MaxWarpsPerBlock).
These should be specialized for OpenCL, and could be looked at again for CUDA as well. Really, it should be a recurring pre-release task, not just for PME :-) As long as someone steps up.
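
If this does become a recurring task, a small sweep harness could make it cheap to repeat. The following is only a rough sketch, not existing GROMACS code: `bestOfN`, `pickFastestWarpsPerBlock` and the `measureSeconds` callback are hypothetical names, and the callback is assumed to launch the kernel with the given warps-per-block value, wait for it to finish, and return the measured time.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical helper (not part of GROMACS): runs `work` `repeats` times and
// returns the shortest wall-clock time in seconds. `work` is assumed to block
// until the kernel has finished (i.e. it includes any device synchronization).
double bestOfN(const std::function<void()>& work, int repeats)
{
    assert(repeats > 0);
    using Clock    = std::chrono::steady_clock;
    double fastest = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r)
    {
        const auto                    start   = Clock::now();
        work();
        std::chrono::duration<double> elapsed = Clock::now() - start;
        fastest                               = std::min(fastest, elapsed.count());
    }
    return fastest;
}

// Hypothetical helper (not part of GROMACS): evaluates each candidate
// warps-per-block value with `measureSeconds` and returns the cheapest one.
// On ties the earliest candidate in the list wins.
int pickFastestWarpsPerBlock(const std::vector<int>&           candidates,
                             const std::function<double(int)>& measureSeconds)
{
    assert(!candidates.empty());
    int    best        = candidates.front();
    double bestSeconds = std::numeric_limits<double>::infinity();
    for (int warps : candidates)
    {
        const double seconds = measureSeconds(warps);
        if (seconds < bestSeconds)
        {
            bestSeconds = seconds;
            best        = warps;
        }
    }
    return best;
}
```

In practice `measureSeconds` would wrap one of the spread/gather/solve launches in `bestOfN` for a representative grid size, and the sweep would be run per backend and per device generation, since the best value for CUDA need not match the best value for OpenCL.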