### Simplify PME solve reduction

The solve reduction (7 energy/virial components) in the PME CUDA/OpenCL code is written in a rather contrived way and could be rewritten more simply, e.g.:

For each of the 7 components:

1. put the component into shared memory;
2. reduce shared memory with a binary tree reduction (with a barrier as long as the reduction stride is larger than the execution width);
3. add the component to global memory atomically on thread 0.

It might be a bit slower than the current approach, but it would simplify the code.
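A minimal CUDA sketch of the per-component loop proposed above, for illustration only. It is not the existing PME solve kernel: the names `reduceComponents`, `reduceOnDevice`, `c_numComponents`, and `c_blockSize`, the input layout (one value per thread per component), and the power-of-two block size are all assumptions. For simplicity the sketch keeps a block-wide barrier at every reduction step; dropping the barrier once the stride fits within a warp, as suggested above, would need warp-level synchronization such as `__syncwarp()` on recent NVIDIA architectures.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int c_numComponents = 7;   // energy/virial components, per the text
constexpr int c_blockSize     = 128; // assumed block size, must be a power of two

// Hypothetical kernel: perThread holds c_numComponents values per thread,
// laid out as perThread[component * totalThreads + globalThreadIndex].
__global__ void reduceComponents(const float* perThread, float* result)
{
    __shared__ float sm_buf[c_blockSize];
    const int tid          = threadIdx.x;
    const int totalThreads = gridDim.x * blockDim.x;
    const int gid          = blockIdx.x * blockDim.x + tid;

    for (int c = 0; c < c_numComponents; c++)
    {
        // Put the component into shared memory.
        sm_buf[tid] = perThread[c * totalThreads + gid];
        __syncthreads();

        // Binary tree reduction; a barrier follows every step in this sketch.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
        {
            if (tid < stride)
            {
                sm_buf[tid] += sm_buf[tid + stride];
            }
            __syncthreads();
        }

        // Thread 0 adds the block's partial sum to global memory atomically.
        if (tid == 0)
        {
            atomicAdd(&result[c], sm_buf[0]);
        }
    }
}

// Hypothetical host helper: every per-thread value of component c is set to
// (c + 1), so the expected sum is (c + 1) * numBlocks * c_blockSize.
void reduceOnDevice(int numBlocks, float* resultHost)
{
    const int          totalThreads = numBlocks * c_blockSize;
    std::vector<float> input(static_cast<size_t>(c_numComponents) * totalThreads);
    for (int c = 0; c < c_numComponents; c++)
    {
        for (int i = 0; i < totalThreads; i++)
        {
            input[c * totalThreads + i] = static_cast<float>(c + 1);
        }
    }

    float* d_input  = nullptr;
    float* d_result = nullptr;
    cudaMalloc(&d_input, input.size() * sizeof(float));
    cudaMalloc(&d_result, c_numComponents * sizeof(float));
    cudaMemcpy(d_input, input.data(), input.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, c_numComponents * sizeof(float));

    reduceComponents<<<numBlocks, c_blockSize>>>(d_input, d_result);

    // The blocking copy back to the host also waits for the kernel to finish.
    cudaMemcpy(resultHost, d_result, c_numComponents * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_input);
    cudaFree(d_result);
}
```

Calling `reduceOnDevice(4, out)` with a `float out[7]` array runs four blocks over the test input. Each component costs one full pass through shared memory plus one atomic per block, which is where the possible slowdown mentioned above would come from.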