Task #2514
Feature #2054: PME on GPU
Task #2453: PME OpenCL porting effort
PME OpenCL reductions with intrinsics
Description
PME has 2 reduction stages: an obligatory per-atom force reduction in gather, and an optional reduction of the 7 global energy/virial components (observables) in solve.
PME CUDA kernels implement the reductions in two versions: with shared memory, or with the faster CUDA shuffle intrinsics.
PME OpenCL kernels only implement the shared memory (__local in OpenCL terms) reductions.
It should be beneficial to implement versions of reductions with AMD/Intel/... intrinsics.
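For illustration, here is a host-side C++ sketch of the two reduction styles mentioned above. It is not the GROMACS kernel code: it only emulates on the CPU how a single warp/wavefront would sum one value per lane. The names `kLanes`, `Warp`, `reduceShared` and `reduceShuffle` are hypothetical, and the width of 32 lanes is an assumption (AMD GCN wavefronts are 64 wide). The shuffle variant mimics the semantics of a "shuffle down" cross-lane read, where every lane reads a register of the lane `offset` positions above it.

```cpp
#include <array>
#include <cstddef>

// Assumed execution width for this sketch only (32 on NVIDIA, 64 on AMD GCN).
constexpr std::size_t kLanes = 32;
using Warp = std::array<float, kLanes>;

// Shared-memory style tree reduction: partial sums live in a staging
// buffer that stands in for CUDA shared / OpenCL __local memory.
float reduceShared(const Warp& perLane)
{
    Warp shared = perLane;
    for (std::size_t stride = kLanes / 2; stride > 0; stride /= 2)
    {
        // On a GPU, a barrier would be needed between these steps.
        for (std::size_t lane = 0; lane < stride; ++lane)
        {
            shared[lane] += shared[lane + stride];
        }
    }
    return shared[0];
}

// Shuffle style reduction: values stay in per-lane "registers" and each
// lane adds the value held by lane + offset, with no staging buffer.
float reduceShuffle(const Warp& perLane)
{
    Warp regs = perLane;
    for (std::size_t offset = kLanes / 2; offset > 0; offset /= 2)
    {
        Warp next = regs; // all lanes read before any lane writes
        for (std::size_t lane = 0; lane < kLanes; ++lane)
        {
            // An out-of-range source lane yields the lane's own value,
            // which is how a shuffle-down read behaves at the upper edge.
            const float other = (lane + offset < kLanes) ? regs[lane + offset] : regs[lane];
            next[lane] = regs[lane] + other;
        }
        regs = next;
    }
    return regs[0]; // only lane 0 is guaranteed to hold the full sum
}
```

Both functions walk the same summation tree, so they return the same value for the same input. On a GPU the second form avoids the __local memory traffic and the barriers, which is where the expected benefit of an intrinsics-based OpenCL version would come from.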
AMD intrinsics
High-level description:
https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
Szilard asking about OpenCL use:
https://github.com/RadeonOpenCompute/ROCm/issues/189#issuecomment-325780455
Builtin "docs":
https://github.com/llvm-mirror/clang/blob/master/include/clang/Basic/BuiltinsAMDGPU.def
History
#1 Updated by Szilárd Páll over 2 years ago
Note that the same reduction optimization applies to the nonbonded kernels (at least for AMD; for Intel, Roland implemented it using Intel subgroup extensions).