Task #2469
implement GPU timer reduction for reporting
Description
Unlike CUDA, where device-side timing has severe limitations (it cannot be done reliably with multiple streams), OpenCL has no such restriction. Although we have all the facilities needed to time GPU-side execution correctly, the reporting is missing the accumulation/reduction step.
A simple reduction across ranks should allow unrestricted reporting.
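As a rough illustration of what such a reduction could look like, here is a minimal sketch, not the actual implementation. The names GpuTimingSample, reduceGpuTimings and averageTimePerCallMs are hypothetical, and the per-rank values are passed in as a local vector purely to keep the example self-contained; in a real multi-rank run the summation would presumably be done with an MPI sum reduction instead.

```cpp
#include <vector>

// Hypothetical per-rank GPU timing record; the names are illustrative only
// and do not correspond to any existing data structure.
struct GpuTimingSample
{
    double kernelTimeMs; // accumulated device-side kernel time on this rank
    int    numCalls;     // number of timed kernel launches on this rank
};

// Sum-reduce the per-rank samples into a single total.
// Assumption: in an MPI run this loop would be replaced by a sum reduction
// over ranks (for example MPI_Reduce with MPI_SUM).
GpuTimingSample reduceGpuTimings(const std::vector<GpuTimingSample>& perRank)
{
    GpuTimingSample total{ 0.0, 0 };
    for (const GpuTimingSample& sample : perRank)
    {
        total.kernelTimeMs += sample.kernelTimeMs;
        total.numCalls += sample.numCalls;
    }
    return total;
}

// Average time per call over all ranks; returns 0 when nothing was timed.
double averageTimePerCallMs(const GpuTimingSample& total)
{
    return total.numCalls > 0 ? total.kernelTimeMs / total.numCalls : 0.0;
}
```

The reporting code would then print the reduced total (and the per-call average derived from it) rather than the values from a single rank.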
Related issues
History
#1 Updated by Szilárd Páll almost 3 years ago
- Related to Bug #2468: incorrect GPU timing reported with OpenCL and domain decomposition added