Investigate OpenCL + MPI
DD + OpenCL is broken (only MPI with a single device per physical node works); how to fix this needs investigation.
Fix multiple tMPI ranks per OpenCL device
The OpenCL context and program objects were stored in the gpu_info
struct, which was assumed to be constant per compute host and was
therefore shared across the tMPI ranks. Hence, gpu_info was initialized
once, and a single pointer to this data was used by all ranks.
As a result, the OpenCL context and program objects of different ranks
sharing a single device got overwritten/corrupted by one another.
- MPI still segfaults in clCreateContext() with multiple ranks per node,
  both with and without GPU sharing, so no change on that front.
- The AMD OpenCL runtime overhead with all hardware threads in use is
  quite significant; as a short-term solution we should consider avoiding
  HT by launching fewer threads (and/or warning the user).
Fix multiple MPI ranks per node with OpenCL
As in the thread-MPI case, the source of the issue was the hardware
detection code broadcasting the outcome of GPU detection within a node.
The OpenCL platform and device IDs are handles to OpenCL-internal
entities and differ across processes even when both the platform and
the device(s) are shared. Broadcasting them therefore caused corruption
at context creation on all ranks other than the first rank in the node
(which performed the detection).
This change disables the GPU data broadcasting for OpenCL with MPI.