One aspect of good GPU performance is fast asynchronous host-device data transfers, potentially overlapping with GPU compute.
In the CUDA implementation we take care of this: HostAllocationPolicy is designed around cudaHostRegister to provide aligned, pinned host memory allocations.
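A minimal sketch of that CUDA-side pattern, assuming a page-aligned allocation registered with the driver (the function names allocatePinned/freePinned and the 4096-byte page size are illustrative, not the actual HostAllocationPolicy API):

```c
#include <cuda_runtime.h>
#include <stdlib.h>

/* Sketch: register an existing page-aligned host allocation so the CUDA
 * driver can DMA to/from it in async copies. Error handling abbreviated. */
void *allocatePinned(size_t bytes)
{
    void *ptr = NULL;
    /* 4096-byte alignment keeps the allocation on page boundaries
     * (illustrative; the real policy may choose alignment differently) */
    if (posix_memalign(&ptr, 4096, bytes) != 0)
    {
        return NULL;
    }
    if (cudaHostRegister(ptr, bytes, cudaHostRegisterDefault) != cudaSuccess)
    {
        free(ptr);
        return NULL;
    }
    return ptr;
}

void freePinned(void *ptr)
{
    cudaHostUnregister(ptr);
    free(ptr);
}
```

Once registered, the pointer can be passed to cudaMemcpyAsync and the copy can overlap with kernel execution on another stream.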
With OpenCL, it seems we have not cared so far: we only have pmalloc(), which makes plain 16-byte-aligned allocations, plus a meager TODO to at least use 4k pages the way CUDA supposedly does.
Looking around on the internet, it seems that OpenCL works in terms of mapped memory rather than pinned memory, so one is expected to manage both the host allocation and a corresponding device-side cl_mem buffer. For example, one calls clCreateBuffer (https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateBuffer.html) with the CL_MEM_USE_HOST_PTR flag and then uses clEnqueueMapBuffer (https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clEnqueueMapBuffer.html) to produce the new mapped/"pinned" host-side pointer.
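A sketch of that mapped-memory pattern, assuming a valid context and command queue (the wrapper name mapHostBuffer and its signature are illustrative; only the clCreateBuffer/clEnqueueMapBuffer calls come from the OpenCL API):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: wrap an existing host allocation in a cl_mem buffer, then map
 * it to obtain the host pointer the runtime treats as "pinned".
 * Error handling abbreviated. */
void *mapHostBuffer(cl_context context, cl_command_queue queue,
                    void *hostPtr, size_t bytes, cl_mem *bufferOut)
{
    cl_int err;
    /* CL_MEM_USE_HOST_PTR: the buffer aliases the caller's allocation */
    cl_mem buffer = clCreateBuffer(context,
                                   CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                   bytes, hostPtr, &err);
    if (err != CL_SUCCESS)
    {
        return NULL;
    }
    /* Blocking map; the returned pointer is what host code should use */
    void *mapped = clEnqueueMapBuffer(queue, buffer, CL_TRUE,
                                      CL_MAP_READ | CL_MAP_WRITE,
                                      0, bytes, 0, NULL, NULL, &err);
    if (err != CL_SUCCESS)
    {
        clReleaseMemObject(buffer);
        return NULL;
    }
    *bufferOut = buffer;
    return mapped;
}
```

The caller is responsible for the matching clEnqueueUnmapMemObject and clReleaseMemObject, which is precisely the host-plus-device lifetime management that a host-only allocation policy cannot express.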
One description is in section 3.1.1 of a very old NVIDIA OpenCL best practices guide.
There are further discussions on the internet about whether to use CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR, when to call map/unmap, and so on.
But to reiterate the core issue: our current HostAllocationPolicy, as its name implies, works for CUDA pinning, but is likely not fit to accommodate proper host/device OpenCL memory handling.
With PME OpenCL, I will have to sidestep this design problem.
Ensure PME with OpenCL does not attempt to pin
Host-only memory pinning was designed with CUDA in mind, while OpenCL
requires managing both host- and device-side memory buffers for
efficient mapping, which is not yet implemented.
This change teaches the PME module to understand what pinning policy
is appropriate to the build configuration, so that the setup of data
structures in various parts of the code can use a pinning policy that