Task #2498

Updated by Aleksei Iupinov over 2 years ago

One aspect of good GPU performance is providing fast asynchronous data transfers, potentially overlapping with GPU compute.
With the CUDA implementation we care about this quite a lot, having designed HostAllocationPolicy around cudaHostRegister to provide aligned and pinned host memory allocations.
With OpenCL, it seems we haven't cared so far, as we only have ocl_pmalloc() making plain 16-byte-aligned allocations and a meager TODO to at least use 4k pages the way CUDA supposedly does.
Looking around on the internet, it seems that the OpenCL runtime works in terms of mapped memory instead of pinned memory, and also expects the developer to relinquish some of the control over memory allocation for the purpose of pinning/being fast, as all the workflows seem to require managing both the host allocation and a corresponding device-side cl_mem buffer. One is expected to call clCreateBuffer (https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateBuffer.html) with the CL_MEM_USE_HOST_PTR flag and then use clEnqueueMapBuffer (https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clEnqueueMapBuffer.html) for producing the new mapped/"pinned" host-side pointer and for synchronisation, or something like that. TODO: add more links.
One description is here in a very old NVIDIA OpenCL best practices guide in 3.1.1:
http://www.nvidia.com/content/cudazone/CUDABrowser/downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf
There are more discussions on the internet about whether to use CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR, when to call map/unmap, etc.

In short, while I like our CUDA pinning and its C++ wrapping, the same might be impossible to design for OpenCL, and one would likely have to think in terms of a host+device buffer pair.
To reiterate the core issue: our current HostAllocationPolicy, as the name implies, works for CUDA pinning, but is likely not fit to accommodate proper host/device OpenCL memory handling.
With PME OpenCL, I am merely sidestepping this design problem for now, reusing the ocl_pmalloc() implementation.
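
For reference, the "at least use 4k pages" TODO mentioned above could look roughly like the sketch below. It is only an illustration of a page-aligned host allocation policy plugged into a standard container; the names PageAlignedAllocationPolicy, PolicyAllocator, isPageAligned and makePageAlignedBuffer are hypothetical and not taken from the code base, and the 4096-byte page size is an assumption. It addresses alignment only: nothing here pins or registers memory with CUDA or OpenCL, so the host+device buffer pair question stays open.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical policy (illustrative only): page-aligned host allocations.
// The 4096-byte page size is an assumption, not queried from the OS.
struct PageAlignedAllocationPolicy
{
    static constexpr std::size_t c_pageSize = 4096;

    static void* malloc(std::size_t bytes)
    {
        // C++17 aligned operator new; throws std::bad_alloc on failure.
        return ::operator new(bytes, std::align_val_t{ c_pageSize });
    }
    static void free(void* p) noexcept
    {
        ::operator delete(p, std::align_val_t{ c_pageSize });
    }
};

// Minimal standard-conforming allocator that forwards to a policy class.
template<typename T, typename Policy>
struct PolicyAllocator
{
    using value_type = T;

    PolicyAllocator() = default;
    template<typename U>
    PolicyAllocator(const PolicyAllocator<U, Policy>&) noexcept {}

    template<typename U>
    struct rebind
    {
        using other = PolicyAllocator<U, Policy>;
    };

    T*   allocate(std::size_t n) { return static_cast<T*>(Policy::malloc(n * sizeof(T))); }
    void deallocate(T* p, std::size_t) noexcept { Policy::free(p); }
};

// All instances are stateless, so any two compare equal.
template<typename T, typename U, typename Policy>
bool operator==(const PolicyAllocator<T, Policy>&, const PolicyAllocator<U, Policy>&) noexcept
{
    return true;
}
template<typename T, typename U, typename Policy>
bool operator!=(const PolicyAllocator<T, Policy>&, const PolicyAllocator<U, Policy>&) noexcept
{
    return false;
}

// Returns true if the pointer lies on an (assumed) 4k page boundary.
inline bool isPageAligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % PageAlignedAllocationPolicy::c_pageSize == 0;
}

using PageAlignedFloatVector = std::vector<float, PolicyAllocator<float, PageAlignedAllocationPolicy>>;

// Convenience helper: a zero-initialised, page-aligned host buffer of n floats.
inline PageAlignedFloatVector makePageAlignedBuffer(std::size_t n)
{
    PageAlignedFloatVector buffer(n, 0.0F);
    assert(isPageAligned(buffer.data()));
    return buffer;
}
```

A buffer obtained via makePageAlignedBuffer(n) behaves like a normal std::vector<float>, so it could be passed as the host pointer to whatever transfer call is in use. Whether a given OpenCL runtime treats such memory as "fast" is implementation-dependent and would need measuring.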
