GPU-accessed memory page-locking and page sizes
In order to have fast asynchronous transfers between host and CUDA GPUs, host memory buffers need to be page-locked (aligned to the memory page size and registered with CUDA via the cudaHostRegister() function).
There is a crude hack for PME GPU purposes which page-aligns the existing coordinate, charge and force buffers: https://gerrit.gromacs.org/#/c/6578
There is also tangentially related work by Mark: https://gerrit.gromacs.org/#/c/6552
No matter which code we end up using, the important questions are whether and how we would want to use page-aligned memory conditionally (e.g. based on whether PME runs on the CPU or the GPU), and whether we could face problems or limits otherwise
(consider a dozen ranks, each using a dozen page-aligned buffers, each padded to whatever large page size the system uses).
One long-term solution would be a page-locked memory provider object that minimizes the amount of page-locked memory in use.
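A hypothetical sketch of such a provider (class name and interface are illustrative, not from the GROMACS code; the CUDA pinning calls are again shown as comments only): it hands out page-aligned blocks and caches freed ones for reuse, so the total amount of pinned memory stays bounded instead of growing with every allocation:

```cpp
#include <cstdlib>
#include <map>
#include <unistd.h>

// Illustrative "page-locked memory provider" sketch: reuses freed blocks
// so that repeated acquire/release cycles do not keep pinning new pages.
class PinnedMemoryProvider
{
public:
    void* acquire(size_t bytes)
    {
        const size_t pageSize = static_cast<size_t>(sysconf(_SC_PAGESIZE));
        const size_t padded   = ((bytes + pageSize - 1) / pageSize) * pageSize;
        // Reuse a cached block of sufficient size if one is available.
        auto it = freeBlocks_.lower_bound(padded);
        if (it != freeBlocks_.end())
        {
            void* p = it->second;
            freeBlocks_.erase(it);
            return p;
        }
        void* p = nullptr;
        if (posix_memalign(&p, pageSize, padded) != 0)
        {
            return nullptr;
        }
        // cudaHostRegister(p, padded, cudaHostRegisterDefault);
        sizes_[p] = padded;
        return p;
    }

    void release(void* p)
    {
        // Keep the block pinned and cached for reuse rather than
        // unregistering and freeing it immediately.
        freeBlocks_.emplace(sizes_[p], p);
    }

private:
    std::multimap<size_t, void*> freeBlocks_; // padded size -> cached block
    std::map<void*, size_t>      sizes_;      // block -> its padded size
};
```

A real implementation would also unregister (cudaHostUnregister()) and free the cached blocks in a destructor, and would need a policy for evicting cached blocks when the pinned-memory footprint grows too large.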
#5 Updated by Szilárd Páll over 1 year ago
- Status changed from New to In Progress
Mark Abraham wrote:
I'm sure there's things to improve here moving forward!
Is there? I think most if not all use cases are covered by the (still) so-called HostAllocator implementation. If there are remaining cases relevant for this release and its current use cases, let's keep this issue open; otherwise, I suggest focusing on what we want to do next and organizing ideas/requirements on new issues that we can target specifically.