Task #3187

Task #2792: Improvement of PME gather and spread CUDA kernels

Template updated PME kernels using threads per atom

Added by Jonathan Vincent about 1 month ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Difficulty:
uncategorized

Description

Currently we template on a bool that selects between order (4) threads per atom and order*order (16) threads per atom. Only order 4 is currently supported.

Templating on the number of threads per atom directly could allow for a cleaner interface.
It could also allow for finer-grained control. Given that the work per atom is order*order*order (64), a different power-of-two number of threads per atom might also be interesting for different problem sizes.
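
As a rough illustration of the idea, the sketch below replaces the bool with an integer template parameter and derives the per-thread work from it. It is host-side C++ rather than an actual CUDA kernel, and all names (`AtomWorkSplit`, `atomsPerBlock`, `gridPointsPerThread`) are hypothetical, not taken from the GROMACS source.

```cpp
#include <cassert>

// Hypothetical sketch, not the GROMACS API: describes at compile time how the
// order^3 grid points of one atom are shared among threadsPerAtom threads.
template<int order, int threadsPerAtom>
struct AtomWorkSplit
{
    static constexpr int gridPointsPerAtom = order * order * order;

    static_assert(threadsPerAtom > 0 && (threadsPerAtom & (threadsPerAtom - 1)) == 0,
                  "threadsPerAtom must be a power of two");
    static_assert(threadsPerAtom <= gridPointsPerAtom,
                  "cannot use more threads than grid points per atom");
    static_assert(gridPointsPerAtom % threadsPerAtom == 0,
                  "grid points must divide evenly among the threads");

    // Grid points each thread has to process for its atom.
    static constexpr int gridPointsPerThread = gridPointsPerAtom / threadsPerAtom;
};

// Number of atoms one thread block can process, given its size.
template<int order, int threadsPerAtom>
int atomsPerBlock(int threadsPerBlock)
{
    assert(threadsPerBlock % threadsPerAtom == 0);
    return threadsPerBlock / threadsPerAtom;
}
```

With order 4, `AtomWorkSplit<4, 4>` gives 16 grid points per thread and `AtomWorkSplit<4, 16>` gives 4, matching the two cases the bool distinguishes today. An intermediate choice such as `AtomWorkSplit<4, 8>` would then need no interface change.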

Associated revisions

Revision 22118220 (diff)
Added by Jonathan Vincent about 1 month ago

Update PME CUDA spread/gather

Adds additional templated variants of the CUDA spread and
gather kernels, allowing the use of 4 threads per atom instead of
16 and allowing the spline data to be recalculated in the spread
instead of saved to global memory and reloaded.

The combinations give 4 different kernels that can be called,
depending on which is most appropriate for the problem size and
hardware (to be decided heuristically). By default the existing method
is used (16 threads per atom, saving and reloading of spline data).
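
One way to picture the four combinations is a small dispatch function that maps two runtime choices onto four template instantiations. The sketch below is host-side C++ with a stand-in function in place of a real `__global__` kernel. The names `spreadKernelStandIn` and `selectSpreadKernel` are hypothetical, and the heuristic that would set the two flags is deliberately left out, since it is still to be decided.

```cpp
#include <string>

// Hypothetical stand-in for a templated spread kernel. A real kernel would be
// a CUDA __global__ function; here it only reports which variant it is.
template<int threadsPerAtom, bool recalculateSplines>
std::string spreadKernelStandIn()
{
    return std::to_string(threadsPerAtom) + " threads per atom, "
           + (recalculateSplines ? "splines recalculated" : "splines saved and reloaded");
}

using KernelPtr = std::string (*)();

// Maps two runtime choices onto one of the four compile-time variants.
KernelPtr selectSpreadKernel(bool useOrderThreadsPerAtom, bool recalculateSplines)
{
    constexpr int order = 4; // only order 4 is supported

    if (useOrderThreadsPerAtom)
    {
        if (recalculateSplines)
        {
            return &spreadKernelStandIn<order, true>;
        }
        return &spreadKernelStandIn<order, false>;
    }
    if (recalculateSplines)
    {
        return &spreadKernelStandIn<order * order, true>;
    }
    return &spreadKernelStandIn<order * order, false>;
}
```

In this sketch, `selectSpreadKernel(false, false)` corresponds to the default described above: 16 threads per atom, with spline data saved and reloaded.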

Added an additional option to disable the preloading of charges and
coordinates into shared memory, so that each thread instead
deals with a single atom.

Removed the (currently disabled) PME_GPU_PARALLEL_SPLINE=1 code
path.

Refs #2792 #3185 #3186 #3187 #3188

Change-Id: Ia48d8eb63e38d0d23eefd755dcc228ff9b66d3e6
