Task #3188

Task #2792: Improvement of PME gather and spread CUDA kernels

re-enable parallel spline calculation for #threads/atoms > 4

Added by Szilárd Páll about 1 month ago. Updated about 1 month ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
mdrun
Target version:
Difficulty:
uncategorized

Description

The recent changes to the spread/gather kernels removed the parallel spline calculation (formerly enabled with the PME_GPU_PARALLEL_SPLINE macro), which allowed utilizing 12 (3×4) of the 16 threads per atom instead of 4, rather than letting most of them idle.

This should improve performance for small input sizes, where the 16 threads/atom kernel is already faster (hence it will also be important for an efficient pipelined spread and for PME decomposition); preliminary data based on the 2019 code base supports this.
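For illustration, a minimal CUDA sketch of the thread-to-work mapping this refers to; the names (splineWorkForThread, c_order, c_dim) are hypothetical, not the actual GROMACS identifiers. With interpolation order 4, the per-atom spline work splits into 3×4 = 12 independent (dimension, spline-index) pairs, so 12 of the 16 threads per atom can each take one pair:

// Sketch only: maps each of an atom's 16 threads to one (dimension,
// spline-index) pair; 12 threads get work and 4 stay idle, instead of
// only 3-4 threads looping serially over the spline indices.
constexpr int c_order = 4; // PME interpolation order
constexpr int c_dim   = 3; // x, y, z

__device__ inline bool splineWorkForThread(int threadIndexInAtom, // 0..15
                                           int* dimIndex,         // out: 0..2
                                           int* splineIndex)      // out: 0..3
{
    if (threadIndexInAtom >= c_dim * c_order)
    {
        return false; // threads 12..15 have no spline work
    }
    *dimIndex    = threadIndexInAtom / c_order;
    *splineIndex = threadIndexInAtom % c_order;
    return true;
}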

Associated revisions

Revision 22118220 (diff)
Added by Jonathan Vincent about 1 month ago

Update PME CUDA spread/gather

Adds additional templated kernels to the CUDA spread and
gather kernels, allowing the use of 4 threads per atom instead of
16 and allowing the spline data to be recalculated in the spread
kernel instead of saved to global memory and reloaded.

The combinations mean we have 4 different kernels that can be called
depending on which is most appropriate for the problem size and
hardware (to be decided heuristically). By default the existing method
is used (16 threads per atom, saving and reloading of spline data).

Added an additional option to disable the preloading of charges and
coordinates into shared memory, so that each thread instead deals
with a single atom.

Removed the (currently disabled) PME_GPU_PARALLEL_SPLINE=1 code
path.

Refs #2792 #3185 #3186 #3187 #3188

Change-Id: Ia48d8eb63e38d0d23eefd755dcc228ff9b66d3e6
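As a rough sketch of the variant scheme this commit describes (all names here are made up for illustration, and the selection logic is a placeholder, since the commit leaves the heuristic to be decided): the 2×2 space of choices maps naturally onto two template parameters plus a host-side dispatch:

#include <cuda_runtime.h>

constexpr int c_threadsPerBlock = 128;

// Two compile-time choices give the four kernel variants: threads per
// atom (4 or 16), and whether spline data is recalculated in the spread
// kernel or saved to global memory by a prior kernel and reloaded.
template<int threadsPerAtom, bool recalculateSplines>
__global__ void pmeSpreadSketch(const float* __restrict__ charges,
                                float* __restrict__ grid,
                                int nAtoms)
{
    const int atom = (blockIdx.x * blockDim.x + threadIdx.x) / threadsPerAtom;
    if (atom >= nAtoms)
    {
        return;
    }
    if (recalculateSplines)
    {
        // ... recompute theta/dtheta here, skipping the global-memory round-trip ...
    }
    else
    {
        // ... load theta/dtheta precomputed by an earlier spline kernel ...
    }
    // ... spread charges[atom] onto its local grid region ...
    (void)charges;
    (void)grid;
}

// Host-side dispatch. A future heuristic would set the two flags from the
// problem size and hardware; per the commit, the default is 16 threads
// per atom with save/reload.
void launchSpreadSketch(const float* d_charges, float* d_grid, int nAtoms,
                        bool wide, bool recalculate)
{
    const int threadsPerAtom = wide ? 16 : 4;
    const int blocks = (nAtoms * threadsPerAtom + c_threadsPerBlock - 1) / c_threadsPerBlock;
    if (wide && recalculate)
    {
        pmeSpreadSketch<16, true><<<blocks, c_threadsPerBlock>>>(d_charges, d_grid, nAtoms);
    }
    else if (wide)
    {
        pmeSpreadSketch<16, false><<<blocks, c_threadsPerBlock>>>(d_charges, d_grid, nAtoms); // default
    }
    else if (recalculate)
    {
        pmeSpreadSketch<4, true><<<blocks, c_threadsPerBlock>>>(d_charges, d_grid, nAtoms);
    }
    else
    {
        pmeSpreadSketch<4, false><<<blocks, c_threadsPerBlock>>>(d_charges, d_grid, nAtoms);
    }
}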

History

#1 Updated by Szilárd Páll about 1 month ago

  • Subject changed from re-implement paralle spline calculation for #thrads/atoms > 4 to re-enalble paralle spline calculation for #thrads/atoms > 4

#2 Updated by Szilárd Páll about 1 month ago

  • Subject changed from re-enalble paralle spline calculation for #thrads/atoms > 4 to re-enalble paralle spline calculation for #threads/atoms > 4

#3 Updated by Jonathan Vincent about 1 month ago

  • Subject changed from re-enalble paralle spline calculation for #threads/atoms > 4 to re-enalble parallel spline calculation for #threads/atoms > 4

I had a better look at what PME_GPU_PARALLEL_SPLINE does; it does three things:

  • uses 12 threads instead of 3 for the final dtheta calculation (i.e. just for computing the final difference);
  • uses 12 threads instead of 3 for writing the theta values to global memory;
  • uses shared memory instead of a per-thread array for the theta values.

Implemented this at https://gerrit.gromacs.org/c/gromacs/+/14077
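For illustration, a compressed sketch of those three changes together; the names are hypothetical and the spline value is a dummy placeholder, not the actual code in the Gerrit change. theta is staged in a __shared__ array rather than a per-thread array, and both the final dtheta difference and the theta store to global memory are done by 12 threads per atom rather than 3:

#include <cuda_runtime.h>

constexpr int c_order         = 4; // PME interpolation order
constexpr int c_dim           = 3; // x, y, z
constexpr int c_atomsPerBlock = 8; // 8 atoms x 16 threads = 128 threads/block

__global__ void parallelSplineSketch(const float* __restrict__ fractCoords, // c_dim values per atom
                                     float* __restrict__ thetaGlobal,
                                     float* __restrict__ dthetaGlobal,
                                     int nAtoms)
{
    // (3) theta in shared memory instead of a per-thread array, so any of
    // an atom's 16 threads can fill or read any of its 12 (dim, k) slots.
    __shared__ float sm_theta[c_atomsPerBlock * c_dim * c_order];

    const int  atomInBlock  = threadIdx.x / 16;
    const int  threadInAtom = threadIdx.x % 16;
    const int  atom         = blockIdx.x * c_atomsPerBlock + atomInBlock;
    const bool active       = (atom < nAtoms) && (threadInAtom < c_dim * c_order);

    int dim = 0, k = 0, slot = 0;
    if (active)
    {
        dim  = threadInAtom / c_order;
        k    = threadInAtom % c_order;
        slot = (atomInBlock * c_dim + dim) * c_order + k;
        // Dummy value standing in for the order-4 B-spline evaluation.
        sm_theta[slot] = fractCoords[atom * c_dim + dim] + 0.25f * k;
    }
    __syncthreads(); // every atom's 12 theta slots are now populated

    if (active)
    {
        const int out = (atom * c_dim + dim) * c_order + k;
        // (1) final dtheta difference: 12 threads, one (dim, k) pair each
        // (the difference shown is schematic, not the real spline math).
        const float prev  = (k > 0) ? sm_theta[slot - 1] : 0.0f;
        dthetaGlobal[out] = sm_theta[slot] - prev;
        // (2) theta written to global memory by 12 threads in parallel.
        thetaGlobal[out] = sm_theta[slot];
    }
}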

For 16 threads per atom with save and reload, by far the largest effect at large sizes came from the shared-memory change.

I also then tried the shared-memory change with 4 threads per atom and recalculation, and it had limited effect.
