Task #2402: PME kernels general performance improvements
Parent task: #2464 (GPU performance goals overview)

Added by Aleksei Iupinov over 1 year ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
Difficulty: hard

Description

This is a TODO list of things to try to improve performance. Some ideas are largely orthogonal to each other; some are not.

Most of these concern the spline/spread kernel, which is the main offender: it is memory-bound on the global grid atomic writes for any reasonably sized input.
Gather is also affected wherever the spline parameters theta/dtheta are involved.
Solve is of less concern: it depends only on the (not so large) grid size and is mostly compute-bound.

- improve the horrible, non-obvious spline parameter indexing (the same applies to gather):
Since commit c00062d0 in master it is at least isolated in 2 inline functions, so it should be easier to change.
The spline parameter layout is described in ewald/pme.cuh.
It can likely be simplified, as long as the global writes to gm_theta/dtheta stay coalesced.
It is related to the non-contiguous thread layout used to compute the spline parameters (see the sketch below).
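
For illustration only, a minimal sketch of the kind of interleaved layout that keeps the gm_theta/dtheta writes coalesced; the function name and the layout are assumptions of this sketch, not the actual layout from ewald/pme.cuh:

    // Hypothetical interleaved spline-parameter layout (NOT the actual
    // layout from ewald/pme.cuh). Parameters of the atoms in a block are
    // interleaved so that, with the atom index as the fastest-varying
    // thread component, a warp writes a contiguous run of floats.
    static const int c_order = 4; // PME interpolation order

    __device__ __forceinline__ int splineParamIndex(int blockBase, int atom,
                                                    int dim, int k, int atomsPerBlock)
    {
        // (dim, k) selects one of DIM * order parameters; 'atom' varies fastest
        return blockBase + ((dim * c_order + k) * atomsPerBlock + atom);
    }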

- use more shuffling for spline data redistribution among threads - quite clear, and related to the idea above (see the sketch below).
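
As a hedged illustration of the shuffle idea (made-up names, not GROMACS code): with order == 4 and thread t of a warp working on atom t / 4, spline index t % 4, a per-atom value computed by the first lane of each group of 4 can be handed to its 3 siblings without a shared-memory round trip:

    // Sketch: broadcast a per-atom value to the 'order' threads that
    // consume it, using a warp shuffle instead of shared memory.
    __device__ float broadcastSplineValue(float myValue)
    {
        const int order   = 4;
        const int lane    = threadIdx.x & 31;        // lane within the warp
        const int srcLane = (lane / order) * order;  // first lane of this atom's group
        return __shfl_sync(0xffffffffu, myValue, srcLane);
    }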

- try moving the theta/dtheta global writes to the end of the kernel:
Currently these spline parameters are written out to global memory before the charges are spread on the grid.
Both theta and dtheta are used in the gather kernel; only theta is used in the spreading part, so it is kept in the shared memory of the combined spline/spread kernel.
It should be possible to move at least the global theta writes until after the spreading is done, to try to overlap some compute with the atomic spreading bottleneck (see the sketch below).
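
A minimal sketch of the proposed reordering (illustrative structure and names only, not the actual kernel):

    // theta stays in shared memory for the spreading; the global stores
    // needed by the gather kernel are issued last, so they can overlap
    // with the long-latency grid atomics.
    __global__ void splineAndSpreadSketch(float* gm_grid, float* gm_theta,
                                          const int* gm_gridIndices,
                                          const float* gm_charges, int nAtoms)
    {
        __shared__ float sm_theta[256]; // assumes blockDim.x <= 256
        const int  i     = blockIdx.x * blockDim.x + threadIdx.x;
        const bool valid = (i < nAtoms);

        // 1) compute the spline weight (placeholder for the real spline recursion)
        sm_theta[threadIdx.x] = valid ? gm_charges[i] * 0.25f : 0.0f;
        __syncthreads();

        if (valid)
        {
            // 2) spread: long-latency atomic update of the global grid
            atomicAdd(&gm_grid[gm_gridIndices[i]], sm_theta[threadIdx.x]);

            // 3) deferred: the theta store for the gather kernel is issued
            //    only now, overlapping with the atomics above
            gm_theta[i] = sm_theta[threadIdx.x];
        }
    }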

- consider various degrees of redundant spline parameter computation:
One example: the currently disabled PME_GPU_PARALLEL_SPLINE define, which determines whether 1 or order (== 4) threads compute the spline parameters.
Another example: dthetas are derived from thetas and then only used in gather - should they be computed and stored in the spline/spread kernel at all?
Extreme example, as suggested by Szilard: do not store splines in global memory at all, just call compute_splines() again before the gather kernel (see the sketch below).
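
For the "extreme" variant, the gather kernel would simply open with the same spline computation instead of a global load. As a sketch, the standard smooth-PME order-4 B-spline recursion from the fractional coordinate dr (generic textbook form with illustrative names, not the GROMACS implementation):

    // Recompute order-4 B-spline weights and derivatives instead of
    // loading gm_theta/gm_dtheta; dtheta falls out of the recursion as
    // differences of the order-3 weights.
    __device__ void computeSplines4(float dr, float theta[4], float dtheta[4])
    {
        // order-2 weights
        const float w2[2] = { 1.0f - dr, dr };
        // order-3 weights
        float w3[3];
        w3[2] = 0.5f * dr * w2[1];
        w3[1] = 0.5f * ((dr + 1.0f) * w2[0] + (2.0f - dr) * w2[1]);
        w3[0] = 0.5f * (1.0f - dr) * w2[0];
        // derivatives: dtheta_k = w3_{k-1} - w3_k
        dtheta[0] = -w3[0];
        dtheta[1] = w3[0] - w3[1];
        dtheta[2] = w3[1] - w3[2];
        dtheta[3] = w3[2];
        // order-4 weights
        const float div = 1.0f / 3.0f;
        theta[3] = div * dr * w3[2];
        theta[2] = div * ((dr + 1.0f) * w3[1] + (3.0f - dr) * w3[2]);
        theta[1] = div * ((dr + 2.0f) * w3[0] + (2.0f - dr) * w3[1]);
        theta[0] = div * (1.0f - dr) * w3[0];
    }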

- presort the particles approximately on the CPU to minimize atomic clashes (or, as a first test approach, just index them indirectly on the GPU); a sketch follows below.
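
A hedged host-side sketch (hypothetical names): compute an approximate grid-cell key per atom and sort the atom indices by it; the resulting permutation can be used as an indirection table in the spread kernel for the first test:

    #include <algorithm>
    #include <vector>

    // Approximate spatial presort: atoms are keyed by the coarse grid
    // cell they fall into, and only the index permutation is built.
    std::vector<int> presortAtomsByGridCell(const float* x, const float* y, const float* z,
                                            int nAtoms, float cellSize,
                                            int nCellsX, int nCellsY)
    {
        std::vector<int> perm(nAtoms);
        std::vector<int> key(nAtoms);
        for (int i = 0; i < nAtoms; i++)
        {
            perm[i] = i;
            const int cx = static_cast<int>(x[i] / cellSize);
            const int cy = static_cast<int>(y[i] / cellSize);
            const int cz = static_cast<int>(z[i] / cellSize);
            key[i] = (cz * nCellsY + cy) * nCellsX + cx;
        }
        std::sort(perm.begin(), perm.end(),
                  [&key](int a, int b) { return key[a] < key[b]; });
        return perm;
    }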

- implement a drastically different spread kernel architecture that works per grid line instead of per atom, providing the foundation for potentially using fused inline 1D/2D FFTs on the grid slabs.

- try 64-bit atomics (also relevant to OpenCL); see the sketch below.
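
One hedged reading of this idea (in line with the determinism note in comment #6 below): accumulate the grid in 64-bit fixed point via integer atomics, since integer addition is associative and the result does not depend on the order of the updates. The scale factor is an assumption of this sketch:

    // Deterministic accumulation with a 64-bit integer atomicAdd;
    // two's-complement wraparound makes the unsigned add correct for
    // signed fixed-point values.
    __device__ void atomicAddFixedPoint(long long* gm_cell, float value)
    {
        const float scale = 1e6f; // illustrative fixed-point scale
        const long long fixed = static_cast<long long>(value * scale);
        atomicAdd(reinterpret_cast<unsigned long long*>(gm_cell),
                  static_cast<unsigned long long>(fixed));
    }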

Small things:

- tables: try a paired load with a type cast (currently it is 2 fetches with the same index in compute_splines()); see the sketch below.
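
A hedged sketch of the paired load (assumes the two values fetched with the same index are adjacent or interleaved in memory, and that the pointer is 8-byte aligned):

    // One 64-bit float2 load instead of two 32-bit fetches with the
    // same index; hypothetical table layout, not the GROMACS one.
    __device__ float2 loadTablePair(const float* gm_table, int index)
    {
        return reinterpret_cast<const float2*>(gm_table)[index];
    }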


Related issues

Related to GROMACS - Task #2453: PME OpenCL porting effort (Resolved)

History

#1 Updated by Mark Abraham over 1 year ago

Spread is memory bound while reading from particles or writing to grid points? As I've noted before, there's no reason to expect that the fastest implementations of these operations will look similar in code. One is a scatter from particles to grid points, and the other is a gather from grid points to particles.

#2 Updated by Szilárd Páll over 1 year ago

Spread is memory bound because it stores spline coefficients to global memory in such a way that this becomes a major stall reason in the CUDA kernel. Otherwise it would in most cases be significantly less memory bound.

BTW, Aleksei, we have the separate spline kernel, right? So we can get an upper bound on how much the current spread can be improved, by calling the separate spline kernel before gather and removing the global store from spread. Also, I don't recall if we checked whether there was any regime where the separate spline kernel was any better than the combined spread + spline?

#3 Updated by Aleksei Iupinov over 1 year ago

  • Description updated (diff)
  • Assignee deleted (Aleksei Iupinov)

There is a separate kernel, yes - currently it is only called in the unit tests. And we didn't explore those options.
I rewrote the issue description a bit; hopefully it's a little clearer.

#4 Updated by Aleksei Iupinov over 1 year ago

  • Subject changed from PME kernels improvements to PME kernels genral performance improvements

#5 Updated by Aleksei Iupinov over 1 year ago

  • Subject changed from PME kernels genral performance improvements to PME kernels general performance improvements

#6 Updated by Szilárd Páll over 1 year ago

Aleksei Iupinov wrote:

- consider various degrees of redundant spline parameter computation:
One example: the currently disabled PME_GPU_PARALLEL_SPLINE define, which determines whether 1 or order (== 4) threads compute the spline parameters.
Another example: dthetas are derived from thetas and then only used in gather - should they be computed and stored in the spline/spread kernel at all?
Extreme example, as suggested by Szilard: do not store splines in global memory at all, just call compute_splines() again before the gather kernel.

Note that it may not actually be such an "extreme" optimization (especially given the cost of storing/loading them). (We've also checked and OpenMM seems to just recompute them too.)

- presort the particles approximately on CPU to minimize atomic clashes (or even index them indirectly on GPU, as a first test approach).

As discussed offline, let's do this experiment asap once the spread bottleneck is improved; I suggest just setting up scaling runs with/without sorting and checking where/if there is a crossover.

- try 64bit atomics (also relevant to OpenCL).

Not so much for performance; it's more related to i) a reproducible/deterministic mode, ii) OpenCL / GPUs that do not support float atomics (I suggest a separate issue for this topic).

#7 Updated by Aleksei Iupinov over 1 year ago

  • Parent task set to #2464

#8 Updated by Aleksei Iupinov over 1 year ago

  • Related to Task #2453: PME OpenCL porting effort added
