Task #2792

Improvement of PME gather and spread CUDA kernels

Added by Jonathan Vincent 8 months ago. Updated 13 days ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
Difficulty: uncategorized

Description

There are some performance issues with the CUDA gather and spread kernels:

  • The spread kernel is limited by global memory instruction throughput (i.e. the hardware units dealing with global memory are running at capacity)
  • There is a memory latency issue with the initial read of the charges and positions into shared memory in the spread kernel. This would become the limiting problem once the global memory issue is fixed.

The obvious solution to the first problem is to recalculate the spline coefficients and derivatives in the gather kernel rather than writing them to global memory in the spread kernel and reloading them in the gather kernel.
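
A rough illustration of the idea (hypothetical identifiers, not the actual GROMACS kernels; the spline formulas are the generic order-4 B-spline and may be indexed or normalised differently in the real code):

// Illustrative sketch only -- identifiers are made up, not the GROMACS ones.
// Order-4 (cubic) B-spline values and derivatives for a fractional offset
// dr in [0,1).
__device__ void calcBsplines(float dr, float theta[4], float dtheta[4])
{
    const float omr = 1.0f - dr;
    theta[0]  = omr * omr * omr / 6.0f;
    theta[1]  = (3.0f * dr * dr * dr - 6.0f * dr * dr + 4.0f) / 6.0f;
    theta[2]  = (-3.0f * dr * dr * dr + 3.0f * dr * dr + 3.0f * dr + 1.0f) / 6.0f;
    theta[3]  = dr * dr * dr / 6.0f;
    dtheta[0] = -0.5f * omr * omr;
    dtheta[1] = 0.5f * (3.0f * dr * dr - 4.0f * dr);
    dtheta[2] = 0.5f * (-3.0f * dr * dr + 2.0f * dr + 1.0f);
    dtheta[3] = 0.5f * dr * dr;
}

// Gather with one thread per atom for simplicity: instead of re-reading the
// theta/dtheta arrays that the spread kernel wrote to global memory, recompute
// them from the atom position, trading a few extra FLOPs for fewer global
// memory transactions.
__global__ void gatherRecomputeSplines(const float3* gm_fractCoords, // coords in grid units
                                       const float*  gm_charges,
                                       const float*  gm_grid,        // nx*ny*nz potential grid
                                       float3*       gm_forces,
                                       int nAtoms, int nx, int ny, int nz)
{
    const int atom = blockIdx.x * blockDim.x + threadIdx.x;
    if (atom >= nAtoms)
    {
        return;
    }
    const float3 f  = gm_fractCoords[atom];
    const int3   gi = make_int3((int)floorf(f.x), (int)floorf(f.y), (int)floorf(f.z));
    float theta[3][4], dtheta[3][4];
    calcBsplines(f.x - gi.x, theta[0], dtheta[0]); // recomputed here rather than
    calcBsplines(f.y - gi.y, theta[1], dtheta[1]); // loaded back from the arrays
    calcBsplines(f.z - gi.z, theta[2], dtheta[2]); // the spread kernel stored
    float3 force = make_float3(0.0f, 0.0f, 0.0f);
    for (int ox = 0; ox < 4; ox++)
    {
        for (int oy = 0; oy < 4; oy++)
        {
            for (int oz = 0; oz < 4; oz++)
            {
                const int   ix = (gi.x + ox) % nx;
                const int   iy = (gi.y + oy) % ny;
                const int   iz = (gi.z + oz) % nz;
                const float v  = gm_grid[(ix * ny + iy) * nz + iz];
                force.x += v * dtheta[0][ox] * theta[1][oy] * theta[2][oz];
                force.y += v * theta[0][ox] * dtheta[1][oy] * theta[2][oz];
                force.z += v * theta[0][ox] * theta[1][oy] * dtheta[2][oz];
            }
        }
    }
    // Conversion of the grid-space gradients to Cartesian forces is omitted here.
    const float q = gm_charges[atom];
    gm_forces[atom] = make_float3(-q * force.x, -q * force.y, -q * force.z);
}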

For the second problem, the atom charges and positions are loaded into shared memory because of the way the spline coefficients are calculated. The number of calculations per warp is DIM*atomsPerWarp*order (3*2*4 = 24 currently). This work is spread over either 6 threads (with PME_GPU_PARALLEL_SPLINE=0) or 24 threads (with PME_GPU_PARALLEL_SPLINE=1). The atoms are allocated to threads differently than in the charge spreading, which is why the charges and positions currently need to go through shared memory.
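
One possible decomposition consistent with those numbers (illustrative only, not the actual GROMACS indexing):

// Per warp: atomsPerWarp * DIM * order = 2 * 3 * 4 = 24 spline work items,
// mapped onto the warp's threads (hypothetical helper, made-up names).
__device__ void splineWorkIndices(int threadWarpIndex,
                                  int* atomIndex, int* dimIndex, int* orderIndex)
{
    constexpr int atomsPerWarp = 2;
    constexpr int dims         = 3; // DIM
    *atomIndex  = threadWarpIndex % atomsPerWarp;          // which of the 2 atoms
    *dimIndex   = (threadWarpIndex / atomsPerWarp) % dims;  // x, y or z
    *orderIndex = threadWarpIndex / (atomsPerWarp * dims);  // which of the 4 spline terms
    // The spreading step instead assigns threads to (atom, grid point) pairs, so a
    // given thread generally works on a different atom there -- which is why the
    // positions and charges are staged through shared memory today.
}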

If the thread ordering of the spline calculations were changed so that the same threads deal with one atom both when calculating the spline coefficients and when spreading the charges, the load into shared memory could be eliminated. This would potentially make going to higher orders more complicated, but it seems safe to assume that order=4 is all we will need for a reasonably long time.
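
A sketch of what that reordering could look like for the spread kernel (hypothetical kernel with one thread per atom and order 4 hard-coded, reusing the calcBsplines() helper from the gather sketch above):

// Each thread owns exactly one atom for both the spline calculation and the
// spreading, so the position and charge stay in registers and are never
// staged through shared memory.
__global__ void spreadOneThreadPerAtom(const float3* gm_fractCoords, // coords in grid units
                                       const float*  gm_charges,
                                       float*        gm_grid,        // nx*ny*nz charge grid
                                       int nAtoms, int nx, int ny, int nz)
{
    const int atom = blockIdx.x * blockDim.x + threadIdx.x;
    if (atom >= nAtoms)
    {
        return;
    }
    const float3 f  = gm_fractCoords[atom]; // loaded once, kept in registers
    const float  q  = gm_charges[atom];
    const int3   gi = make_int3((int)floorf(f.x), (int)floorf(f.y), (int)floorf(f.z));
    float theta[3][4], dtheta[3][4]; // dtheta is not needed for spreading
    calcBsplines(f.x - gi.x, theta[0], dtheta[0]);
    calcBsplines(f.y - gi.y, theta[1], dtheta[1]);
    calcBsplines(f.z - gi.z, theta[2], dtheta[2]);
    for (int ox = 0; ox < 4; ox++)
    {
        for (int oy = 0; oy < 4; oy++)
        {
            for (int oz = 0; oz < 4; oz++)
            {
                const int ix = (gi.x + ox) % nx;
                const int iy = (gi.y + oy) % ny;
                const int iz = (gi.z + oz) % nz;
                // different atoms can hit the same grid cell concurrently
                atomicAdd(&gm_grid[(ix * ny + iy) * nz + iz],
                          q * theta[0][ox] * theta[1][oy] * theta[2][oz]);
            }
        }
    }
}

Using several threads per atom would just mean splitting the 4x4x4 loop between them while keeping the same atom ownership, so the same scheme extends to the current thread counts.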

Using a larger spline order allows a smaller PME grid, so it would be most appropriate for larger systems, and at that point it should also be appropriate to run with fewer threads per atom. STMV with 1,000,000 atoms, for example, has around 75 waves on a V100 with order*order (16) threads per atom, so using 1 thread per atom instead (roughly 75/16 ≈ 5 waves) should be reasonable and would make using higher-order splines simpler. For smaller systems this is less useful because of tail effects.

In the short term, changing the code so that PME_GPU_PARALLEL_SPLINE=1 is enabled (which also requires two additional __syncwarp() calls after the shared memory operations under the Volta threading model) helps with the examples I have. It is not obvious to me why PME_GPU_PARALLEL_SPLINE=0 would help in any situation currently.

The spline work is gated by localCheck, which for PME_GPU_PARALLEL_SPLINE=0 requires orderIndex (= threadWarpIndex / (atomsPerWarp * DIM)) to be 0, so only 6 threads per warp are active with DIM=3 and atomsPerWarp=2. The alternative is 24 active threads per warp (with orderIndex 0-3). So the penalty for PME_GPU_PARALLEL_SPLINE=1 is that some redundant work is done and 24 threads are used in the warp rather than just 6, which seems unimportant. The advantage is that all 24 threads write to global memory at the same time, which improves coalescing compared with 6 threads each writing 4 times.

This seems like a simple change; in my runs on Cellulose on a V100 it decreased the spread kernel time from 470 µs to 400 µs (using a separate PME GPU).
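
A sketch of the gating difference (illustrative only; in the real kernel the spline terms are computed where the placeholder store is, and the exact shared-memory layout differs):

#define PME_GPU_PARALLEL_SPLINE 1 // set in the real kernel source; shown here for the sketch

__device__ void splineGatingSketch(float* sm_splineScratch, int threadWarpIndex)
{
    constexpr int atomsPerWarp = 2;
    constexpr int dims         = 3; // DIM
    constexpr int order        = 4;
    const int orderIndex = threadWarpIndex / (atomsPerWarp * dims);
#if PME_GPU_PARALLEL_SPLINE
    // 24 of 32 lanes active, each handling one spline term; the subsequent
    // global store is issued by 24 lanes at once and coalesces well.
    const bool localCheck = (orderIndex < order);
#else
    // Only 6 of 32 lanes active (orderIndex == 0), each storing 4 values.
    const bool localCheck = (orderIndex == 0);
#endif
    if (localCheck)
    {
        sm_splineScratch[threadWarpIndex] = 0.0f; // placeholder for the real spline work
    }
    // With Volta's independent thread scheduling the shared-memory writes above
    // have to be made visible to the rest of the warp explicitly; a second
    // __syncwarp() is needed after the later shared-memory stage as well.
    __syncwarp();
}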

History

#1 Updated by Jonathan Vincent 7 months ago

Optimisation at https://gerrit.gromacs.org/8908. This is the larger change: it removes the saving of the coefficients in the spline-and-spread kernel (recalculating them in the gather kernel), along with other changes.

Kernel timings using 4 GV100 GPUs:
nvprof gmx mdrun -ntmpi 4 -ntomp 10 -pme gpu -nb gpu -pin on -nsteps 1000 -v -npme 1 -notunepme

Note: Villin was only run on 2 GPUs (one PP, one PME) as there was no valid DD with 3 PP ranks.

                           villin       rnase Dodec   ADH Dodec    Cellulose    STMV
2018.4    Spline+spread    8.8260us     22.638us      111.22us     401.18us     1260.2us
          Gather           3.7310us     7.1570us      35.408us     132.75us     364.36us
Modified  Spline+spread    7.4820us     19.077us      91.102us     317.24us     960.90us
          Gather           6.0960us     8.6500us      41.488us     150.82us     316.12us

The spline and spread kernel is limited by global memory instruction throughput. The gather kernel is limited by memory latency.

Generally the improvement is larger for larger problems. The Villin gather time is likely hurt by using 4 threads per atom rather than 16, which would create a tail effect.

#2 Updated by Jonathan Vincent 7 months ago

Added some timings for turning on PME_GPU_PARALLEL_SPLINE=1, as is done in https://gerrit.gromacs.org/#/c/8921/

This is a draft change. For Villin it is better than the above; for rnase and ADH the total time of the two kernels is approximately the same; for Cellulose and STMV the more complicated change is better.

                           villin       rnase Dodec   ADH Dodec    Cellulose    STMV
2018.4    Spline+spread    8.8260us     22.638us      111.22us     401.18us     1260.2us
          Gather           3.7310us     7.1570us      35.408us     132.75us     364.36us
Modified  Spline+spread    7.7690us     20.076us      96.995us     349.99us     1087.7us
          Gather           3.8780us     7.2840us      35.782us     133.10us     363.77us

#3 Updated by Jonathan Vincent 26 days ago

So there are some issues with the PME unit tests with the patch at https://gerrit.gromacs.org/8908.

With the changed method we break some of the assumptions in the unit tests, e.g. the gather tests assume that the grid-line indices and spline coefficients will be read from memory rather than recalculated from the atom positions and the grid. Because the test data is not internally consistent, we do not reproduce the random splines and derivatives, and so the answers for the tests change.

There are potentially also problems with the other tests, in that what is tested for is no longer output, so we will need a (presumably templated) path that does output the grid-line indices, splines etc. to global memory so that they can be checked by the unit tests.
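
Something along these lines, perhaps (hypothetical names, just to illustrate the templated flavour; the real kernel arguments and stores would differ):

// The production flavour (false) skips the global stores entirely; the unit-test
// flavour (true) writes the spline data out so the tests can still inspect it.
template<bool c_writeSplinesToGlobal>
__global__ void splineAndSpreadKernel(const float3* gm_fractCoords,
                                      float*        gm_theta,           // used only by tests
                                      float*        gm_dtheta,          // used only by tests
                                      int*          gm_gridlineIndices, // used only by tests
                                      int           nAtoms)
{
    const int atom = blockIdx.x * blockDim.x + threadIdx.x;
    if (atom >= nAtoms)
    {
        return;
    }
    // ... compute grid-line indices and spline parameters for this atom and
    //     spread the charge as in the optimised kernel ...
    if (c_writeSplinesToGlobal)
    {
        // Extra global stores compiled in only for the test flavour, e.g.
        // gm_theta[...] = ...; gm_dtheta[...] = ...; gm_gridlineIndices[...] = ...;
    }
}

// Production: splineAndSpreadKernel<false><<<blocks, threads>>>(...);
// Unit tests: splineAndSpreadKernel<true><<<blocks, threads>>>(...);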

The simplest change for the gather tests would be to use the scatter values with computed splines, as they would then be internally consistent and valid.

#4 Updated by Jonathan Vincent 19 days ago

One other question: is there a Redmine issue for the reordering changes Berk was looking at?

It would be good to look at how they combine with the above.

We also need to find where the performance crossover between Villin at 5k atoms and rnase at 20k atoms is, and how the reordering changes affect that. For which other hardware is this also important?

#5 Updated by Szilárd Páll 13 days ago

Jonathan Vincent wrote:

With the changed method we break some of the assumptions in the unit tests, e.g. the gather tests assume that the grid-line indices and spline coefficients will be read from memory rather than recalculated from the atom positions and the grid. Because the test data is not internally consistent, we do not reproduce the random splines and derivatives, and so the answers for the tests change.

I suggest updating the unit tests to make the data internally consistent. Note that the OpenCL code path also runs the same tests, so either change those kernels too or make sure that the tests work both with computed-and-stored splines and with recomputed splines.

There are potentially also problems with the other tests, in that what is tested for is no longer output, so we will need a (presumably templated) path that does output the grid-line indices, splines etc. to global memory so that they can be checked by the unit tests.

Sure, I suggest going ahead with templating and adding (back) a conditional store of the spline parameters, calling that flavor of the kernel in the tests.

#6 Updated by Szilárd Páll 13 days ago

Jonathan Vincent wrote:

One other question: is there a Redmine issue for the reordering changes Berk was looking at?

It would be good to look at how they combine with the above.

There is no new change (Berk was looking into enabling the DD sorting without DD); it's the same sorting that we've talked about in the past. There was no Redmine issue, so I filed one: #3031; should I assign it to you?

We also need to find where the performance crossover between Villin at 5k atoms and rnase at 20k atoms is, and how the reordering changes affect that. For which other hardware is this also important?

Same as before + all new hardware.
