Task #2792

Improvement of PME gather and spread CUDA kernels

Added by Jonathan Vincent 4 months ago. Updated 3 months ago.

Target version:


There are some performance issues with the CUDA gather and spread kernels:

  • The spread kernel is limited by global memory instruction throughput (i.e. the hardware units dealing with global memory are running at capacity)
  • There is a memory latency issue for the initial read of the charges and positions into shared memory in the spread kernel. This would become the next bottleneck once the global memory throughput problem is fixed.

The obvious solution to the first problem is to recalculate the spline coefficients and derivatives in the gather kernel rather than writing them into global memory and reloading them.

For the second problem, the atom charges and positions are loaded into shared memory because of the way the spline coefficients are calculated. The number of calculations per warp is DIM*atomsPerWarp*order (3*2*4=24 currently). These are spread over either 6 threads (with PME_GPU_PARALLEL_SPLINE=0) or 24 threads (with PME_GPU_PARALLEL_SPLINE=1). The atoms are allocated to threads differently here than when the charges are spread, which is why the charges and positions currently need to be in shared memory.

If the thread ordering of the spline calculations is changed so that the same threads deal with just one atom, both when calculating the spline coefficients and when spreading the charges, the load to shared memory could be eliminated. This would potentially make going to higher orders more complicated, but it seems safe to assume that order=4 is all we need for a reasonably long time.

Using a larger spline order allows for a smaller PME grid size, so it would be most appropriate for larger systems, and at that point it should be appropriate to run with fewer threads per atom. For example, STMV with 1,000,000 atoms has around 75 waves on a V100 with order*order (16) threads per atom, so using 1 thread per atom and 5 waves instead should be reasonable, and would make using higher order splines simpler. For smaller systems this is less useful because of tail effects.
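The wave estimate can be reproduced from the quoted figures alone. This is a minimal sketch in Python, assuming the number of threads resident on the GPU per wave stays the same when the threads per atom change (in practice occupancy may differ); the variable names are illustrative and not taken from GROMACS.

```python
import math

n_atoms = 1_000_000          # STMV size as quoted above
threads_per_atom_now = 16    # order * order with order = 4
waves_now = 75               # approximate figure quoted above

# Threads per wave implied by the quoted numbers (an inferred value,
# not a hardware specification).
implied_threads_per_wave = n_atoms * threads_per_atom_now / waves_now

# Assumption: the same number of threads fit in one wave at 1 thread per atom.
threads_per_atom_new = 1
waves_new = n_atoms * threads_per_atom_new / implied_threads_per_wave

print(f"estimated waves with 1 thread per atom: {math.ceil(waves_new)}")
```

Under that assumption the wave count simply scales with the threads per atom, so 75 waves at 16 threads per atom becomes 75/16, which rounds up to 5.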

In the short term, changing the code so that PME_GPU_PARALLEL_SPLINE=1 is turned on helps with the examples I have. This also requires two additional syncwarps after the shared memory operations with the Volta threading model. It was not obvious to me why PME_GPU_PARALLEL_SPLINE=0 helps in any situation currently.

The spline work is gated by localCheck, which for PME_GPU_PARALLEL_SPLINE=0 requires orderIndex (= threadWarpIndex / (atomsPerWarp * DIM)) to be 0, so only 6 threads per warp are active with DIM=3 and atomsPerWarp=2. The alternative is 24 active threads in the warp (with orderIndex 0-3). So the penalty for PME_GPU_PARALLEL_SPLINE=1 is that you do some redundant work and use 24 threads in the warp rather than just 6, which seems unimportant. The advantage is that all 24 threads write to global memory at the same time, which improves coalescing compared to 6 threads writing 4 times each. This simple change decreased the spread kernel time for my runs of Cellulose on a V100 from 470 us to 400 us (using a separate PME GPU).
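The thread counts above follow from the orderIndex formula. The sketch below, written in Python rather than CUDA so it can be run on the host, enumerates which threads of a 32-wide warp would pass the gate. It assumes the PME_GPU_PARALLEL_SPLINE=1 condition is simply orderIndex < order, as implied by "orderIndex 0-3"; the real localCheck may include further conditions (for example bounds on the atom index), and the function name is hypothetical.

```python
WARP_SIZE = 32
DIM = 3
ATOMS_PER_WARP = 2
ORDER = 4


def active_spline_threads(parallel_spline):
    """Return the warp-local thread indices that do spline work."""
    active = []
    for thread_warp_index in range(WARP_SIZE):
        # Integer division, as in the formula quoted above.
        order_index = thread_warp_index // (ATOMS_PER_WARP * DIM)
        if parallel_spline:
            # Assumed gate: one group of 6 threads per order index 0..3.
            local_check = order_index < ORDER
        else:
            # Only the first group of 6 threads works; each loops over the order.
            local_check = order_index == 0
        if local_check:
            active.append(thread_warp_index)
    return active


print(len(active_spline_threads(False)))  # threads active with PME_GPU_PARALLEL_SPLINE=0
print(len(active_spline_threads(True)))   # threads active with PME_GPU_PARALLEL_SPLINE=1
```

With these constants the first call selects threads 0-5 and the second selects threads 0-23, matching the 6 and 24 active threads discussed above; threads 24-31 are idle in both cases.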


#1 Updated by Jonathan Vincent 3 months ago

Optimisation at

This is the larger change: it removes the saving of the coefficients in the spline and spread kernel (recalculating them in gather), along with other changes.

Kernel timings using 4 GV100 GPUs:
nvprof gmx mdrun -ntmpi 4 -ntomp 10 -pme gpu -nb gpu -pin on -nsteps 1000 -v -npme 1 -notunepme

Note: Villin was run on only 2 GPUs (one PP, one PME) because there was no valid domain decomposition (DD) with 3 PP ranks.

                             villin       rnase Dodec  ADH Dodec    Cellulose    STMV
2018.4    Spline and spread  8.8260us     22.638us     111.22us     401.18us     1260.2us
          Gather             3.7310us     7.1570us     35.408us     132.75us     364.36us
Modified  Spline and spread  7.4820us     19.077us     91.102us     317.24us     960.90us
          Gather             6.0960us     8.6500us     41.488us     150.82us     316.12us

The spline and spread kernel is limited by global memory instruction throughput. The gather kernel is limited by memory latency.

Generally, the improvement is larger for larger problems. The Villin gather time is likely hurt by using 4 threads per atom rather than 16, which would create a tail effect.
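The size trend can be checked from the spline and spread rows of the table. A minimal sketch in Python, using only the timings listed above (variable names are illustrative):

```python
# Spline and spread kernel times in microseconds, copied from the table,
# ordered from the smallest system (villin) to the largest (STMV).
systems = ["villin", "rnase Dodec", "ADH Dodec", "Cellulose", "STMV"]
spread_2018_4 = [8.8260, 22.638, 111.22, 401.18, 1260.2]
spread_modified = [7.4820, 19.077, 91.102, 317.24, 960.90]

# Relative reduction in kernel time for each system.
reductions = [
    (old - new) / old for old, new in zip(spread_2018_4, spread_modified)
]

for name, reduction in zip(systems, reductions):
    print(f"{name:12s} {100 * reduction:5.1f}% less time")
```

The reduction rises steadily with system size, from roughly 15% for villin to roughly 24% for STMV.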

#2 Updated by Jonathan Vincent 3 months ago

OK, I added some timings for turning on PME_GPU_PARALLEL_SPLINE=1, as is done in

This is a draft change. For Villin this is better than the above. For rnase and ADH the total time for both kernels is approximately the same. Cellulose and STMV are better with the more complicated change.

                             villin       rnase Dodec  ADH Dodec    Cellulose    STMV
2018.4    Spline and spread  8.8260us     22.638us     111.22us     401.18us     1260.2us
          Gather             3.7310us     7.1570us     35.408us     132.75us     364.36us
Modified  Spline and spread  7.7690us     20.076us     96.995us     349.99us     1087.7us
          Gather             3.8780us     7.2840us     35.782us     133.10us     363.77us
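To compare the two changes on total time for both kernels, the rows of the two tables can be summed per system. This is a minimal sketch in Python using only the timings reported in this task; the labels "recalculate" (the larger change from update #1) and "parallel_spline" (PME_GPU_PARALLEL_SPLINE=1) are illustrative names.

```python
systems = ["villin", "rnase Dodec", "ADH Dodec", "Cellulose", "STMV"]

# (spline and spread, gather) kernel times in microseconds per variant.
timings = {
    "2018.4": ([8.8260, 22.638, 111.22, 401.18, 1260.2],
               [3.7310, 7.1570, 35.408, 132.75, 364.36]),
    "recalculate": ([7.4820, 19.077, 91.102, 317.24, 960.90],
                    [6.0960, 8.6500, 41.488, 150.82, 316.12]),
    "parallel_spline": ([7.7690, 20.076, 96.995, 349.99, 1087.7],
                        [3.8780, 7.2840, 35.782, 133.10, 363.77]),
}

# Total time of both kernels for each variant and system.
totals = {
    variant: dict(zip(systems, (s + g for s, g in zip(spread, gather))))
    for variant, (spread, gather) in timings.items()
}

for name in systems:
    row = "  ".join(f"{v}={totals[v][name]:9.3f}us" for v in timings)
    print(f"{name:12s} {row}")
```

On these numbers the totals for villin favour the parallel spline change, rnase and ADH differ by under 2% between the two changes, and Cellulose and STMV favour the larger change.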
