Improvement of PME gather and spread CUDA kernels
There are some performance issues with the CUDA gather and spread kernels:
- The spread kernel is limited by global memory instruction throughput (i.e. the hardware units dealing with global memory are running at capacity)
- There is a memory latency issue with the initial read of the charges and positions into shared memory in the spread kernel. This would become the dominant problem once the global memory throughput issue is fixed.
The obvious solution to the first problem is to recalculate the spline coefficients and derivatives in the gather kernel, rather than writing them to global memory in the spread kernel and reloading them in gather.
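To make that concrete, here is a rough sketch of the idea (not the code in the Gerrit change): for order=4 the spline values and derivatives are simple closed-form cubic B-spline polynomials of the fractional grid offset dr, so the gather kernel can recompute them from the atom position it already reads instead of loading theta/dtheta back from global memory. The function name and the ordering of the weights below are illustrative and may not match the GROMACS convention.

/* Recompute order-4 spline values and derivatives for one dimension from the
 * fractional grid offset dr in [0,1). Illustrative sketch only. */
__device__ void recalcSplineOrder4(float dr, float theta[4], float dtheta[4])
{
    const float omr = 1.0f - dr;
    theta[0]  = omr * omr * omr / 6.0f;
    theta[1]  = (3.0f * dr * dr * dr - 6.0f * dr * dr + 4.0f) / 6.0f;
    theta[2]  = (-3.0f * dr * dr * dr + 3.0f * dr * dr + 3.0f * dr + 1.0f) / 6.0f;
    theta[3]  = dr * dr * dr / 6.0f;
    dtheta[0] = -0.5f * omr * omr;
    dtheta[1] = 0.5f * (3.0f * dr * dr - 4.0f * dr);
    dtheta[2] = 0.5f * (-3.0f * dr * dr + 2.0f * dr + 1.0f);
    dtheta[3] = 0.5f * dr * dr;
}

This trades a handful of multiply-adds per atom per dimension for the removal of the theta/dtheta global loads, which is the right trade when the kernel is bound by global memory instruction throughput.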
For the second problem, the atom charges and positions are loaded into shared memory because of the way the spline coefficients are calculated. The number of calculations is DIM*atomsPerWarp*Order (3*2*4 = 24 currently), which is spread over either 6 threads (with PME_GPU_PARALLEL_SPLINE=0) or 24 threads (with PME_GPU_PARALLEL_SPLINE=1). The atoms are assigned to threads differently than when spreading the charges, which is why the charges and positions currently need to be staged in shared memory.
If the thread ordering of the spline calculation is changed so that the same threads handle a given atom both when calculating the spline coefficients and when spreading the charges, the load to shared memory could be eliminated. This would potentially make going to higher orders more complicated, but assuming that order=4 is all we need for a reasonably long time seems safe.
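A structural sketch of that reordering (placeholder names and indexing, not the actual kernel): each thread keeps the same atom for the whole kernel, so the coordinate and charge stay in registers from the initial load through the spread, with no shared-memory staging.

/* Declared in the sketch above. */
__device__ void recalcSplineOrder4(float dr, float theta[4], float dtheta[4]);

__global__ void splineAndSpreadSketch(const float4* gm_xq, /* xyz + charge per atom */
                                      float*        gm_grid)
{
    /* One consistent atom per thread for both phases of the kernel. */
    const int atomIndex = blockIdx.x * blockDim.x + threadIdx.x;

    /* Single global load; position and charge then stay in registers. */
    const float4 xq = gm_xq[atomIndex];

    float theta[3][4];
    float dtheta[3][4]; /* only needed later, by gather/forces */
    for (int d = 0; d < 3; d++)
    {
        /* The real fractional offset per dimension comes from xq and the grid;
         * 0.5f is a placeholder. */
        recalcSplineOrder4(0.5f, theta[d], dtheta[d]);
    }

    /* Spread the charge of the *same* atom; xq.w is still in a register.
     * Grid index computation elided, this only shows the structure. */
    for (int ix = 0; ix < 4; ix++)
    {
        for (int iy = 0; iy < 4; iy++)
        {
            for (int iz = 0; iz < 4; iz++)
            {
                const int gridIndex = 0; /* would be derived from the atom's cell and (ix, iy, iz) */
                atomicAdd(&gm_grid[gridIndex], xq.w * theta[0][ix] * theta[1][iy] * theta[2][iz]);
            }
        }
    }
}

The cost is that each atom's spread work is serialised over order*order*order grid points in one thread rather than split across order*order threads, which is why this mapping fits best with the fewer-threads-per-atom direction discussed below.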
Using a larger spline order allows a smaller PME grid, so it would be most appropriate for larger systems, and for those it should also be appropriate to run with fewer threads per atom. STMV with 1,000,000 atoms, for example, runs with around 75 waves on a V100 with order*order (16) threads per atom, so using 1 thread per atom and around 5 waves instead should be reasonable, and would make using higher-order splines simpler. For smaller systems this is less useful because of tail effects.
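A quick back-of-envelope way to check wave counts like those (host-side sketch; the kernel, block size, and threads-per-atom values here are assumptions, and the real figure depends on the achieved occupancy of the actual spread kernel):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummySpreadKernel() {} /* stand-in for the real kernel */

int main()
{
    const int  blockSize      = 128;     /* assumed threads per block */
    const long atoms          = 1000000; /* roughly STMV-sized */
    const int  threadsPerAtom = 16;      /* order*order; set to 1 for the alternative mapping */

    int device = 0, numSMs = 0, blocksPerSM = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummySpreadKernel, blockSize, 0);

    const double residentThreads = (double)numSMs * blocksPerSM * blockSize;
    printf("~%.0f waves\n", (double)atoms * threadsPerAtom / residentThreads);
    return 0;
}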
In the short term, changing the code so that PME_GPU_PARALLEL_SPLINE=1 is turned on (which also requires two additional __syncwarp() calls after the shared memory operations with the Volta threading model) helps with the examples I have. It was not obvious to me why PME_GPU_PARALLEL_SPLINE=0 would currently be the better choice in any situation. The spline work is gated by localCheck, which for PME_GPU_PARALLEL_SPLINE=0 requires orderIndex (= threadWarpIndex / (atomsPerWarp * DIM)) to be 0, so only 6 threads per warp are active with DIM=3 and atomsPerWarp=2. The alternative is 24 active threads in the warp (orderIndex 0-3). So the penalty for PME_GPU_PARALLEL_SPLINE=1 is that some redundant work is done and 24 threads in the warp are used rather than just 6, which seems unimportant. The advantage is that all 24 threads write to global memory at the same time, which improves coalescing compared to 6 threads each writing 4 times. This seems like a simple change; for my runs on Cellulose on a V100 it decreased the spread kernel time from 470 µs to 400 µs (using a separate PME GPU).
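The difference in store patterns, roughly (illustrative code only, simplified from the behaviour described above and not the GROMACS source; the order-major global layout is a placeholder chosen to make the coalescing difference visible):

#define DIM            3
#define ORDER          4
#define ATOMS_PER_WARP 2
#define PME_GPU_PARALLEL_SPLINE 1

__global__ void splineStoreSketch(float* gm_theta)
{
    extern __shared__ float sm_theta[]; /* filled earlier in the real kernel by the spline work */

    const int threadWarpIndex = threadIdx.x & 31;
    const int orderIndex      = threadWarpIndex / (ATOMS_PER_WARP * DIM); /* 0..5 within a warp */
    const int dimAtomIndex    = threadWarpIndex % (ATOMS_PER_WARP * DIM); /* which (atom, dim) pair */
    const int warpOffset      = (blockIdx.x * (blockDim.x >> 5) + (threadIdx.x >> 5))
                                * DIM * ATOMS_PER_WARP * ORDER;

#if PME_GPU_PARALLEL_SPLINE == 0
    /* 6 of 32 lanes active; each lane does ORDER separate, narrow stores. */
    if (orderIndex == 0)
    {
        for (int k = 0; k < ORDER; k++)
        {
            const int index = k * ATOMS_PER_WARP * DIM + dimAtomIndex;
            gm_theta[warpOffset + index] = sm_theta[index];
        }
    }
#else
    /* 24 of 32 lanes active; consecutive lanes hit consecutive addresses, one wide store. */
    if (orderIndex < ORDER)
    {
        const int index = orderIndex * ATOMS_PER_WARP * DIM + dimAtomIndex;
        gm_theta[warpOffset + index] = sm_theta[index];
    }
    /* With Volta's independent thread scheduling a __syncwarp() is needed here before
     * the shared-memory values are re-read or overwritten by other lanes. */
    __syncwarp();
#endif
}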
#1 Updated by Jonathan Vincent 14 days ago
Optimisation at https://gerrit.gromacs.org/8908. This is the larger change: it removes the saving of the coefficients in the spline-and-spread kernel (recalculating them in gather), along with other changes.
Kernel timings using 4 GV100 GPUs.
nvprof gmx mdrun -ntmpi 4 -ntomp 10 -pme gpu -nb gpu -pin on -nsteps 1000 -v -npme 1 -notunepme
Note: Villin was only run on 2 GPUs (one PP, one PME) as there was no valid DD with 3 PP ranks.
                              villin      rnase Dodec  ADH Dodec   Cellulose   STMV
2018.4    Spline and spread   8.8260us    22.638us     111.22us    401.18us    1260.2us
          Gather              3.7310us    7.1570us     35.408us    132.75us    364.36us
Modified  Spline and spread   7.4820us    19.077us     91.102us    317.24us    960.90us
          Gather              6.0960us    8.6500us     41.488us    150.82us    316.12us
The spline and spread kernel is limited by global memory instruction throughput. The gather kernel is limited by memory latency.
Generally the improvement is larger for larger problems. The Villin gather time is likely hurt by using 4 threads per atom rather than 16, which creates a tail effect.
#2 Updated by Jonathan Vincent 12 days ago
OK, I added some timings for turning on PME_GPU_PARALLEL_SPLINE=1, as is done in https://gerrit.gromacs.org/#/c/8921/
This is a draft change. For Villin this is better than the above; for rnase and ADH it is approximately the same in total time over both kernels; Cellulose and STMV are better with the more complicated change.
                              villin      rnase Dodec  ADH Dodec   Cellulose   STMV
2018.4    Spline and spread   8.8260us    22.638us     111.22us    401.18us    1260.2us
          Gather              3.7310us    7.1570us     35.408us    132.75us    364.36us
Modified  Spline and spread   7.7690us    20.076us     96.995us    349.99us    1087.7us
          Gather              3.8780us    7.2840us     35.782us    133.10us    363.77us