Feature #3355

New PME parallel (GPU) scheme

Added by Berk Hess 12 months ago. Updated 11 months ago.

core library

The current DD code has separate PP and PME domain decompositions.

A scheme that could give better performance, especially on GPUs, and is also simpler:
- send all local coordinates to the PME rank, or pass them on a combined PP/PME rank (GPU)
- spread local coordinates (GPU)
- do 1D-FFT along Z (GPU, could also be CPU)
- reduce grid overlap (CPU, maybe also GPU)
- transpose + 2D-FFT (CPU, maybe also GPU)
- solve (CPU, maybe also GPU)
- all operations above in reverse

The grid overlap is relatively small, it's pme_order-1 plus a few grid lines for particle diffusion over nstlist steps. We could precompute a conservative estimate of this or compute it while/after spreading.
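As a rough illustration of what such a conservative estimate could look like (all numbers below are assumptions for illustration, not GROMACS defaults):

```python
import math

# Illustrative assumptions, not GROMACS defaults.
pme_order = 4            # B-spline interpolation order
grid_spacing_nm = 0.12   # typical PME grid spacing
max_drift_nm = 0.02      # conservative bound on particle drift over nstlist steps

# pme_order-1 lines from spreading, plus a few lines to cover diffusion
overlap_lines = (pme_order - 1) + math.ceil(max_drift_nm / grid_spacing_nm)
print(overlap_lines)  # 4
```

With these numbers the overlap is only a handful of grid lines, which is why the overlap reduction is cheap relative to the full-grid communication.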

Note that communicating the grid usually has a higher volume than communicating the particles+charges.
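A back-of-the-envelope check of that claim, using rough assumed numbers (liquid-water atom density and a common PME grid spacing, neither taken from the thread):

```python
# Illustrative assumptions: ~100 atoms/nm^3 (liquid water), 0.12 nm PME grid.
atoms_per_nm3 = 100
floats_per_atom = 4      # x, y, z, charge
grid_spacing_nm = 0.12

grid_values_per_nm3 = (1.0 / grid_spacing_nm) ** 3         # ~579 grid values
particle_floats_per_nm3 = atoms_per_nm3 * floats_per_atom  # 400 floats

print(grid_values_per_nm3 > particle_floats_per_nm3)  # True
```

So even with one float per grid point versus four per particle, the grid transfer is the larger one at these densities.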


#1 Updated by Berk Hess 12 months ago

  • Subject changed from New PME parallel (GPU) schemes to New PME parallel (GPU) scheme

#2 Updated by Szilárd Páll 12 months ago

Berk Hess wrote:

- do 1D-FFT along Z (GPU, could also be CPU)
- reduce grid overlap (CPU, maybe also GPU)
- transpose + 2D-FFT (CPU, maybe also GPU)

Related but independent tasks:
- implementing and testing a new mixed mode with only FFT on the CPU
- investigating spread fused with on-device FFT (possibly with some kind of grid decomposition)

If these sound useful, I can file redmine tasks -- at least for the former?

#3 Updated by Jonathan Vincent 11 months ago

Ok, I have an initial version of this, for a 1D decomposition of the FFT grid.

I think I am missing something from the above. Right now I am doing it as follows:

  • Reduce grid overlap (sending the order-1 values from remote)
  • 2D FFT in Y and Z
  • Transpose to YZX (distributed over Y)
  • 1D FFT in X
  • Solve
  • 1D inverse FFT in X
  • Transpose to XYZ (distributed over X)
  • 2D inverse FFT in Y and Z
  • Bring in halo data for gather (the order-1 halo data from remote)

Right now, for simplicity, I am just looking at a 1D decomposition of the FFT grid, and assuming everything divides evenly in both Y and X, so we have the same grid sizes on all ranks at all times. There is some existing code on the CPU side that deals with the uneven case, which I wanted to integrate with; that seemed better than writing a lot of new code to handle it.
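The ordering in the list above relies on the 3D FFT being separable: 1D transforms can be applied along the three axes in any order, so a 2D FFT in Y and Z followed (after a transpose) by a 1D FFT in X gives the full 3D transform. A minimal check of that property with a naive DFT (pure Python, illustration only, not the actual GROMACS code path):

```python
import cmath
import random

def dft(xs):
    """Naive 1D discrete Fourier transform."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, x in enumerate(xs)) for k in range(n)]

def dft_along(grid, axis):
    """Apply the 1D DFT along one axis of a grid[x][y][z] nested list."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0j] * nz for _ in range(ny)] for _ in range(nx)]
    if axis == 0:
        for y in range(ny):
            for z in range(nz):
                line = dft([grid[x][y][z] for x in range(nx)])
                for x in range(nx):
                    out[x][y][z] = line[x]
    elif axis == 1:
        for x in range(nx):
            for z in range(nz):
                line = dft([grid[x][y][z] for y in range(ny)])
                for y in range(ny):
                    out[x][y][z] = line[y]
    else:
        for x in range(nx):
            for y in range(ny):
                line = dft(grid[x][y])
                for z in range(nz):
                    out[x][y][z] = line[z]
    return out

random.seed(0)
n = 4
g = [[[random.random() for _ in range(n)] for _ in range(n)] for _ in range(n)]

# "2D FFT in Y and Z" first, then "1D FFT in X" ...
a = dft_along(dft_along(dft_along(g, 2), 1), 0)
# ... gives the same 3D transform as the opposite axis order.
b = dft_along(dft_along(dft_along(g, 0), 1), 2)

err = max(abs(a[x][y][z] - b[x][y][z])
          for x in range(n) for y in range(n) for z in range(n))
print(err < 1e-9)  # True
```

The same separability is what makes the transpose between the YZ and X stages legal: each rank only ever needs complete lines along the axes it is currently transforming.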

I was confused about how you could do an FFT in Z before you get the remote data. For example, if you have a 20x20x20 grid split into 10x20x20, then the FFT over x_0,y_0,z_0->z_19 will depend on the data in x_11,y_0,z_0->z_19 on the remote rank, but I could be missing something.
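One possible reading of the original ordering (my assumption, not confirmed in the thread): the FFT is linear, so each Z column can be transformed before the overlap reduction, and the remote contributions to that column, transformed the same way, can be summed afterwards with an identical result. A pure-Python sketch with a naive DFT:

```python
import cmath
import random

def dft(xs):
    """Naive 1D discrete Fourier transform."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, x in enumerate(xs)) for k in range(n)]

random.seed(0)
nz = 16
# One (x, y) grid column along Z: a local spread contribution plus a
# remote overlap contribution that would normally be reduced in first.
local_col = [random.random() for _ in range(nz)]
remote_col = [random.random() for _ in range(nz)]

# Reduce the overlap first, then transform ...
reduce_then_fft = dft([a + b for a, b in zip(local_col, remote_col)])
# ... or transform both contributions separately, then reduce.
fft_then_reduce = [a + b for a, b in zip(dft(local_col), dft(remote_col))]

diff = max(abs(a - b) for a, b in zip(reduce_then_fft, fft_then_reduce))
print(diff < 1e-9)  # True
```

Under that reading, "1D-FFT along Z, then reduce grid overlap" moves the reduction into the partially transformed grid rather than requiring complete columns up front.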

Will upload what I have as a WIP patch.
