## Feature #3355

### New PME parallel (GPU) scheme

**Description**

The current DD code has separate PP and PME domain decompositions.

What could give better performance, especially on GPUs, and would also be simpler is:

- send all local coordinates to PME rank, or pass on a combined PP/PME rank (GPU)

- spread local coordinates (GPU)

- do 1D-FFT along Z (GPU, could also be CPU)

- reduce grid overlap (CPU, maybe also GPU)

- transpose + 2D-FFT (CPU, maybe also GPU)

- solve (CPU, maybe also GPU)

- all operations above in reverse
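The ordering above puts the 1D-FFT along Z before the grid overlap reduction. A minimal sketch of why that can work, assuming the grid is decomposed along X only so that every rank holds complete Z lines: the DFT is linear, so transforming each rank's padded grid along Z and then summing the overlap gives the same result as summing first. The grid sizes, the single overlap plane and the helper names below are illustrative assumptions, not taken from the GROMACS code, and the ranks are emulated as plain Python lists.

```python
import cmath

def dft(values):
    """Naive O(n^2) 1D discrete Fourier transform, sufficient for a tiny demo."""
    n = len(values)
    return [sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dft_along_z(grid):
    """Transform every z-line of grid[x][y][z] independently."""
    return [[dft(line) for line in plane] for plane in grid]

def reduce_overlap(local_grids, owned, nx):
    """Sum padded per-rank grids into one global grid, wrapping periodically in x."""
    ny, nz = len(local_grids[0][0]), len(local_grids[0][0][0])
    total = [[[0j] * nz for _ in range(ny)] for _ in range(nx)]
    for rank, local in enumerate(local_grids):
        for lx, plane in enumerate(local):
            gx = (rank * owned + lx) % nx  # global x index of this local plane
            for y in range(ny):
                for z in range(nz):
                    total[gx][y][z] += plane[y][z]
    return total

# Hypothetical sizes: 4x2x4 grid, 2 ranks, each owning 2 x-planes plus 1 overlap plane.
NX, NY, NZ, OWNED, OVERLAP = 4, 2, 4, 2, 1
local_grids = [
    [[[float(1 + rank + 2 * lx + 3 * y + 5 * z * z) for z in range(NZ)]
      for y in range(NY)] for lx in range(OWNED + OVERLAP)]
    for rank in range(NX // OWNED)
]

# Order A: reduce the overlap first, then transform along z.
reduce_then_fft = dft_along_z(reduce_overlap(local_grids, OWNED, NX))
# Order B (as in the list above): transform along z per rank, then reduce.
fft_then_reduce = reduce_overlap([dft_along_z(g) for g in local_grids], OWNED, NX)

max_diff = max(abs(a - b)
               for pa, pb in zip(reduce_then_fft, fft_then_reduce)
               for la, lb in zip(pa, pb)
               for a, b in zip(la, lb))
print("largest difference between the two orderings:", max_diff)
```

The remaining transpose + 2D-FFT step is identical in both orderings, so it is left out of the comparison.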

The grid overlap is relatively small: it is pme_order-1 plus a few grid lines for particle diffusion over nstlist steps. We could precompute a conservative estimate of this or compute it while/after spreading.
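As a rough illustration of the precomputed estimate, the overlap width can be written as the spline support plus the number of grid lines a particle might cross before the next repartitioning. The function name and the `max_displacement_nm` input are hypothetical; how a conservative displacement bound over nstlist steps would actually be obtained is not specified here.

```python
import math

def overlap_grid_lines(pme_order, max_displacement_nm, grid_spacing_nm):
    """Conservative overlap width in grid lines (illustrative sketch only).

    pme_order - 1 lines come from the spline support; the rest covers an
    assumed upper bound on how far a particle moves over nstlist steps.
    """
    spline_lines = pme_order - 1
    diffusion_lines = math.ceil(max_displacement_nm / grid_spacing_nm)
    return spline_lines + diffusion_lines

# Example with assumed values: order 4, 0.1 nm displacement bound, 0.12 nm spacing.
width = overlap_grid_lines(pme_order=4, max_displacement_nm=0.1, grid_spacing_nm=0.12)
print(width)  # 3 spline lines + 1 diffusion line = 4
```

Computing the width while or after spreading would instead use the actual extent of the spread charges, which can be tighter than this bound.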

Note that communicating the grid usually has higher volume than communicating the particles+charges.

### History

#### #2 Updated by Szilárd Páll 8 months ago

Berk Hess wrote:

- do 1D-FFT along Z (GPU, could also be CPU)

- reduce grid overlap (CPU, maybe also GPU)

- transpose + 2D-FFT (CPU, maybe also GPU)

Related but independent tasks:

- implementing and testing a new mixed mode with only FFT on the CPU

- investigate spread fused with on-device FFT (possibly with some kind of grid decomposition)

If these sound useful, I can file redmine tasks -- at least for the former?

#### #3 Updated by Jonathan Vincent 8 months ago

Ok I have an initial version of this, for a 1D decomposition of the FFT grid.

I think I am missing something from the above. Right now I am doing it as follows:

- Reduce grid overlap (sending the order-1 values from remote)
- 2D FFT in Y and Z
- Transpose to YZX (distributed over Y)
- 1D FFT in X
- Solve
- 1D FFT in X
- Transpose to XYZ (distributed over X)
- 2D FFT in Y and Z
- Bring in halo data for gather (bring in the order-1 halo data from remote)
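The forward half of the steps above (2D FFT in Y and Z, transpose to YZX, 1D FFT in X) can be sketched as follows. This is only a pure-Python emulation with a naive DFT and ranks represented as list entries, not the WIP patch itself; the grid size, rank count and all function names are assumptions, and it relies on the same even-divisibility assumption. The overlap reduction, solve and inverse path are omitted.

```python
import cmath

def dft(values):
    """Naive O(n^2) 1D discrete Fourier transform."""
    n = len(values)
    return [sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dft_yz(slab):
    """2D DFT over y and z for every x-plane of slab[x][y][z]."""
    out = []
    for plane in slab:
        rows = [dft(line) for line in plane]                   # along z
        cols = [dft([rows[y][z] for y in range(len(rows))])    # along y
                for z in range(len(rows[0]))]
        out.append([[cols[z][y] for z in range(len(cols))] for y in range(len(rows))])
    return out

def transpose_to_yzx(x_slabs, nranks):
    """Emulate the all-to-all: x-distributed [x][y][z] -> y-distributed [y][z][x]."""
    full = [plane for slab in x_slabs for plane in slab]       # global [x][y][z]
    nx, ny, nz = len(full), len(full[0]), len(full[0][0])
    per_rank = ny // nranks
    return [[[[full[x][y][z] for x in range(nx)] for z in range(nz)]
             for y in range(r * per_rank, (r + 1) * per_rank)]
            for r in range(nranks)]

def dft_x(yzx_slab):
    """1D DFT along x; x is the innermost index after the transpose."""
    return [[dft(line) for line in row] for row in yzx_slab]

def dft3d_reference(grid):
    """Direct triple-sum 3D DFT on the undistributed grid, used only for checking."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    def coeff(kx, ky, kz):
        return sum(grid[x][y][z]
                   * cmath.exp(-2j * cmath.pi * (kx * x / nx + ky * y / ny + kz * z / nz))
                   for x in range(nx) for y in range(ny) for z in range(nz))
    return [[[coeff(kx, ky, kz) for kz in range(nz)] for ky in range(ny)]
            for kx in range(nx)]

# Hypothetical sizes: 4x4x2 grid on 2 ranks, evenly divisible in X and Y.
NX, NY, NZ, NRANKS = 4, 4, 2, 2
grid = [[[float((3 * x + 5 * y + 7 * z) % 11) for z in range(NZ)]
         for y in range(NY)] for x in range(NX)]
x_slabs = [grid[r * NX // NRANKS:(r + 1) * NX // NRANKS] for r in range(NRANKS)]

y_slabs = [dft_x(s) for s in transpose_to_yzx([dft_yz(s) for s in x_slabs], NRANKS)]

reference = dft3d_reference(grid)
max_err = max(abs(y_slabs[r][ly][kz][kx] - reference[kx][r * (NY // NRANKS) + ly][kz])
              for r in range(NRANKS) for ly in range(NY // NRANKS)
              for kz in range(NZ) for kx in range(NX))
print("largest deviation from the direct 3D DFT:", max_err)
```

In this layout each rank only needs remote data at the transpose, since the Y and Z transforms act on planes that are fully local to an X slab.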

Right now for simplicity I am just looking at a 1D decomposition of the FFT grid, and assuming everything evenly divides in both Y and X so we have the same grid sizes on all ranks at all times. There seems to be some existing code to deal with that complexity for the CPU side that I wanted to integrate with, which seemed better than making a lot of new code to deal with that case.

I was confused about how you could do an FFT in Z before you get the remote data. For example, if you have 20x20x20 split into 10x20x20, then the FFT over x_0,y_0,z_0->z_19 will depend on the data in x_11,y_0,z_0->z_19 on the remote rank, but I could be missing something.

Will upload what I have as a WIP patch.