Start PP work early, after the first dimension of the halo exchange
With a domain decomposition (DD) of at least two dimensions, work on data that has already been received (i.e. nonbonded or bonded kernels) can be started as soon as the communication along the first dimension completes.
This would improve communication/computation overlap, especially at high parallelization. It would help even when there is already overlap (e.g. on GPUs), as the amount of local work may not be sufficient to hide the cost of all communication.
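The expected benefit can be illustrated with a toy timeline model. This is only a sketch under simplifying assumptions, not the actual scheduling in the code: communication along the DD dimensions is assumed to be serialized, a single compute resource is assumed to run the local work first and then the work on halo data, and the function names and timings are hypothetical.

```python
def step_time_late_start(t_comm_dims, t_local, t_halo_dims):
    """Halo work starts only after communication along ALL dimensions is done."""
    comm_done = sum(t_comm_dims)
    # Local work overlaps with communication; halo work waits for both to finish.
    return max(comm_done, t_local) + sum(t_halo_dims)


def step_time_early_start(t_comm_dims, t_local, t_halo_dims):
    """Work on the halo data of dimension d starts once dimension d has arrived."""
    compute_free = t_local  # the compute resource is busy with local work first
    comm_done = 0.0
    for t_comm, t_work in zip(t_comm_dims, t_halo_dims):
        comm_done += t_comm  # assumption: dimensions are communicated one after another
        start = max(compute_free, comm_done)
        compute_free = start + t_work
    return compute_free


# Hypothetical timings (arbitrary units) for a 2D decomposition.
comm = [4, 4]   # communication time per dimension
halo = [3, 3]   # compute time on the data received per dimension

# Little local work: communication is exposed, so starting early helps.
print(step_time_late_start(comm, 2, halo))    # 14
print(step_time_early_start(comm, 2, halo))   # 11.0

# Plenty of local work: communication is already hidden, so there is no gain.
print(step_time_late_start(comm, 20, halo))   # 26
print(step_time_early_start(comm, 20, halo))  # 26.0
```

In this model the gain appears only when the local work is too short to cover the full communication time, which is the high-parallelization regime described above.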