use MPI non-blocking collectives to overlap pull comm
MPI non-blocking collectives should allow overlapping the costly collective COM pull communication with compute, and in most cases side-step the issue of having to involve all ranks in the pull communication with external potentials.
#2 Updated by Szilárd Páll over 2 years ago
Having looked at the code, we will need to split pull_calc_coms() and start the non-blocking collective early, probably right after halo exchange. As a fallback, for thread-MPI and for cases where non-blocking collectives are unavailable or should not be used, the early communication should be done conditionally. The extra cost of splitting pull_calc_coms() is incurring cache misses twice when accessing the atom coordinates.
#3 Updated by Szilárd Páll over 2 years ago
MPI_Iallreduce requires MPI 3.0. We will definitely need to test what the requirements for independent progress are -- grabbing all hardware threads might actually prevent it. We might also want to use MPI_Ibarrier to hide the barrier cost and make the notification asynchronous/non-collective.