Task #3208

improve PP-PME tuning

Added by Szilárd Páll 11 months ago. Updated 9 months ago.

Status: Closed
Priority: High
Assignee: -
Category: mdrun
Target version:
Difficulty: uncategorized

Description

PME tuning increasingly often picks a wrong setup, in particular with fast-iterating runs and with the new GPU paths in the 2020 release.

Use-cases that need attention:
  • small inputs / fast iterating where first measurements underestimate performance
  • GPU update path: need to sync the GPU stream before taking the cycle measurement
  • separate PME rank case with GPU update: the PP/PME imbalance cannot be measured reliably, so the mechanism that conditionally turns on balancing needs to be disabled for this case.
adhd.tpr (7.06 MB) - Alan Gray, 12/17/2019 01:26 PM
cellulose.tpr (9.57 MB) - Alan Gray, 12/17/2019 01:27 PM
pmetune.JPG (39.5 KB) - Alan Gray, 12/17/2019 01:31 PM

Associated revisions

Revision 0bd020f3 (diff)
Added by Szilárd Páll 10 months ago

PP-PME load balancing improvements

Add a minimum number of nstlist tuning intervals and a minimum time delay
at the beginning of the run before the load balancing starts. This allows
hardware clocks to ramp up and avoids early measurements underestimating
performance, which would make subsequent measurements with different grid
setups appear faster only due to hardware warm-up.
Also use global variables to allow adjusting the number of measurements
skipped after switching configs.

Refs #3208
Fixes #2203

Change-Id: If835d2482e127caa51d50f45f25c19144d35efaa

Revision db9bd671 (diff)
Added by Alan Gray 10 months ago

Fix PME tuning issue when PME-PP communications are active

If PME-PP direct GPU communication is active, the CPU-GPU
asynchronicity in the codepath causes the timing load balance ratio
used to trigger PME tuning to be unreliable. This change
unconditionally starts PME tuning when the PME-PP communication
feature is active (unless PME tuning has been deactivated with
the -notunepme option).

Partially addresses #3208

Change-Id: Iea173d19062dbd11d57ca9ceb4e52b9f20d4ff15

History

#1 Updated by Berk Hess 10 months ago

What should we do? Add a fixed time delay before measuring times?

#2 Updated by Szilárd Páll 10 months ago

Berk Hess wrote:

What should we do? Add a fixed time delay before measuring times?

That would likely be enough (plus re-tuning, ideally) for the issues I previously observed with suboptimal tuning results. However, I have not looked into whether that is enough for the new code paths. We might also need to increase the measurement interval with a single setup.

Enabling periodic re-tuning is likely quite important too to avoid long-term performance degradation, but I have not had time to look into it.

#3 Updated by Alan Gray 10 months ago

I looked into this a bit more and the main problem with the new code-paths is that we are no longer synchronizing with the GPU before taking cycle counter readings. In pme_gpu_wait_finish_task() we now have

    if (!pme->gpu->settings.useGpuForceReduction || haveComputedEnergyAndVirial)
    {
        pme_gpu_synchronize(pme->gpu);
    }

so the sync is not called when the force reduction is on the GPU, and the cycle counters include only the CPU launch time. If I comment out this if statement, PME tuning kicks back in for my Cellulose case (see https://redmine.gromacs.org/issues/2965), and performance increases from 103.3 ns/day to 129.4 ns/day. Something similar may also be happening on the PP side. So we need to either make the sync(s) happen while PME tuning is active, or replace CPU-side timing with GPU-side timing using events.

#4 Updated by Alan Gray 10 months ago

https://gerrit.gromacs.org/c/gromacs/+/14622 adds syncs: the PME and PP CPUs are synchronized to their respective GPUs when load balancing is active, so that the timing ratio is correct, and the PP CPU is synchronized to its GPU after the update and constraints, so that the ewcSTEP timer is correct.

Note from Szilard for further improvement:
"Additionally, we would benefit from removing the reliance on ewcSTEP (and e.g. instead record the timing between two balancing calls inside pme_loadbal_do(). This would allow moving the sync to the beginning of the search step and avoid cross-step dependencies."

#5 Updated by Alan Gray 10 months ago

The problem is, in pme_loadbal_do():

    /* If PME rank load is too high, start tuning */
    pme_lb->bBalance = (dd_pme_f_ratio(cr->dd) >= loadBalanceTriggerFactor);

The ratio is wrong with the new code paths, since the CPU timers are read before the CPU has synchronized with the GPU, so bBalance is set to false even when load balancing would be beneficial.

Patch https://gerrit.gromacs.org/c/gromacs/+/14622 fixes the ratio by syncing PME and PP CPUs to their respective GPUs whilst load balancing is active.

Alternatively, unconditionally setting bBalance to true is another (simpler) way of allowing load balancing to kick in. This could be done only when the new code paths are active - thoughts welcome.

As can be seen below, with 4x GPU the PME task is the bottleneck and load balancing is beneficial. With 2x GPU there is much less of an effect, since the PP task is the bottleneck.

run with:
export GMX_USE_GPU_BUFFER_OPS=1
export GMX_GPU_DD_COMMS=1
export GMX_GPU_PME_PP_COMMS=1
export GMX_FORCE_UPDATE_DEFAULT_GPU=1
gmx mdrun -s topol.tpr -ntomp 8 -pme gpu -nb gpu -ntmpi 4 -npme 1 -nsteps 50000 -resetstep 40000 -v -pin on -bonded gpu -noconfout -update gpu -nstlist 200

#6 Updated by Szilárd Páll 10 months ago

Alan Gray wrote:

The problem is, in pme_loadbal_do():

[...]

The ratio is wrong with the new code paths, since CPU timers are set before the CPU has synchronized to the GPU, and bBalance is set to false even when load balancing can be beneficial.

I think the problem is a fundamental one: the load balancer was not designed for async work execution and in a context where there is no ability to measure the time the PP and PME work took. For that reason, as noted on gerrit, we should not try to shoehorn the new code-path into the existing load balancing scheme, I think.

Patch https://gerrit.gromacs.org/c/gromacs/+/14622 fixes the ratio by syncing PME and PP CPUs to their respective GPUs whilst load balancing is active.

That still does not necessarily measure the load that we want to calculate the ratio of PP to PME work from.

Alternatively, if we unconditionally set bBalance to true, this is another (simpler) way of allowing load balancing to kick-in. This could be done only if the new code paths are active - thoughts welcome.

I suggest passing the dev flags (let's not add more getenv calls) and making it unconditional when the PP-PME comms flag is set. I think that's the best we can do at this stage (and in release-2020).
Otherwise, in the future we need to devise a significantly different load balancing scheme (both DD and PP-PME), especially considering that CPU-side timing of asynchronous CUDA work is not possible. We need to either figure out a way to measure load directly on the GPU, or at least measure the time spent waiting, before the reduction kernel, for the PP or PME forces to be ready, and use that wait time as above.

Unrelated side-note: the DD DLB is also broken as we can't measure PP load either; for that to work in a sensible way, we should probably switch to cycle-based balancing.

#7 Updated by Alan Gray 10 months ago

OK, makes sense - I've now uploaded https://gerrit.gromacs.org/c/gromacs/+/14898 which does this, as a replacement for https://gerrit.gromacs.org/c/gromacs/+/14622

#8 Updated by Szilárd Páll 10 months ago

  • Description updated (diff)

#9 Updated by Paul Bauer 9 months ago

  • Status changed from New to Resolved

#10 Updated by Paul Bauer 9 months ago

  • Status changed from Resolved to Closed
