Bug #3124

significant performance loss due to DLB auto-off when PP/PME load >1

Added by Szilárd Páll 12 months ago. Updated 8 months ago.


We have automation in place that keeps DLB off when the force computation is dominated by PME (i.e. relative PME/PP load >1), even when the measured imbalance is higher than the threshold for switching.
However, I noticed that switching DLB on in at least one such case gives a major performance improvement (18% in the attached case).

We should revise the automation to avoid performance loss.

Also note that the DLB report output is incorrect: the above heuristic silently overrides the switch, and the override is reported as:

Dynamic load balancing report:
 DLB was off during the run due to low measured imbalance.
 Average load imbalance: 18.0%.
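
The override and the misleading report line can be illustrated with a minimal Python sketch. This is not the mdrun implementation: the function names, the 5% switching threshold, the PME/PP ratio of 1.2, and the wording of the alternative reason strings are all assumptions made for illustration. The point is only that recording the reason at decision time would let the report distinguish the override from genuinely low imbalance.

```python
from dataclasses import dataclass

# Hypothetical switching threshold; the real value used by mdrun may differ.
IMBALANCE_THRESHOLD = 0.05


@dataclass
class DlbDecision:
    turn_on: bool
    reason: str


def decide_dlb(imbalance: float, pme_pp_ratio: float) -> DlbDecision:
    """Sketch of the automation described above (all names are assumptions)."""
    if imbalance < IMBALANCE_THRESHOLD:
        return DlbDecision(False, "low measured imbalance")
    if pme_pp_ratio > 1.0:
        # The silent override: imbalance is high, but PME dominates.
        return DlbDecision(False, "PME-dominated load")
    return DlbDecision(True, "measured imbalance above threshold")


def report(decision: DlbDecision, imbalance: float) -> str:
    """Report the recorded reason instead of always blaming low imbalance."""
    state = "on" if decision.turn_on else "off"
    return (f"DLB was {state} during the run due to {decision.reason}.\n"
            f"Average load imbalance: {imbalance * 100:.1f}%.")


# 18% imbalance as in the report above; the ratio 1.2 is a made-up value.
decision = decide_dlb(imbalance=0.18, pme_pp_ratio=1.2)
print(report(decision, 0.18))
```

With these inputs the sketch keeps DLB off but names the PME-dominated load as the cause, rather than low imbalance.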


#1 Updated by Szilárd Páll 12 months ago

  • Description updated (diff)

Having looked into the specifics of the run, we concluded that:
- this is a new use case enabled by update groups: ranks can now get out of sync, and load imbalance leads to waiting in the x exchange and PME redist;
- the runs are not able to switch to the 0.919 cutoff during PME tuning, which could lower the PME cost; this might be the reason why DLB ends up remaining off.

The latter suggests that, if the PP-PME tuning successfully switched to a coarser grid, DLB might kick in and the expected -dlb yes performance could be achieved. Runs with twin cutoff to test this are in progress.

#2 Updated by Szilárd Páll 12 months ago

Data shows that with twin cut-off DLB does kick in, as expected. Unexpectedly, however, the performance degradation check fires, and in 3 of 5 runs it turns DLB off again. Two example logs attached.

#3 Updated by Berk Hess 12 months ago

The degradation check fires at the counter reset step. I think some logic is incorrect there. This likely doesn't affect production runs then.

So we should turn off the PME limit check when we do not have inter-domain constraints. The question is whether there is more we can do.

#4 Updated by Berk Hess 12 months ago

Note that the DLB degradation check is based on the cycles for the full step and exponential averaging, so that is quite safe. But the reference value is problematic, as that is an average of nstlist-1 steps from just before DLB got turned on. This reference might be affected by noise, and possibly by CPU clocks not yet running at full speed when it is taken just after starting mdrun.
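
To see why the reference value is the weak point, consider a minimal Python sketch of such a check. It is an illustration only: the function names, the smoothing factor, the 10% margin, and the cycle counts are assumptions, not values taken from mdrun.

```python
from statistics import mean


def reference_cycles(step_cycles: list[float], nstlist: int) -> float:
    """Average the last nstlist-1 step cycle counts before DLB is turned on."""
    return mean(step_cycles[-(nstlist - 1):])


def exp_average(samples: list[float], alpha: float = 0.1) -> float:
    """Exponential moving average; alpha is a hypothetical smoothing factor."""
    avg = samples[0]
    for s in samples[1:]:
        avg = (1.0 - alpha) * avg + alpha * s
    return avg


def dlb_degrades(reference: float, with_dlb: list[float], margin: float = 0.1) -> bool:
    """Flag degradation when the smoothed cost exceeds the reference by margin."""
    return exp_average(with_dlb) > (1.0 + margin) * reference


# Made-up cycle counts per step; the cost with DLB on is a steady 100 here.
with_dlb = [100.0] * 50

# Same run, two reference windows: one representative, one that happened
# to catch a few unusually cheap steps just before DLB was switched on.
steady_reference = reference_cycles([100.0] * 9, nstlist=10)
noisy_reference = reference_cycles([100.0] * 3 + [85.0] * 9, nstlist=10)

print(dlb_degrades(steady_reference, with_dlb))  # False
print(dlb_degrades(noisy_reference, with_dlb))   # True
```

In this toy case the cost with DLB on is identical in both calls. Only the short reference window differs, and that alone is enough to flip the outcome of the check.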

#5 Updated by Szilárd Páll 11 months ago

  • Target version set to 2020

2020 set as tentative target

#6 Updated by Paul Bauer 9 months ago

  • Target version changed from 2020 to 2021

changed to next year

#7 Updated by Szilárd Páll 8 months ago

Perhaps we should consider removing the DLB limitations with update groups and try to turn on DLB even if the PME/PP load is >1?
