significant performance loss due to DLB auto-off when PP/PME load >1
We have automation in place that keeps DLB off when the force computation is dominated by PME (i.e. relative PP/PME load >1), even when the measured imbalance is higher than the threshold for switching.
However, I noticed that in at least one such case switching DLB on gives a major performance improvement (18% in the attached case).
We should revise the automation to avoid performance loss.
Also note that the DLB report output is incorrect: the above heuristic silently overrides the imbalance-based switching, yet the run is reported as:
Dynamic load balancing report: DLB was off during the run due to low measured imbalance. Average load imbalance: 18.0%.
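To illustrate the reporting problem, here is a minimal sketch in Python (not GROMACS code; the function names, the enum, and the 5% threshold are hypothetical and chosen only for illustration). It shows how recording the reason DLB stayed off would let the report distinguish the PME-limit override from a genuinely low imbalance:

```python
from enum import Enum


class DlbOffReason(Enum):
    LOW_IMBALANCE = "low measured imbalance"
    PME_LIMITED = "the run being limited by PME"


def dlb_off_reason(imbalance, pme_limited, threshold=0.05):
    """Return why DLB stays off, or None if it may be switched on.

    threshold is a made-up illustrative value, not a GROMACS default.
    """
    if imbalance < threshold:
        return DlbOffReason.LOW_IMBALANCE
    if pme_limited:
        # The silent override: imbalance is above the threshold,
        # but the PME heuristic still keeps DLB off.
        return DlbOffReason.PME_LIMITED
    return None


def dlb_report(imbalance, pme_limited, threshold=0.05):
    """Build a report line that names the actual reason DLB was off."""
    reason = dlb_off_reason(imbalance, pme_limited, threshold)
    if reason is None:
        return f"DLB may be turned on. Measured load imbalance: {imbalance:.1%}."
    return (f"DLB was off during the run due to {reason.value}. "
            f"Average load imbalance: {imbalance:.1%}.")
```

With an 18% imbalance and a PME-limited run, `dlb_report(0.18, True)` would name the PME limit rather than low imbalance as the reason.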
#1 Updated by Szilárd Páll 6 months ago
- Description updated (diff)
Having looked into the specifics of the run, we concluded that:
- this is a new use case enabled by the update groups: ranks can now get out of sync, and load imbalance leads to waiting in x exchange and PME redist;
- the runs are not able to switch to the 0.919 cutoff during PME tuning, which could lower the PME cost; this might be the reason why DLB ends up remaining off.
The latter suggests that, if the PP-PME tuning successfully switched to a coarser grid, DLB might kick in and the expected -dlb yes performance could be achieved. WIP: runs with twin cutoff to test this.
#2 Updated by Szilárd Páll 6 months ago
- File test_dev-purley02_80x1_dlb-auto_GMX20_twin-cut_LONG_1.log added
- File test_dev-purley02_80x1_dlb-auto_GMX20_twin-cut_LONG_2.log added
The data shows that with twin cut-off DLB does kick in, as expected, but, unexpectedly, the performance degradation check fires and turns DLB off again in 3 of 5 runs. Two example logs are attached.
The degradation check fires at the counter reset step. I think some logic is incorrect there. If so, this likely doesn't affect production runs.
So we should turn off the PME limit check when we do not have inter-domain constraints. The question is whether there is more we can do.
Note that the DLB degradation check is based on the cycles for the full step with exponential averaging, so that part is quite safe. The reference value is problematic, though, as it is an average over nstlist-1 steps from just before DLB got turned on. This reference might be affected by noise and, when it is taken just after starting mdrun, possibly by CPU clocks that have not yet reached full speed.
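To make the concern about the reference value concrete, here is a minimal sketch in Python (not the GROMACS implementation; the function names, the smoothing factor `alpha`, and the 5% tolerance are hypothetical). It compares an exponentially averaged per-step cycle count with DLB on against a plain average over a short pre-DLB window:

```python
def ema(values, alpha):
    """Exponential moving average; alpha is the weight of the newest sample."""
    avg = values[0]
    for v in values[1:]:
        avg = alpha * v + (1.0 - alpha) * avg
    return avg


def dlb_degrades(reference_cycles, dlb_cycles, alpha=0.1, tolerance=0.05):
    """True if the averaged step cost with DLB exceeds the reference by > tolerance.

    reference_cycles: per-step cycles from the short window just before DLB
                      was turned on (nstlist-1 samples in the description above).
    dlb_cycles:       per-step cycles measured with DLB on.
    alpha, tolerance: illustrative assumptions, not GROMACS values.
    """
    reference = sum(reference_cycles) / len(reference_cycles)
    return ema(dlb_cycles, alpha) > reference * (1.0 + tolerance)


# Hypothetical numbers: nstlist = 10 gives a 9-step reference window.
with_dlb = [100.0] * 50
low_reference = [94.0] * 9    # reference window that happened to come out low
high_reference = [104.0] * 9  # reference window that happened to come out high
```

With these made-up numbers the same DLB measurements are judged a degradation against the low reference but not against the high one, so the outcome is decided by the short reference window rather than by the well-averaged DLB side.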