Bug #1089

PP-PME load balancing discards tested fastest setting

Added by Szilárd Páll over 6 years ago. Updated over 6 years ago.

Status: Closed
Priority: High
Assignee:
Category: mdrun
Target version:
Affected version - extra info: 4.6-beta2
Affected version:
Difficulty: uncategorized

Description

Because DD and PP-PME load balancing influence each other, mdrun can end up discarding a setting that is faster than any other it tried: DD load balancing can shrink the domains enough that the fast cut-off setting becomes too long for the cell size.

This already happens at moderate levels of parallelization, e.g. the attached log files were obtained on 1-2 nodes of a Cray XK7 (16C Bulldozer + K20X) running a 134k-atom system.

While a proper solution to this issue would be to address the DD load imbalance itself, this requires substantial effort and will therefore happen post-4.6. However, the eager nature of the DD load balancing means that it can not only prevent an already discovered, known-to-be-faster setting from being used (which a user could notice, although that's quite unlikely), but can also keep such a cut-off from ever being tested by the PP-PME load balancing if the domains shrink fast enough.
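The interaction described above can be illustrated with a toy model. This is only a sketch, not mdrun code: the `Setting` class, the function names, and all timings and cell sizes are hypothetical, and it makes the simplifying assumption that a cut-off is usable only while it is no longer than the smallest DD cell.

```python
from dataclasses import dataclass


@dataclass
class Setting:
    cutoff_nm: float    # cut-off tried by the PP-PME tuner (hypothetical value)
    ms_per_step: float  # measured step time for that cut-off (made-up number)


def usable(settings, min_cell_nm):
    """Keep only settings whose cut-off still fits in the smallest DD cell."""
    return [s for s in settings if s.cutoff_nm <= min_cell_nm]


def pick_fastest(settings, min_cell_nm):
    """Return the fastest timed setting that is still usable, or None."""
    candidates = usable(settings, min_cell_nm)
    return min(candidates, key=lambda s: s.ms_per_step) if candidates else None


# Settings timed by the tuner; the longest cut-off is the fastest here.
timed = [Setting(1.0, 12.0), Setting(1.2, 10.5), Setting(1.4, 9.8)]

# Before DLB shrinks the cells, the fastest setting fits.
before = pick_fastest(timed, min_cell_nm=1.5)

# After DLB has shrunk the smallest cell, the fastest setting is discarded
# and a slower one is chosen instead.
after = pick_fastest(timed, min_cell_nm=1.25)
```

In this model `before` is the 1.4 nm setting and `after` falls back to the slower 1.2 nm one, which is the behaviour the report describes. If the cells shrink before the 1.4 nm setting is ever timed, it would not appear in `timed` at all.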

Associated revisions

Revision 01da74de (diff)
Added by Berk Hess over 6 years ago

made PME load balancing + DD DLB more efficient

The DD dynamic load balancing is now limited, such that the fastest
timed PME load balancing cut-off setting can always be used.
Fixes #1089

Change-Id: I3216dfd5a8b2b0676eee5519e08cf36e06047251
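The idea stated in the commit message, limiting DLB so the fastest timed cut-off stays usable, might be sketched as a lower bound on the cell size. This is an illustration under assumptions, not the code from revision 01da74de: the function name, its parameters, and the one-dimensional cell model are all hypothetical.

```python
def limited_dlb_cell_size(cell_nm, shrink_nm, fastest_cutoff_nm):
    """Shrink a cell by shrink_nm, but not below the fastest timed cut-off.

    Hypothetical 1D model: a cell may shrink only as long as it stays at
    least as large as the cut-off of the fastest timed PME setting.
    """
    # Never let this step grow a cell that is already below the cut-off.
    floor_nm = min(cell_nm, fastest_cutoff_nm)
    return max(cell_nm - shrink_nm, floor_nm)


# A small shrink request is applied in full.
small = limited_dlb_cell_size(cell_nm=2.0, shrink_nm=0.25, fastest_cutoff_nm=1.5)

# A large shrink request is clamped at the fastest timed cut-off.
clamped = limited_dlb_cell_size(cell_nm=2.0, shrink_nm=0.75, fastest_cutoff_nm=1.5)
```

With such a bound the DD load balancing gives up some freedom to even out the load, in exchange for keeping the fastest timed cut-off available.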

History

#1 Updated by Berk Hess over 6 years ago

  • Status changed from New to Feedback wanted

I uploaded a fix to gerrit. I tried to test it on grizzly, but it's difficult to come up with a configuration where this happens. Could you test it?

#2 Updated by Berk Hess over 6 years ago

  • Status changed from Feedback wanted to Closed
