Bug #3335

mdrun can deadlock with multiple ranks and separate PME ranks

Added by Szilárd Páll 7 months ago. Updated 7 months ago.

Status: Closed
Priority: Normal
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

The absolute time delay before PP-PME load balancing starts is checked against timings measured independently on each PP rank, so the timings can deviate between ranks. Some ranks can then enter load balancing mode while others skip it, resulting in a deadlock.

Associated revisions

Revision 133c12ee (diff)
Added by Szilárd Páll 7 months ago

Fix deadlock in PP-PME balancing startup delay

The condition whether load balancing should start included a per-rank
computed fixed time-delay. Therefore different ranks could evaluate this
condition differently resulting in a deadlock.
This change fixes the deadlock by broadcasting the result of the time
delay check.

Fixes #3335

Change-Id: I39ebc7e99483a6837bdbd79e312148384c1966b3
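
The failure mode and the fix described in this revision can be illustrated with a small simulation. This is a sketch only, not the GROMACS code: ranks are modelled as list entries, and the names `DELAY_SECONDS`, `local_delay_passed`, `decide_per_rank` and `decide_with_broadcast` are hypothetical. It also assumes the broadcast result comes from rank 0, which the commit message does not state.

```python
# Hypothetical sketch, not GROMACS code: ranks are simulated as list entries.
DELAY_SECONDS = 5.0  # assumed startup delay before balancing may begin


def local_delay_passed(elapsed_seconds, delay=DELAY_SECONDS):
    """Per-rank check: has this rank's own timer exceeded the delay?"""
    return elapsed_seconds >= delay


def decide_per_rank(elapsed_per_rank):
    """Problematic pattern: every rank trusts its own timer."""
    return [local_delay_passed(t) for t in elapsed_per_rank]


def decide_with_broadcast(elapsed_per_rank, root=0):
    """Fixed pattern: one rank evaluates the check, all ranks adopt its result.

    The choice of rank 0 as root is an assumption made for this sketch.
    """
    root_decision = local_delay_passed(elapsed_per_rank[root])
    # Stands in for a collective broadcast (for example MPI_Bcast) from `root`.
    return [root_decision for _ in elapsed_per_rank]


# Timers measured independently on four PP ranks straddle the threshold.
elapsed = [4.98, 5.01, 4.99, 5.02]

per_rank = decide_per_rank(elapsed)
broadcast = decide_with_broadcast(elapsed)

print(per_rank)   # [False, True, False, True]
print(broadcast)  # [False, False, False, False]
```

With per-rank decisions, ranks 1 and 3 would enter the balancing code path and wait in its collective communication while ranks 0 and 2 never arrive, which is the deadlock. With a single broadcast decision all ranks take the same branch, at the cost of one extra collective call per check.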

History

#1 Updated by Szilárd Páll 7 months ago

  • Status changed from New to In Progress

#2 Updated by Szilárd Páll 7 months ago

  • Status changed from In Progress to Fix uploaded

#3 Updated by Szilárd Páll 7 months ago

  • Status changed from Fix uploaded to Resolved

#4 Updated by Szilárd Páll 7 months ago

  • Status changed from Resolved to Closed
