Bug #1760

DLB + PME tuning inconsistency

Added by Szilárd Páll over 4 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: core library
Target version:
Affected version - extra info: 341fe0c
Affected version:
Difficulty: uncategorized

Description

Although the output claims that DLB will not turn on during PP-PME load balancing, which is the expected behavior, I have just done a run where it did turn on:

$ $gmx mdrun -quiet -v -nsteps -1 

Back Off! I just backed up md.log to ./#md.log.21#

Running on 1 node with total 4 cores, 8 logical cores, 2 compatible GPUs
Hardware detected:
  CPU info:
    Vendor: GenuineIntel
    Brand:  Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
    SIMD instructions most likely to fit this hardware: AVX2_256
    SIMD instructions selected at GROMACS compile time: AVX2_256
  GPU info:
    Number of GPUs detected: 2
    #0: NVIDIA GeForce GTX 960, compute cap.: 5.2, ECC:  no, stat: compatible
    #1: NVIDIA GeForce GTX 760, compute cap.: 3.0, ECC:  no, stat: compatible

Reading file topol.tpr, VERSION 4.6-beta3-dev-20121222-492378e (single precision)
Note: file tpx version 82, software tpx version 103
Changing nstlist from 10 to 40, rlist from 1 to 1.101

Overriding nsteps with value passed on the command line: -1 steps

Using 2 MPI threads
Using 4 OpenMP threads per tMPI thread

2 compatible GPUs are present, with IDs 0,1
2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 2 PP ranks in this node: 0,1

Back Off! I just backed up ener.edr to ./#ener.edr.16#

NOTE: DLB will not turn on during the first phase of PME tuning

starting mdrun 'Water'
-1 steps, infinite ps.
step   80: timed with pme grid 100 52 52, coulomb cutoff 1.000: 1605.8 M-cycles
step  160: timed with pme grid 84 42 42, coulomb cutoff 1.187: 1471.5 M-cycles
step  240: timed with pme grid 72 36 36, coulomb cutoff 1.384: 1397.4 M-cycles

NOTE: Turning on dynamic load balancing

NOTE: the minimum cell size is smaller than 1.05 times the cell size limit, will not turn on dynamic load balancing

step  320: timed with pme grid 64 32 32, coulomb cutoff 1.557: 1680.1 M-cycles
step  400: timed with pme grid 72 36 36, coulomb cutoff 1.384: 1035.3 M-cycles
step  480: timed with pme grid 80 40 40, coulomb cutoff 1.246: 1084.4 M-cycles
step  560: timed with pme grid 84 42 42, coulomb cutoff 1.187: 1126.3 M-cycles
step  640: timed with pme grid 96 44 44, coulomb cutoff 1.133: 1147.4 M-cycles
step  720: timed with pme grid 72 36 36, coulomb cutoff 1.384: 1224.6 M-cycles
step  800: timed with pme grid 80 40 40, coulomb cutoff 1.246: 997.3 M-cycles
step  880: timed with pme grid 84 42 42, coulomb cutoff 1.187: 1279.8 M-cycles
step  960: timed with pme grid 96 44 44, coulomb cutoff 1.133: 1416.5 M-cycles
step 1040: timed with pme grid 72 36 36, coulomb cutoff 1.384: 1171.5 M-cycles
step 1120: timed with pme grid 80 40 40, coulomb cutoff 1.246: 1393.7 M-cycles
              optimal pme grid 80 40 40, coulomb cutoff 1.246

NOTE: DLB can now turn on, when beneficial

NOTE: DLB can now turn on, when beneficial

Also, the notes above could use de-duplication.

topol.tpr (1.11 MB), added by Szilárd Páll, 07/02/2015 02:19 PM

Associated revisions

Revision 5e1339d8 (diff)
Added by Berk Hess over 4 years ago

Fix two PME DLB trigger issues

Dynamic load balancing got triggered while locked by PME load
balancing, because a check was placed incorrectly.
PME load balancing would never trigger with separate PME ranks
because a comparison was inverted.

Fixes #1760.
Fixes #1763.

Change-Id: I75eeb32423b864f84bfd45ecb61d169b473ed74a
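
The first issue described in this commit message (DLB triggering while locked) can be illustrated with a minimal sketch. This is not the GROMACS source: `DlbControl`, `maybeTurnOnDlb`, and the imbalance threshold parameter are hypothetical names chosen for illustration. The sketch only shows the general idea that the lock has to be checked before the imbalance test is allowed to switch DLB on.

```cpp
// Hypothetical, simplified model of a DLB trigger; not the GROMACS code.
struct DlbControl
{
    bool isOn     = false; // DLB is currently active
    bool isLocked = false; // set while PME load balancing is timing grids
};

// Returns true if DLB was switched on by this call.
// The lock check comes first: while PME load balancing holds the lock,
// no amount of measured imbalance may turn DLB on.
bool maybeTurnOnDlb(DlbControl& dlb, double measuredImbalance, double threshold)
{
    if (dlb.isLocked || dlb.isOn)
    {
        return false;
    }
    if (measuredImbalance < threshold)
    {
        return false;
    }
    dlb.isOn = true;
    return true;
}
```

With this ordering, a call made while `isLocked` is set returns `false` regardless of the imbalance, and the same call succeeds once the lock is released.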

Revision c1da1c9b (diff)
Added by Berk Hess over 4 years ago

Fix DD DLB state issue

The introduction of DLB locking for PME load balancing added another
DLB state, which was stored in a third variable. These variables
were not always all properly checked. Simplified the code by merging
these three state variables into one. In addition, there was a fourth
variable (bGridJump) in gmx_domdec_t; this is replaced by calls to
a function returning whether DLB is on.

Refs #1760.

Change-Id: I80d499149e4e5bfd689e76208384a8ba61e2842a
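
The simplification described in this commit message, one state variable plus a query function instead of several flags, might look roughly like the sketch below. All names here (`DlbState`, `isDlbOn`, `lockDlb`, `unlockDlb`, `turnOnDlbIfAllowed`) are assumptions made for illustration and are not taken from the GROMACS source; the actual set of states in the code may differ.

```cpp
// Hypothetical sketch: a single enum replaces separate "on", "can turn on",
// and "locked" flags, so inconsistent combinations cannot be represented.
enum class DlbState
{
    OffForever,           // DLB disabled for the whole run
    OffCanTurnOn,         // off, may turn on when imbalance is measured
    OffTemporarilyLocked, // off, locked by PME load balancing
    On                    // DLB is active
};

// Stands in for a separate boolean such as bGridJump: callers ask the
// state instead of reading a second variable that could get out of sync.
constexpr bool isDlbOn(DlbState s)
{
    return s == DlbState::On;
}

// Locking only affects the state from which DLB could otherwise turn on.
constexpr DlbState lockDlb(DlbState s)
{
    return s == DlbState::OffCanTurnOn ? DlbState::OffTemporarilyLocked : s;
}

constexpr DlbState unlockDlb(DlbState s)
{
    return s == DlbState::OffTemporarilyLocked ? DlbState::OffCanTurnOn : s;
}

// Turning on is a no-op in every state except OffCanTurnOn.
constexpr DlbState turnOnDlbIfAllowed(DlbState s)
{
    return s == DlbState::OffCanTurnOn ? DlbState::On : s;
}
```

Because every transition goes through one variable, a locked state cannot be combined with an "on" flag by accident, which is the kind of inconsistency the commit message says the separate variables allowed.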

History

#1 Updated by Berk Hess over 4 years ago

  • Status changed from New to In Progress
  • Assignee set to Berk Hess

#2 Updated by Gerrit Code Review Bot over 4 years ago

Gerrit received a related patchset '1' for Issue #1760.
Uploader: Berk Hess
Change-Id: I75eeb32423b864f84bfd45ecb61d169b473ed74a
Gerrit URL: https://gerrit.gromacs.org/4823

#3 Updated by Mark Abraham over 4 years ago

I haven't succeeded at triggering this behaviour to see whether it is fixed. Can you share a .tpr please Szilard?

#4 Updated by Szilárd Páll over 4 years ago

I've attached the input, but there is nothing special about it: it's not even an imbalanced system, just a box of water. The reported issue should be reproducible with any system, as long as enough load imbalance is measured, using the tagged affected version (or current HEAD of rel 5.1), and it is fixed by Berk's change.

What was not clear to me is whether the "NOTE: the minimum cell size is ..." message is triggered at the same step as the preceding one. Based on this output it has to be triggered either at the same step or the next nstlist step.

#5 Updated by Berk Hess over 4 years ago

  • Status changed from In Progress to Resolved

The message is printed in turn_on_dlb, so it can only be at the same step.

#6 Updated by Szilárd Páll over 4 years ago

Not done just yet, but will be as soon as Change 4834 gets approved.

#7 Updated by Gerrit Code Review Bot over 4 years ago

Gerrit received a related patchset '3' for Issue #1760.
Uploader: Berk Hess
Change-Id: I80d499149e4e5bfd689e76208384a8ba61e2842a
Gerrit URL: https://gerrit.gromacs.org/4834

#8 Updated by Berk Hess over 4 years ago

  • % Done changed from 0 to 100

#9 Updated by Erik Lindahl over 4 years ago

  • Status changed from Resolved to Closed
