DLB + PME tuning inconsistency
While the output claims that DLB does not turn on during PP-PME load balancing, which is the expected behavior, I've just done a run where it did turn on:
$ $gmx mdrun -quiet -v -nsteps -1
Back Off! I just backed up md.log to ./#md.log.21#
Running on 1 node with total 4 cores, 8 logical cores, 2 compatible GPUs
Hardware detected:
CPU info:
Vendor: GenuineIntel
Brand: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
SIMD instructions most likely to fit this hardware: AVX2_256
SIMD instructions selected at GROMACS compile time: AVX2_256
GPU info:
Number of GPUs detected: 2
#0: NVIDIA GeForce GTX 960, compute cap.: 5.2, ECC: no, stat: compatible
#1: NVIDIA GeForce GTX 760, compute cap.: 3.0, ECC: no, stat: compatible
Reading file topol.tpr, VERSION 4.6-beta3-dev-20121222-492378e (single precision)
Note: file tpx version 82, software tpx version 103
Changing nstlist from 10 to 40, rlist from 1 to 1.101
Overriding nsteps with value passed on the command line: -1 steps
Using 2 MPI threads
Using 4 OpenMP threads per tMPI thread
2 compatible GPUs are present, with IDs 0,1
2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 2 PP ranks in this node: 0,1
Back Off! I just backed up ener.edr to ./#ener.edr.16#
NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'Water'
-1 steps, infinite ps.
step 80: timed with pme grid 100 52 52, coulomb cutoff 1.000: 1605.8 M-cycles
step 160: timed with pme grid 84 42 42, coulomb cutoff 1.187: 1471.5 M-cycles
step 240: timed with pme grid 72 36 36, coulomb cutoff 1.384: 1397.4 M-cycles
NOTE: Turning on dynamic load balancing
NOTE: the minimum cell size is smaller than 1.05 times the cell size limit, will not turn on dynamic load balancing
step 320: timed with pme grid 64 32 32, coulomb cutoff 1.557: 1680.1 M-cycles
step 400: timed with pme grid 72 36 36, coulomb cutoff 1.384: 1035.3 M-cycles
step 480: timed with pme grid 80 40 40, coulomb cutoff 1.246: 1084.4 M-cycles
step 560: timed with pme grid 84 42 42, coulomb cutoff 1.187: 1126.3 M-cycles
step 640: timed with pme grid 96 44 44, coulomb cutoff 1.133: 1147.4 M-cycles
step 720: timed with pme grid 72 36 36, coulomb cutoff 1.384: 1224.6 M-cycles
step 800: timed with pme grid 80 40 40, coulomb cutoff 1.246: 997.3 M-cycles
step 880: timed with pme grid 84 42 42, coulomb cutoff 1.187: 1279.8 M-cycles
step 960: timed with pme grid 96 44 44, coulomb cutoff 1.133: 1416.5 M-cycles
step 1040: timed with pme grid 72 36 36, coulomb cutoff 1.384: 1171.5 M-cycles
step 1120: timed with pme grid 80 40 40, coulomb cutoff 1.246: 1393.7 M-cycles
optimal pme grid 80 40 40, coulomb cutoff 1.246
NOTE: DLB can now turn on, when beneficial
NOTE: DLB can now turn on, when beneficial
Also, the notes above could use de-duplication.
Fix two PME DLB trigger issues
Dynamic load balancing got triggered while locked by PME load
balancing, because a check was placed incorrectly.
PME load balancing would never trigger with separate PME ranks
because a comparison was inverted.
Fix DD DLB state issue
The introduction of DLB locking for PME load balancing added another
DLB state, which was stored in a third variable. These variables
were not always all properly checked. Simplified the code by merging
these three state variables into one. In addition, there was a fourth
variable (bGridJump) in gmx_domdec_t; this is replaced by calls to
a function returning whether DLB is on.
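
To illustrate the idea of the change described above, here is a minimal sketch of folding several DLB flags into a single state and answering "is DLB on?" through one function. It is not the actual GROMACS code: the enum values and the functions isDlbOn, lockDlb, unlockDlb and maybeTurnOnDlb, as well as the imbalance threshold parameter, are hypothetical names chosen for this example.

```cpp
// Hypothetical single DLB state that replaces several separate flags.
enum class DlbState
{
    OffForever,           // DLB disabled, may never turn on
    OffCanTurnOn,         // DLB off, may turn on when imbalance is measured
    OffTemporarilyLocked, // DLB off and locked, e.g. during PME load balancing
    On                    // DLB active
};

// One query function instead of a separately stored "DLB is on" flag.
bool isDlbOn(DlbState state)
{
    return state == DlbState::On;
}

// Locking only affects the state from which DLB could otherwise turn on.
DlbState lockDlb(DlbState state)
{
    return state == DlbState::OffCanTurnOn ? DlbState::OffTemporarilyLocked : state;
}

// Unlocking restores the "can turn on" state; other states are unchanged.
DlbState unlockDlb(DlbState state)
{
    return state == DlbState::OffTemporarilyLocked ? DlbState::OffCanTurnOn : state;
}

// DLB turns on only from the unlocked state and only above the threshold,
// so a locked state can never be bypassed by the imbalance check.
DlbState maybeTurnOnDlb(DlbState state, double imbalanceLoss, double threshold)
{
    if (state == DlbState::OffCanTurnOn && imbalanceLoss > threshold)
    {
        return DlbState::On;
    }
    return state;
}
```

With a single state variable, the lock cannot get out of sync with the on/off status, which is the class of bug the commit message describes.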
#4 Updated by Szilárd Páll over 4 years ago
I've attached the input, but there is nothing special about it; it's not even an imbalanced system, just a box of water. The reported issue should be reproducible with any system, as long as enough load imbalance is measured, using the tagged affected version (or the current HEAD of rel 5.1), and it is fixed by Berk's change.
What was not clear to me is whether the "NOTE: the minimum cell size is ..." message is triggered at the same step as the preceding one. Based on this output it has to be triggered either at the same step or the next nstlist step.