Bug #2790

simulation error: update groups moved too far

Added by Szilárd Páll 12 months ago. Updated 11 months ago.

Status: Closed
Priority: High
Assignee:
Category: mdrun
Target version:
Affected version - extra info: 2019-rc1-dev-20181203-f942ad3
Affected version:
Difficulty: uncategorized

Description

Simulation inputs: an ion channel system (2 fs time step, no virtual sites) on an i9-7920X CPU + a GV100 GPU.

gmx mdrun -ntmpi 12 -ntomp 1 -s topol.tpr -v -quiet -noconfout -npme 0 -pin on  -nb gpu -nsteps 10000 -resetstep 8000  -pinstride 2  -pme cpu -tunepme -bonded cpu
Compiled SIMD: AVX2_256, but for this host/run AVX_512 might be better (see
log).
Reading file topol.tpr, VERSION 5.1-dev-20150218-4c60631 (single precision)
Note: file tpx version 100, software tpx version 116
Overriding nsteps with value passed on the command line: 10000 steps, 20 ps
Changing nstlist from 10 to 100, rlist from 1 to 1.132

Using 12 MPI threads
Using 1 OpenMP thread per tMPI thread

On host skylake-x-gpu01 1 GPU auto-selected for this run.
Mapping of GPU IDs to the 12 GPU tasks in the 12 ranks on this node:
  PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU   

NOTE: Your choice of number of MPI ranks and amount of resources results in using 1 OpenMP threads per rank, which is most likely inefficient. The optimum is usually between 2 and 6 threads per rank.

NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'Protein'
10000 steps,     20.0 ps.
step  200: timed with pme grid 96 96 128, coulomb cutoff 1.000: 1867.9 M-cycles
step  400: timed with pme grid 84 84 112, coulomb cutoff 1.116: 1955.0 M-cycles
step  600: timed with pme grid 72 72 96, coulomb cutoff 1.302: 1768.6 M-cycles
step  800: timed with pme grid 64 64 84, coulomb cutoff 1.472: 1611.7 M-cycles
step  800: the maximum allowed grid scaling limits the PME load balancing to a coulomb cut-off of 1.563
step 1000: timed with pme grid 60 60 80, coulomb cutoff 1.563: 1557.1 M-cycles
step 1200: timed with pme grid 64 64 80, coulomb cutoff 1.545: 1582.9 M-cycles
step 1400: timed with pme grid 64 64 84, coulomb cutoff 1.472: 1606.8 M-cycles
step 1600: timed with pme grid 64 64 96, coulomb cutoff 1.465: 1672.3 M-cycles
step 1800: timed with pme grid 80 80 96, coulomb cutoff 1.288: 1813.9 M-cycles
step 2000: timed with pme grid 80 80 100, coulomb cutoff 1.236: 1624.0 M-cycles
step 2200: timed with pme grid 80 80 104, coulomb cutoff 1.189: 1724.9 M-cycles
step 2400: timed with pme grid 80 80 108, coulomb cutoff 1.172: 1665.6 M-cycles
step 2600: timed with pme grid 84 84 108, coulomb cutoff 1.145: 1789.4 M-cycles
              optimal pme grid 60 60 80, coulomb cutoff 1.563

NOTE: DLB can now turn on, when beneficial
step 7900, remaining wall clock time:    15 s          vol 0.55  imb F  1%
step 8000: resetting all time and cycle counters
step 8400, remaining wall clock time:    11 s          vol 0.56  imb F  2%
Step 8500:
The update group starting at atom 139493 moved more than the distance allowed by the domain decomposition (1.340496) in direction Z
distance out of cell -1.394382
Old coordinates:    8.020    9.521    8.314
New coordinates:    8.020    9.521    8.314
Old cell boundaries in direction Z:    9.708   14.835
New cell boundaries in direction Z:    9.708   14.835

-------------------------------------------------------
Program:     gmx mdrun, version 2019-rc1-dev-20181203-f942ad3
Source file: src/gromacs/domdec/redistribute.cpp (line 226)
MPI rank:    2 (out of 12)

Fatal error:
One or more atoms moved too far between two domain decomposition steps.
This usually means that your system is not well equilibrated

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
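
For readers unfamiliar with this check: the fatal error comes from a per-dimension consistency test during domain-decomposition repartitioning, which compares how far an update group's center of geometry lies outside its new cell against the maximum displacement the decomposition can tolerate. The following is a minimal C++ sketch of that kind of test; the names (CellBounds, distanceOutOfCell, checkGroupDisplacement) and the structure are assumptions for illustration, not the code in src/gromacs/domdec/redistribute.cpp.

#include <cmath>
#include <cstdio>
#include <cstdlib>

// Hypothetical, simplified illustration of the "moved too far" check performed
// per dimension during DD redistribution; names and structure are assumptions,
// not the real GROMACS implementation.
struct CellBounds
{
    double lower; // lower cell boundary in this dimension (nm)
    double upper; // upper cell boundary in this dimension (nm)
};

// How far a center of geometry lies outside the cell along one dimension
// (0 if inside; negative below the lower bound, positive above the upper).
static double distanceOutOfCell(double cog, const CellBounds& cell)
{
    if (cog < cell.lower)
    {
        return cog - cell.lower;
    }
    if (cog > cell.upper)
    {
        return cog - cell.upper;
    }
    return 0.0;
}

// Fatal error if an update group's center of geometry lies further outside its
// new cell than the decomposition allows; 'limit' plays the role of the
// 1.340496 nm reported above.
static void checkGroupDisplacement(double cogNew, const CellBounds& cellNew,
                                   double limit, int firstAtom, char dim)
{
    const double dist = distanceOutOfCell(cogNew, cellNew);
    if (std::fabs(dist) > limit)
    {
        std::fprintf(stderr,
                     "The update group starting at atom %d moved more than %.6f in direction %c\n"
                     "distance out of cell %.6f\n",
                     firstAtom, limit, dim, dist);
        std::exit(EXIT_FAILURE);
    }
}

Note that in the output above the old and new coordinates, and the old and new cell boundaries, are identical; this fits the diagnosis in the comments below that the group had been assigned to the wrong domain rather than having actually moved too fast.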

Related issues

Related to GROMACS - Bug #2712: segv in constraints (Closed)

Associated revisions

Revision a4ce3b95 (diff)
Added by Berk Hess 12 months ago

Fix update groups with 2D/3D DLB

With a staggered DD grid update groups could end up in the wrong DD
cell. This caused a fatal error (no incorrect results).
This change reverts most of d29cb9da, which was a failed fix for #2712.

Also added a few const qualifiers and renamed pos to cog for clarity.

Refs #2712
Fixes #2790

Change-Id: I1a589cccb6ea7048fb66ae867716549a1a615b7f
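
To illustrate what "staggered DD grid" means here: with dynamic load balancing, the cell boundaries along one dimension differ between the rows/columns of the other dimensions, so the destination cell for an update group's center of geometry must be looked up against the boundaries of the specific column it falls in. The sketch below is an assumption-based illustration of that lookup, not the code in commit a4ce3b95; all names in it are hypothetical.

#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical 2D staggered grid: the y-boundaries differ per x-column, as they
// do under DLB. This is an illustration of the concept only, not GROMACS code.
struct StaggeredGrid2D
{
    std::vector<double>              xBounds; // global x boundaries, size nx+1
    std::vector<std::vector<double>> yBounds; // per-x-column y boundaries, each size ny+1
};

// Index of the interval [bounds[i], bounds[i+1]) that contains 'value'
// (clamped to the first/last interval).
static std::size_t findInterval(const std::vector<double>& bounds, double value)
{
    std::size_t cell = 0;
    while (cell + 2 < bounds.size() && value >= bounds[cell + 1])
    {
        cell++;
    }
    return cell;
}

// Assign an update group's center of geometry (COG) to a cell. The y lookup
// must use the boundaries of the x-column the COG falls in.
static std::pair<std::size_t, std::size_t> cellForCog(const StaggeredGrid2D& grid,
                                                      double cogX, double cogY)
{
    const std::size_t ix = findInterval(grid.xBounds, cogX);
    const std::size_t iy = findInterval(grid.yBounds[ix], cogY);
    return { ix, iy };
}

Using a single set of y-boundaries for every column can place a group in a cell whose actual (staggered) boundaries do not contain it, which is the inconsistent state the fatal error above reports. The rename of pos to cog mentioned in the commit message presumably emphasizes that it is the group's center of geometry, not an individual atom position, that determines the assignment.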

History

#1 Updated by Berk Hess 12 months ago

  • Status changed from New to In Progress
  • Assignee set to Berk Hess
  • Target version set to 2019-rc1

The update group indices have become inconsistent.

#2 Updated by Berk Hess 12 months ago

  • Priority changed from Normal to High

The update groups seem to be consistent.
I think it's rather that update groups end up on the wrong domain during repartitioning (only 2D/3D?).
Bumping to high as this bug likely makes it impossible to run at medium/high parallelization, although all output is very likely correct.

#3 Updated by Berk Hess 12 months ago

  • Related to Bug #2712: segv in constraints added

#4 Updated by Berk Hess 12 months ago

It is indeed update groups ending up on the wrong domain with 2D/3D decomposition and DLB. My earlier fix for #2712 was incorrect.

#5 Updated by Gerrit Code Review Bot 12 months ago

Gerrit received a related patchset '1' for Issue #2790.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2019~I1a589cccb6ea7048fb66ae867716549a1a615b7f
Gerrit URL: https://gerrit.gromacs.org/8794

#6 Updated by Berk Hess 12 months ago

  • Status changed from In Progress to Fix uploaded

#7 Updated by Berk Hess 12 months ago

  • Status changed from Fix uploaded to Resolved

#8 Updated by Paul Bauer 11 months ago

  • Status changed from Resolved to Closed
