Bug #2441
Value of AWH coordinate not set for certain numbers of MPI ranks
Description
I have a simulation with AWH that runs fine for a smaller number of ranks, e.g.: aprun -n 32 -N 32 $gmx mdrun -npme 0 -ntomp 1
but when instead using -n 64
I get:
Simulation instability detected:
MPI rank: 32 (out of 64)
Function: void gmx::CoordState::setCoordValue(const gmx::Grid&, const double*)
Coordinate 1 of an AWH bias has a value 0.000000, which is more than 10 sigma out of the AWH range of [0.250000, 0.600000]. You seem to have an unstable reaction coordinate setup or an unequilibrated system.
Related issues
Associated revisions
History
#1 Updated by Szilárd Páll over 1 year ago
- Status changed from New to Accepted
Reproduced. My first thought was that it might be due to the switch from 2D to 3D domain decomposition, but I've just tried and I observe:
- 32 ranks, npme 0, 8x4x1 DD: working
- 40 ranks, npme 0, 8x5x1 DD: not working
- 64 ranks, npme 16 (same as default), 4x4x3 DD: not working
- 64 ranks, npme 16 (same as default), 12x4x1 DD: not working
#2 Updated by Berk Hess over 1 year ago
- Category set to mdrun
- Assignee set to Berk Hess
This is because with more than 32 PP ranks, the pull code is only active on a subset of the ranks to minimize communication.
It is rather complex to also make the AWH code work on only a subset of the ranks, so I think I will implement the simple solution of turning off the pull sub-communicator when external pull potentials are in use.
#3 Updated by Gerrit Code Review Bot over 1 year ago
Gerrit received a related patchset '1' for Issue #2441.
Uploader: Berk Hess (hess@kth.se)
Change-Id: gromacs~release-2018~I8501024b7961600ec79f3707e239ddf25525aa79
Gerrit URL: https://gerrit.gromacs.org/7665
#4 Updated by Berk Hess over 1 year ago
- Status changed from Accepted to Fix uploaded
- Target version set to 2018.1
#5 Updated by Szilárd Páll over 1 year ago
- Affected version changed from 2018.1 to 2018
Changed the affected version because it's technically 2018, not 2018.1.
Berk Hess wrote:
This is because with #PP-ranks > 32, the pull code is only active on a subset of the ranks to minimize communication.
It is rather complex to also make the AWH code only work on subranks, so I think I will implement the simple solution of turning off the pull sub communicator when we use external pull potentials.
Sounds like a gap to close in integration testing: runs with >32 ranks, including in combination with pull / AWH, should ideally be covered by weekly tests rather than first exercised in production.
#6 Updated by Berk Hess over 1 year ago
- Status changed from Fix uploaded to Resolved
Applied in changeset 97240d254619689932102b5a7b077ff3070188fe.
#7 Updated by Mark Abraham over 1 year ago
- Status changed from Resolved to Closed
#8 Updated by Szilárd Páll over 1 year ago
- Related to Task #2488: use MPI non-blocking collectives to overlap pull comm added
Fix COM pulling with external potential with #ranks>32
With more than 32 PP ranks, the pull code could use only a subset
of the PP ranks. This change forces all ranks to do pulling when
external potentials are present (currently only used by AWH).
Fixes #2441
Change-Id: I8501024b7961600ec79f3707e239ddf25525aa79