Bug #2441

Value of AWH coordinate not set for certain numbers of MPI ranks

Added by Viveca Lindahl about 1 year ago. Updated about 1 year ago.

Status: Closed
Priority: Normal
Assignee:
Category: mdrun
Target version:
Affected version - extra info: GROMACS version: 2018.1-dev-20180306-33093601f
Affected version:
Difficulty: uncategorized

Description

I have a simulation with AWH that runs fine for a smaller number of ranks, e.g.:

aprun -n 32 -N 32 $gmx mdrun -npme 0 -ntomp 1

but when instead using -n 64 I get:

Function:    void gmx::CoordState::setCoordValue(const gmx::Grid&, const double*)
MPI rank:    32 (out of 64)

Simulation instability detected:
Coordinate 1 of an AWH bias has a value 0.000000 which is more than 10 sigma
out of the AWH range of [0.250000, 0.600000]. You seem to have an unstable
reaction coordinate setup or an unequilibrated system.

topol.tpr (1.35 MB) - Viveca Lindahl, 03/11/2018 11:50 AM

Related issues

Related to GROMACS - Task #2488: use MPI non-blocking collectives to overlap pull comm (New)

Associated revisions

Revision 97240d25 (diff)
Added by Berk Hess about 1 year ago

Fix COM pulling with external potential with #ranks>32

With more than 32 PP-ranks, the pull code could use only a subset
of the PP-ranks. This change forces all ranks to do pulling when
external potentials are present (currently only used by AWH).

Fixes #2441

Change-Id: I8501024b7961600ec79f3707e239ddf25525aa79

History

#1 Updated by Szilárd Páll about 1 year ago

  • Status changed from New to Accepted

Reproduced. My first thought was that it might be due to the switch from 2D to 3D decomposition, but I've just tried it and I observe:
- 32 ranks npme 0, 8x4x1 DD working
- 40 ranks npme 0, 8x5x1 DD not working
- 64 ranks npme 16 (same as default), 4x4x3 DD not working
- 64 ranks npme 16 (same as default), 12x4x1 DD not working

#2 Updated by Berk Hess about 1 year ago

  • Category set to mdrun
  • Assignee set to Berk Hess

This is because with #PP-ranks > 32, the pull code is only active on a subset of the ranks to minimize communication.
It is rather complex to also make the AWH code work on only a subset of the ranks, so I think I will implement the simple solution of turning off the pull sub-communicator when we use external pull potentials.
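
For illustration, a minimal sketch of the decision described above, assuming a hypothetical helper name and the 32-rank threshold mentioned in this report (this is not the actual GROMACS source):

#include <cassert>

// Hypothetical sketch of the fix described above: with many PP ranks the pull
// code normally restricts communication to a subset of the ranks, but an
// external pull potential (e.g. AWH) needs the pull coordinate value on every
// PP rank, so the sub-communicator optimization must be skipped in that case.
static bool usePullRankSubset(int numPPRanks, bool haveExternalPullPotential)
{
    const int rankThreshold = 32; // assumed threshold, matching this report

    // Only restrict pulling to a subset of ranks when no external potential
    // requires the coordinate value on all ranks.
    return numPPRanks > rankThreshold && !haveExternalPullPotential;
}

int main()
{
    // 64 PP ranks with AWH: all ranks must participate in pulling.
    assert(!usePullRankSubset(64, true));
    // 64 PP ranks without external potentials: the subset optimization is fine.
    assert(usePullRankSubset(64, false));
    return 0;
}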

#3 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related patchset '1' for Issue #2441.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2018~I8501024b7961600ec79f3707e239ddf25525aa79
Gerrit URL: https://gerrit.gromacs.org/7665

#4 Updated by Berk Hess about 1 year ago

  • Status changed from Accepted to Fix uploaded
  • Target version set to 2018.1

#5 Updated by Szilárd Páll about 1 year ago

  • Affected version changed from 2018.1 to 2018

Changed affected version because it's technically not 2018.1.

Berk Hess wrote:

This is because with #PP-ranks > 32, the pull code is only active on a subset of the ranks to minimize communication.
It is rather complex to also make the AWH code work on only a subset of the ranks, so I think I will implement the simple solution of turning off the pull sub-communicator when we use external pull potentials.

Sounds like a gap to close in integration testing. Runs with >32 ranks, even when combined with pull / AWH, should ideally be covered by weekly tests rather than be tested for the first time in production.

#6 Updated by Berk Hess about 1 year ago

  • Status changed from Fix uploaded to Resolved

#7 Updated by Mark Abraham about 1 year ago

  • Status changed from Resolved to Closed

#8 Updated by Szilárd Páll 11 months ago

  • Related to Task #2488: use MPI non-blocking collectives to overlap pull comm added
