Bug #1467

8-way parallel run of small system does not match forces of single-thread run after first step

Added by Erik Lindahl over 3 years ago. Updated over 3 years ago.

Status: Closed
Priority: High
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

When checking energy conservation for a fairly small system (~900 atoms) with different free-energy settings, I noticed that energy conservation appears to depend on the number of cores used, and that this is present for non-free-energy runs too. The attached system is a single ethanol in water, 939 atoms, in a rhombic dodecahedron box with side = 2.4 nm. When running with Verlet kernels, reaction-field, lincs-order=3, and DLB, only simulations with a single thread appear to conserve energy, and with 8 cores the drift is horrible.

Even when limiting it to a single step and running with -reprod and -dlb no, there are large differences in forces after the first step. In this case it only appears to affect the 8-core simulation, though - the others seem to match the single-core run. The error is present for group kernels too, and with PME (although the energy drift is smaller there, likely because the direct-space interactions are of lower magnitude).
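A comparison like the one described above can be scripted once per-atom forces from the two runs are available as lists of (fx, fy, fz) tuples, however they were extracted. The following is only a minimal sketch: the function name `max_force_deviation` and the force values are hypothetical illustrations, not data from the attached files.

```python
import math


def max_force_deviation(forces_a, forces_b):
    """Return (atom_index, deviation) for the atom whose force vectors differ most.

    forces_a and forces_b are equal-length sequences of (fx, fy, fz) tuples,
    for example from a single-thread run and an 8-rank run after the same step.
    Returns (-1, 0.0) if all forces are identical.
    """
    if len(forces_a) != len(forces_b):
        raise ValueError("both runs must contain the same number of atoms")
    worst_atom, worst = -1, 0.0
    for i, (fa, fb) in enumerate(zip(forces_a, forces_b)):
        diff = math.dist(fa, fb)  # Euclidean norm of the force difference
        if diff > worst:
            worst_atom, worst = i, diff
    return worst_atom, worst


# Hypothetical values for illustration only.
serial = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
parallel = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.5)]
print(max_force_deviation(serial, parallel))
```

Reporting the atom index along with the deviation makes it easier to see whether the mismatching atoms belong to pairs close to the cut-off distance.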

This is using the release-4-6 branch of git, and running on amd1.theophys.kth.se with acceleration enabled.

dd_econs_bug.tgz - All input files (and output from running on amd1.theophys.kth.se) (380 KB) Erik Lindahl, 03/23/2014 02:57 PM

Associated revisions

Revision a586b416 (diff)
Added by Berk Hess over 3 years ago

Adds cut-off checks for triclinic domain decomposition

With domain decomposition and 2 decomposition cells in a triclinic
dimension, the cut-off could be longer than the size of the
communicated domains. This could lead to some pairs close to the cut-off
distance being ignored in the force/energy calculations.
Fixes #1467

Change-Id: Id7e16d7f8fa0796d6adcf48ad6e8bbb0b88039ff

History

#1 Updated by Berk Hess over 3 years ago

  • Priority changed from High to 6

The issue here is that the domain decomposition code does not check whether the chosen domain decomposition grid ensures that all atoms required for pairs within the cut-off distance can be communicated. With rectangular boxes, or with 3 or more domains in a dimension, the standard check that the cut-off is shorter than half the box size ensures all pairs are available. So this issue only appears for triclinic dimensions with exactly 2 domains. In that case, non-bonded energies/forces for some pairs close to the cut-off distance could be missing, which can lead to silent errors.
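The condition described above can be written down as a small predicate. This is only a sketch of the stated condition, not the code from the fix: the function name, the tuple-based grid representation, and the example grids and triclinic flags are all assumptions made for illustration.

```python
def dims_needing_extra_cutoff_check(grid, triclinic):
    """Return the dimensions where the half-box cut-off test is not sufficient.

    grid      -- number of domains per dimension, e.g. (2, 2, 2)
    triclinic -- per-dimension flags, True where the dimension is triclinic

    Per the description above, only triclinic dimensions with exactly
    2 domains are affected; rectangular dimensions and dimensions with
    3 or more domains are covered by the standard half-box check.
    """
    if len(grid) != len(triclinic):
        raise ValueError("grid and triclinic must have the same length")
    return [d for d, (n, tric) in enumerate(zip(grid, triclinic))
            if tric and n == 2]


# Hypothetical grids and flags for illustration only.
print(dims_needing_extra_cutoff_check((2, 2, 2), (True, True, False)))
print(dims_needing_extra_cutoff_check((4, 1, 1), (True, True, False)))
print(dims_needing_extra_cutoff_check((2, 2, 2), (False, False, False)))
```

A non-empty result only marks the dimensions where an additional cut-off check is needed; the actual geometric test lives in the patch referenced above.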

#2 Updated by Gerrit Code Review Bot over 3 years ago

Gerrit received a related patchset '1' for Issue #1467.
Uploader: Berk Hess
Change-Id: Id7e16d7f8fa0796d6adcf48ad6e8bbb0b88039ff
Gerrit URL: https://gerrit.gromacs.org/3285

#3 Updated by Berk Hess over 3 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

#4 Updated by Roland Schulz over 3 years ago

  • Status changed from Resolved to Closed
