Bug #1467
8-way parallel run of small system does not match forces of single-thread run after first step
Description
When checking energy conservation for a pretty small system (~900 atoms) with different free-energy settings, I noticed that the energy conservation appears to depend on the number of cores used, and that this happens for non-free-energy runs too. The attached system is a single ethanol in water, 939 atoms, in a rhombic dodecahedron box with side = 2.4 nm. When running with Verlet kernels, reaction-field, lincs-order=3, and DLB, only simulations with a single thread appear to conserve energy, and with 8 cores the drift is horrible.
Even when limiting it to a single step and running with -reprod and -dlb no, there are large differences in forces after the first step. In this case it only appears to affect the 8-core simulation, though; the others seem to match the single-core run. The error is present for group kernels too, and with PME (although the energy drift is smaller there, likely because the direct-space interactions are of lower magnitude).
This is using the release-4-6 branch of git, and running on amd1.theophys.kth.se with acceleration enabled.
Associated revisions
History
#1 Updated by Berk Hess almost 7 years ago
- Priority changed from High to 6
The issue here is that the domain decomposition code does not check whether the chosen domain decomposition grid ensures that all atoms required for pairs within the cut-off distance can be communicated. With rectangular boxes, or with 3 or more domains in a dimension, the standard check that the cut-off is shorter than half the box size ensures all pairs are available. So this issue only appears for triclinic dimensions with exactly 2 domains. In that case, some non-bonded energies/forces of pairs close to the cut-off distance could be missing, which can lead to silent errors.
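To make the geometry above concrete, here is a minimal Python sketch. It is not the GROMACS implementation or the check added by the fix. It only computes the perpendicular thickness of a domain along each box dimension and flags dimensions with exactly 2 domains where the cut-off is longer than that thickness. The helper names (`slab_thickness`, `dims_at_risk`), the 2x2x2 grid, and the 1.0 nm cut-off are hypothetical; the box vectors assume the usual square-xy rhombic dodecahedron convention and that "side" means the image distance d = 2.4 nm.

```python
import math


def cross(u, v):
    """Cross product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def norm(u):
    """Euclidean length of a 3-vector."""
    return math.sqrt(sum(x * x for x in u))


def slab_thickness(box, n_domains):
    """Perpendicular thickness of one domain along each box dimension.

    The distance between opposite box faces is volume / face area;
    dividing by the number of domains gives the thickness of one slab.
    """
    a, b, c = box
    volume = abs(sum(x * y for x, y in zip(a, cross(b, c))))
    face_normals = (cross(b, c), cross(c, a), cross(a, b))
    return [volume / norm(f) / n for f, n in zip(face_normals, n_domains)]


def dims_at_risk(box, n_domains, cutoff):
    """Dimensions with exactly 2 domains whose slab is thinner than the cut-off."""
    thickness = slab_thickness(box, n_domains)
    return [dim for dim in range(3)
            if n_domains[dim] == 2 and cutoff > thickness[dim]]


# Assumed box: rhombic dodecahedron (square xy) with image distance d.
d = 2.4
box = ((d, 0.0, 0.0),
       (0.0, d, 0.0),
       (d / 2, d / 2, d * math.sqrt(2) / 2))

cutoff = 1.0  # hypothetical cut-off in nm, shorter than half of d
print(slab_thickness(box, (2, 2, 2)))
print(dims_at_risk(box, (2, 2, 2), cutoff))
```

Under these assumptions the cut-off passes a half-box test (1.0 nm < d/2 = 1.2 nm), yet every slab of the 2x2x2 grid is thinner than the cut-off, which is the kind of situation the comment above describes.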
#2 Updated by Gerrit Code Review Bot almost 7 years ago
Gerrit received a related patchset '1' for Issue #1467.
Uploader: Berk Hess (hess@kth.se)
Change-Id: Id7e16d7f8fa0796d6adcf48ad6e8bbb0b88039ff
Gerrit URL: https://gerrit.gromacs.org/3285
#3 Updated by Berk Hess almost 7 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
Applied in changeset a586b4168d35113cb5e3f3315aa73bebcf20b1c3.
#4 Updated by Roland Schulz almost 7 years ago
- Status changed from Resolved to Closed
Adds cut-off checks for triclinic domain decomposition
With domain decomposition and 2 decomposition cells in a triclinic
dimension, the cut-off could be longer than the size of the
communicated domains. This could lead to some pairs close to the
cut-off distance being ignored in the force/energy calculations.
Fixes #1467
Change-Id: Id7e16d7f8fa0796d6adcf48ad6e8bbb0b88039ff