8-way parallel run of small system does not match forces of single-thread run after first step
While checking energy conservation for a fairly small system (~900 atoms) with different free-energy settings, I noticed that energy conservation appears to depend on the number of cores used, and that this also happens for non-free-energy runs. The attached system is a single ethanol in water, 939 atoms, in a rhombic dodecahedron box with side=2.4 nm. When running with Verlet kernels, reaction-field, lincs-order=3, and DLB, only simulations with a single thread appear to conserve energy; with 8 cores the drift is horrible.
Even when limiting it to a single step and running with -reprod and -dlb no, there are large differences in forces after the first step. In this case it only appears to affect the 8-core simulation, though; the others seem to match the single-core run. The error is present for group kernels too, and with PME (although the energy drift is smaller there, likely because the direct-space interactions have lower magnitude).
This is using the release-4-6 branch of git, and running on amd1.theophys.kth.se with acceleration enabled.
Adds cut-off checks for triclinic domain decomposition
With domain decomposition and 2 decomposition cells in a triclinic
dimension, the cut-off could be longer than the size of the
communicated domains. This could lead to some pairs close to the cut-off
distance being ignored in the force/energy calculations.
#1 Updated by Berk Hess almost 5 years ago
- Priority changed from High to 6
The issue here is that the domain decomposition code does not check whether the chosen domain decomposition grid ensures that all atoms required for pairs within the cut-off distance can be communicated. With rectangular boxes or with 3 or more domains in a dimension, the standard check of the cut-off being shorter than half the box size ensures all pairs are available. So this issue only appears for triclinic dimensions with exactly 2 domains. In that case, some non-bonded energies/forces of pairs close to the cut-off distance could be missing, which can lead to silent errors.
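To illustrate which grids are affected, here is a minimal Python sketch, not the actual GROMACS code. The function names are hypothetical. It assumes the usual lower-triangular box convention and treats dimension d as triclinic when a later box vector has a nonzero component along d. The example box uses the xy-square rhombic dodecahedron vectors for a 2.4 nm side, which is also an assumption about the attached system, and the 2x2x2 grid is only one possible grid for 8 ranks.

```python
import math


def is_triclinic_dim(box, d):
    # Assumption: box is lower triangular (rows are box vectors a, b, c).
    # Dimension d counts as triclinic if a later vector is skewed along d.
    return any(box[j][d] != 0.0 for j in range(d + 1, 3))


def dims_needing_extra_check(box, n_domains):
    # The half-box cut-off check is not sufficient only for
    # triclinic dimensions that are split into exactly 2 domains.
    return [d for d in range(3)
            if n_domains[d] == 2 and is_triclinic_dim(box, d)]


side = 2.4  # nm
# Assumed xy-square rhombic dodecahedron box vectors.
rhombic_dodecahedron = [
    [side, 0.0, 0.0],
    [0.0, side, 0.0],
    [side / 2, side / 2, side * math.sqrt(2) / 2],
]
rectangular = [
    [side, 0.0, 0.0],
    [0.0, side, 0.0],
    [0.0, 0.0, side],
]

# Hypothetical 2x2x2 grid: x and y are triclinic and split in 2.
flagged = dims_needing_extra_check(rhombic_dodecahedron, (2, 2, 2))
print("dimensions needing an extra cut-off check:", flagged)
```

Under these assumptions, a rectangular box or a grid with 3 or more domains in the triclinic dimensions would yield an empty list, which matches the cases where the standard half-box check is enough.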