Bug #2397
Difference between single rank and multiple rank when pulling using constraints relative to rest of the system
Description
I noticed that I get completely different results (e.g. pull forces and temperature) when a run is started on a GPU server, with NB and PME running on the GPU, and then continued from a restart file on the PDC Beskow supercomputer running MPI (no GPUs).
Continuing from a checkpoint on different hardware is not expected to give binary-identical results, but in this case the difference is remarkable. The pull forces and temperature fluctuations are a lot higher on Beskow. I guess something is going wrong, and I guess the output from the GPU server is the correct one, based only on the fact that it is more stable.
I'm attaching the pull force and temperature output and the log file from a run where the first 100 ps are run on a GPU server, the next 200 ps on Beskow and then 200 ps on the GPU server again.
Associated revisions
Fixes a bug in pull group size calculation
The wrong atom indexes were used when checking the coordinates
of atoms in a pull group (commit aa102e691d59b4de37c8e4).
That led to false reports of a too large pull group
(and presumably false negatives). This fixes the problem.
Refs #2397
Change-Id: Ib7d7e648204c0d1b219714610de7fb5842713048
History
#1 Updated by Magnus Lundborg about 3 years ago
Would it be possible that domain decomposition might affect the center of mass positions and thereby upset pulling using constraints? I haven't seen this when pulling with an absolute reference - only related to the rest of the system.
#2 Updated by Berk Hess about 3 years ago
Could it be that with MPI the global communication frequency, and thus also the COM removal frequency, is automatically increased?
#3 Updated by Magnus Lundborg about 3 years ago
- Subject changed from Difference between MPI and thread-MPI version pulling using constraints relative to rest of the system to Difference between single rank and multiple rank when pulling using constraints relative to rest of the system
The problem was identified to be related to single rank vs multiple rank. Subject updated.
#4 Updated by Mark Abraham about 3 years ago
- Assignee set to Berk Hess
- Target version changed from 2018.1 to 2018.2
Berk is still looking into this
#5 Updated by Mark Abraham over 2 years ago
Have we progressed here?
#6 Updated by Mark Abraham over 2 years ago
- Target version changed from 2018.2 to 2018.3
#7 Updated by Berk Hess over 2 years ago
- Status changed from New to Feedback wanted
I don't recall the status of this.
Magnus, is this still an issue, or did this somehow get resolved?
#8 Updated by Magnus Lundborg over 2 years ago
The behavior still remains with the current master branch. It is possible that this is due to a too large pull group (in this case I am pulling two molecules relative to the rest of the system) causing problems with the center of mass on multiple ranks.
#9 Updated by Paul Bauer over 2 years ago
- Target version changed from 2018.3 to 2018.4
Hello, if the issue is still present I'll move the target to the next point release, if people think it might be possible to resolve it by then.
#10 Updated by Berk Hess over 2 years ago
The issue is that the pull group contains atoms further away than half the box size from the pbc atom. Such cases should not be supported.
I thought there was a check for this, but there is not. Should I add a check for release-2018?
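To illustrate why such a group is problematic, here is a minimal sketch in Python (not GROMACS code). It assumes a rectangular box and the usual approach of unwrapping each atom to the periodic image nearest to the reference (pbc) atom before averaging. The function names `min_image` and `com_with_pbc_atom` are hypothetical and chosen only for this example.

```python
def min_image(dx, box_length):
    """Return the periodic image of the displacement dx that is closest to zero."""
    return dx - box_length * round(dx / box_length)


def com_with_pbc_atom(coords, masses, pbc_index, box):
    """Center of mass of a group in a rectangular box.

    Each atom is unwrapped to the image nearest to the reference atom
    (coords[pbc_index]) before the mass-weighted average is taken.
    """
    ref = coords[pbc_index]
    total_mass = sum(masses)
    com = []
    for d in range(3):
        weighted = sum(
            m * min_image(x[d] - ref[d], box[d]) for x, m in zip(coords, masses)
        )
        com.append(ref[d] + weighted / total_mass)
    return tuple(com)


# A group spanning 6 length units along x in a box of length 10.
box = (10.0, 10.0, 10.0)
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
masses = [1.0, 1.0, 1.0, 1.0]

# Reference atom at the edge of the group: the atom at x = 6 is more than
# half a box away, so it is wrapped to x = -4 and the result is wrong.
print(com_with_pbc_atom(coords, masses, 0, box)[0])  # 0.5

# Reference atom near the middle: all atoms are within half a box.
print(com_with_pbc_atom(coords, masses, 1, box)[0])  # 3.0
```

In this toy case the geometric center is at x = 3, but with the reference atom at the edge of the group the computed value is 0.5, because one atom is silently assigned to the wrong periodic image.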
#11 Updated by Magnus Lundborg over 2 years ago
I guess such a check could be a good idea. But what if atoms move to be further away from the PBC than half the box size during the simulation? Then there would still be a problem, I guess, and I don't think it would be fixed by https://gerrit.gromacs.org/#/c/8060/ .
#12 Updated by Gerrit Code Review Bot over 2 years ago
Gerrit received a related patchset '1' for Issue #2397.
Uploader: Berk Hess (hess@kth.se)
Change-Id: gromacs~release-2018~Ida7004624a470981d9ce22a1ef921daebad83364
Gerrit URL: https://gerrit.gromacs.org/8193
#13 Updated by Berk Hess over 2 years ago
Checking every step is too expensive. We could check periodically in mdrun.
#14 Updated by Berk Hess over 2 years ago
Still, periodic checks should not be added in a minor patch version.
#15 Updated by Erik Lindahl over 2 years ago
If it causes silent incorrect results I think we need to add it, patch release or not?
#16 Updated by Berk Hess over 2 years ago
It might be rather tricky to avoid both false positives and false negatives.
By far the most common case is that the group is already too large at the start of the run. I would not want to put mdrun checks into a patch release.
#17 Updated by Gerrit Code Review Bot over 2 years ago
Gerrit received a related patchset '1' for Issue #2397.
Uploader: Magnus Lundborg (magnus.lundborg@scilifelab.se)
Change-Id: gromacs~master~Ib7d7e648204c0d1b219714610de7fb5842713048
Gerrit URL: https://gerrit.gromacs.org/8406
#18 Updated by Paul Bauer over 2 years ago
Is this fixed as of now?
#19 Updated by Berk Hess over 2 years ago
- Status changed from Feedback wanted to Resolved
In release-2018 a check has been added to grompp that should already catch the most common cases of this issue.
For the 2019 release the check is much tighter and there is a new PBC option that works much better.
#20 Updated by Paul Bauer over 2 years ago
- Status changed from Resolved to Closed
ok, good enough for me
Add check for pull group PBC to grompp
Pull groups that use a reference atom for periodic boundary treatment
should have all their atoms well within half the box size of this
reference. When this is not the case, grompp will now issue a warning.
Refs #2397
Change-Id: Ida7004624a470981d9ce22a1ef921daebad83364
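
The idea of requiring atoms to be "well within" half the box size can be sketched as follows. This is an illustrative Python example under stated assumptions, not the check that was added to grompp: it assumes a rectangular box, and the function name `atoms_outside_pbc_margin` and the `margin` value of 0.9 are hypothetical.

```python
def min_image(dx, box_length):
    """Return the periodic image of the displacement dx that is closest to zero."""
    return dx - box_length * round(dx / box_length)


def atoms_outside_pbc_margin(coords, pbc_index, box, margin=0.9):
    """Return indices of atoms that are not well within half the box of the
    reference atom.

    An atom is flagged when, in any dimension, its minimum-image distance
    to the reference atom exceeds margin * (half the box length).
    The margin value is an assumption made for this sketch.
    """
    ref = coords[pbc_index]
    flagged = []
    for i, x in enumerate(coords):
        for d in range(3):
            if abs(min_image(x[d] - ref[d], box[d])) > margin * 0.5 * box[d]:
                flagged.append(i)
                break
    return flagged


box = (10.0, 10.0, 10.0)

# The atom at x = 4.8 is closer than half the box (5.0) but outside the margin (4.5).
near_limit = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.8, 0.0, 0.0)]
for index in atoms_outside_pbc_margin(near_limit, 0, box):
    print(f"warning: atom {index} is not well within half the box of the pbc atom")
```

A margin below 1 is needed because a minimum-image distance can never exceed half the box. For the same reason, an atom that is already far beyond half the box (for example at x = 6 in this box) wraps back to a distance of 4 and is not flagged by this simple test, which is one way such a check can produce false negatives.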