Bug #2397

Difference between single rank and multiple rank when pulling using constraints relative to rest of the system

Added by Magnus Lundborg 10 months ago. Updated 12 days ago.

Status: Closed
Priority: Normal
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

I noticed that I get completely different results (e.g. pull forces and temperature) when a run is started on a GPU server, with NB and PME running on the GPU, and then continued from a restart file on the PDC Beskow supercomputer running MPI (no GPUs).

Continuing from a checkpoint on different hardware is not expected to give binary-identical results, but in this case the difference is remarkable. The pull force and temperature fluctuations are much higher on Beskow. I guess something is going wrong, and I guess the output from the GPU server is the correct one, based only on the fact that it is more stable.

I'm attaching the pull force and temperature output and the log file from a run where the first 100 ps are run on a GPU server, the next 200 ps on Beskow and then 200 ps on the GPU server again.

pullf.xvg (64.1 KB) Magnus Lundborg, 02/02/2018 02:22 PM
temperature.xvg (929 Bytes) Magnus Lundborg, 02/02/2018 02:22 PM
md.log (55.4 KB) Magnus Lundborg, 02/02/2018 02:22 PM
topol.tpr (1.19 MB) Magnus Lundborg, 02/02/2018 02:22 PM

Associated revisions

Revision aa102e69 (diff)
Added by Berk Hess 3 months ago

Add check for pull group PBC to grompp

Pull groups that use a reference atom for periodic boundary treatment
should have all their atoms well within half the box size of this
reference. When this is not the case, grompp will now issue a warning.

Refs #2397

Change-Id: Ida7004624a470981d9ce22a1ef921daebad83364

Revision 2a77f97d (diff)
Added by Magnus Lundborg 2 months ago

Fixes a bug in pull group size calculation

The wrong atom indexes were used when checking the coordinates
of atoms in a pull group (commit aa102e691d59b4de37c8e4).
That led to false reports of a too large pull group
(and presumably false negatives). This fixes the problem.

Refs #2397

Change-Id: Ib7d7e648204c0d1b219714610de7fb5842713048

History

#1 Updated by Magnus Lundborg 10 months ago

Would it be possible that domain decomposition might affect the center of mass positions and thereby upset pulling using constraints? I haven't seen this when pulling with an absolute reference - only related to the rest of the system.

#2 Updated by Berk Hess 10 months ago

Could it be that with MPI the global communication frequency, and thus also the COM removal frequency, is automatically increased?

#3 Updated by Magnus Lundborg 10 months ago

  • Subject changed from Difference between MPI and thread-MPI version pulling using constraints relative to rest of the system to Difference between single rank and multiple rank when pulling using constraints relative to rest of the system

The problem was identified to be related to single rank vs multiple rank. Subject updated.

#4 Updated by Mark Abraham 9 months ago

  • Assignee set to Berk Hess
  • Target version changed from 2018.1 to 2018.2

Berk is still looking into this

#5 Updated by Mark Abraham 6 months ago

Have we progressed here?

#6 Updated by Mark Abraham 5 months ago

  • Target version changed from 2018.2 to 2018.3

#7 Updated by Berk Hess 3 months ago

  • Status changed from New to Feedback wanted

I don't recall the status of this.
Magnus, is this still an issue, or did this somehow get resolved?

#8 Updated by Magnus Lundborg 3 months ago

The behavior still remains with the current master branch. It is possible that this is due to a pull group that is too large (in this case I am pulling two molecules relative to the rest of the system), causing problems with the center of mass on multiple ranks.

#9 Updated by Paul Bauer 3 months ago

  • Target version changed from 2018.3 to 2018.4

Hello, if the issue is still present I'll move the target to the next point release, if people think it might be possible to resolve it by then.

#10 Updated by Berk Hess 3 months ago

The issue is that the pull group contains atoms further away than half the box size from the pbc atom. Such cases should not be supported.
I thought there was a check for this, but there is not. Should I add a check for release-2018?
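
A minimal one-dimensional sketch of the condition described here, assuming a rectangular box and coordinates that have already been made whole. It is not taken from the GROMACS source: the function names and the `margin` parameter are hypothetical and only illustrate why a minimum-image offset to the pbc atom gives the wrong center of mass once an atom is more than half a box length away.

```python
def minimum_image(dx, box_length):
    """Shift dx by whole box lengths to the nearest periodic image."""
    return dx - box_length * round(dx / box_length)


def com_via_reference(positions, masses, ref_index, box_length):
    """Center of mass built from minimum-image offsets to a reference atom."""
    ref = positions[ref_index]
    weighted = sum(m * minimum_image(x - ref, box_length)
                   for x, m in zip(positions, masses))
    return ref + weighted / sum(masses)


def group_too_large(positions, ref_index, box_length, margin=1.0):
    """True if any atom is farther than margin * box_length / 2 from the reference.

    `margin` is a hypothetical safety factor, not a GROMACS parameter.
    """
    ref = positions[ref_index]
    limit = margin * 0.5 * box_length
    return any(abs(x - ref) > limit for x in positions)


box_length = 10.0
masses = [1.0, 1.0, 1.0]
# Whole (unwrapped) coordinates; the last atom is 6 > 5 away from atom 0.
positions = [0.0, 2.0, 6.0]

true_com = sum(m * x for x, m in zip(positions, masses)) / sum(masses)
pbc_com = com_via_reference(positions, masses, 0, box_length)

print(group_too_large(positions, 0, box_length))
print(true_com, pbc_com)
```

In this example the atom at 6 is replaced by its periodic image at -4, so the center of mass computed through the reference atom moves from 8/3 to -2/3. A group that stays within half the box of the reference (for example positions 0, 2, 4) gives the same result either way.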

#11 Updated by Magnus Lundborg 3 months ago

I guess such a check could be a good idea. But what if atoms move further away from the PBC atom than half the box size during the simulation? Then there would still be a problem, I guess, and I don't think it would be fixed by https://gerrit.gromacs.org/#/c/8060/ .

#12 Updated by Gerrit Code Review Bot 3 months ago

Gerrit received a related patchset '1' for Issue #2397.
Uploader: Berk Hess
Change-Id: gromacs~release-2018~Ida7004624a470981d9ce22a1ef921daebad83364
Gerrit URL: https://gerrit.gromacs.org/8193

#13 Updated by Berk Hess 3 months ago

Checking every step is too expensive. We could check periodically in mdrun.
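
A rough sketch of the trade-off behind a periodic check, under simplifying assumptions: one dimension, whole coordinates, and a trajectory held in memory as a list of position lists. The function `flagged_steps` and the `check_interval` parameter are hypothetical and do not correspond to any mdrun option.

```python
def flagged_steps(trajectory, ref_index, box_length, check_interval):
    """Return the checked steps at which some atom is beyond half the box
    from the reference atom. Steps that are not a multiple of
    check_interval are skipped, which saves work but delays detection."""
    flagged = []
    for step, positions in enumerate(trajectory):
        if step % check_interval != 0:
            continue
        ref = positions[ref_index]
        if any(abs(x - ref) > 0.5 * box_length for x in positions):
            flagged.append(step)
    return flagged


box_length = 10.0
# Two atoms; the second drifts away from the reference by 0.3 per step.
trajectory = [[0.0, 4.0 + 0.3 * step] for step in range(10)]

every_step = flagged_steps(trajectory, 0, box_length, check_interval=1)
every_fifth = flagged_steps(trajectory, 0, box_length, check_interval=5)

print(every_step)
print(every_fifth)
```

Here the drifting atom first exceeds half the box at step 4. Checking every step reports it immediately, while checking every fifth step only reports it at step 5, so a longer interval lowers the cost at the price of a window in which the problem goes unnoticed.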

#14 Updated by Berk Hess 3 months ago

Still, periodic checks should not be added in a minor patch version.

#15 Updated by Erik Lindahl 3 months ago

If it causes silent incorrect results I think we need to add it, patch release or not?

#16 Updated by Berk Hess 3 months ago

It might be rather tricky to avoid both false positives and false negatives.

By far the most common case is that the group is already too large at the start of the run. I would not want to put mdrun checks into a patch release.

#17 Updated by Gerrit Code Review Bot 2 months ago

Gerrit received a related patchset '1' for Issue #2397.
Uploader: Magnus Lundborg
Change-Id: gromacs~master~Ib7d7e648204c0d1b219714610de7fb5842713048
Gerrit URL: https://gerrit.gromacs.org/8406

#18 Updated by Paul Bauer 15 days ago

Is this fixed as of now?

#19 Updated by Berk Hess 15 days ago

  • Status changed from Feedback wanted to Resolved

In release-2018 a check has been added to grompp that should already catch the most common cases of this issue.
For the 2019 release the check is much tighter and there is a new PBC option that works much better.

#20 Updated by Paul Bauer 12 days ago

  • Status changed from Resolved to Closed

ok, good enough for me
