Bug #2125

Hexagonal PBC and MPI

Added by Bart Bruininks almost 2 years ago. Updated about 1 year ago.

Status: Closed
Priority: High
Assignee: Berk Hess
Category: mdrun
Target version: 2016.5
Affected version - extra info: 5.1.1 & 5.1.4 & 2016.1
Affected version:
Difficulty: uncategorized

Description

Hey GROMACS people,

I was recently trying to increase the efficiency of my membrane-particle fusion box by changing the PBC to something more like a hexagon (10 10 10 0 0 5 0 10 5). I know this is not a perfect regular hexagon, since the ratio of the box should not be square but something like 10*3^0.5/2, but I figured it should still work. When I create a box with these dimensions I can run perfectly on a hyperthreaded 6-core machine. However, when I move to multiple nodes and start using MPI with GROMACS 2016, things start to go severely wrong in less than 100 steps. I tried different versions of GROMACS (5.1.1 & 5.1.4 & 2016.1), but the issue was always the same. I can't say with 100% certainty that it is really the MPI that causes it to go wrong, but whenever I start asking for more cores than one node can provide, the issue presents itself.

I am not much of a programmer myself and wouldn't be able to solve the issue or pinpoint exactly what goes wrong, but I would like to point out the problem. I will attach the md.tpr file I run (it is a MARTINI system, but that should not matter too much; running with -rdd 2.0 might be necessary, though). Although possibly a small bug, I think it would be worth solving, because these hexagonal boxes are very nice for any particle migrating into a membrane.
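For reference, here is a minimal sketch (added for illustration, not part of the original report) of how a hexagonal-prism-like triclinic box with the 3^0.5/2 ratio mentioned above can be written down in the GROMACS box-vector convention; the edge length and height below are placeholder values:

```python
# Illustration only: box vectors for a hexagonal-prism-like triclinic box in the
# GROMACS convention a=(ax,0,0), b=(bx,by,0), c=(cx,cy,cz).
import math

d, h = 10.0, 10.0                                # xy lattice spacing and height in nm (assumed values)
a = (d, 0.0, 0.0)
b = (d / 2.0, d * math.sqrt(3.0) / 2.0, 0.0)     # 60-degree angle to a -> hexagonal tiling in xy
c = (0.0, 0.0, h)                                # c parallel to z keeps the prism upright

# The last line of a .gro file lists: v1(x) v2(y) v3(z) v1(y) v1(z) v2(x) v2(z) v3(x) v3(y)
gro_box_line = (a[0], b[1], c[2], a[1], a[2], b[0], b[2], c[0], c[1])
print(" ".join(f"{x:.5f}" for x in gro_box_line))
```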

Cheers and hopefully it can be resolved,

Bart

md.tpr (5.73 MB) - The tar which should reproduce the bug when used in combination with MPI. Bart Bruininks, 02/16/2017 07:49 PM
md-rdd-2.log (19.1 KB) - Mark Abraham, 03/09/2017 04:27 PM
md-no-rdd.log (25.9 KB) - Mark Abraham, 03/09/2017 04:27 PM
md-single-rank.log (23.4 KB) - Mark Abraham, 03/09/2017 05:05 PM

Associated revisions

Revision b1a0f28e (diff)
Added by Berk Hess about 1 year ago

Fix triclinic domain decomposition bug

With triclinic unit-cells with vectors a,b,c, the domain decomposition
would communicate an incorrect halo along dimension x when b[x]!=0
and vector c not parallel to the z-axis. The halo cut-off bound plane
was tilted incorrectly along x/z with an error approximately
proportional to b[x]*(c[x] - b[x]*c[y]/b[y]).
When c[x] > b[x]*c[y]/b[y], the communicated halo was too small, which
could cause instabilities or silent errors.
When c[x] < b[x]*c[y]/b[y], the communicated halo was too large, which
could cause some communication overhead.

Fixes #2125

Change-Id: I2109542292beca5be26eddc262e0974c4ae825ea
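To make the sign condition in the commit message concrete, the following is a small illustrative sketch (my own, not taken from the patch) that evaluates the comparison for a made-up triclinic box:

```python
# Illustrative sketch of the condition described in the commit message above.
# The box is a 3x3 matrix of row vectors a, b, c in the GROMACS convention.
def halo_tilt_sign(box):
    a, b, c = box
    # Comparison from the commit message; requires b[y] != 0.
    return c[0] - b[0] * c[1] / b[1]

# Made-up triclinic box with b[x] != 0 and c not parallel to the z-axis.
box = [[10.0, 0.0, 0.0],
       [5.0, 10.0, 0.0],
       [5.0, 5.0, 10.0]]

delta = halo_tilt_sign(box)
if delta > 0:
    print("c[x] > b[x]*c[y]/b[y]: halo too small -> possible instabilities or silent errors")
elif delta < 0:
    print("c[x] < b[x]*c[y]/b[y]: halo too large -> only extra communication overhead")
else:
    print("no x/z tilt error expected for this box")
```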

Revision 3a338158 (diff)
Added by Berk Hess about 1 year ago

Fix triclinic domain decomposition bug

With triclinic unit-cells with vectors a,b,c, the domain decomposition
would communicate an incorrect halo along dimension x when b[x]!=0
and vector c not parallel to the z-axis. The halo cut-off bound plane
was tilted incorrectly along x/z with an error approximately
proportional to b[x]*(c[x] - b[x]*c[y]/b[y]).
When c[x] > b[x]*c[y]/b[y], the communicated halo was too small, which
could cause instabilities or silent errors.
When c[x] < b[x]*c[y]/b[y], the communicated halo was too large, which
could cause some communication overhead.

Fixes #2125

Change-Id: I2109542292beca5be26eddc262e0974c4ae825ea
(cherry picked from commit b1a0f28eb503c5e7974dc8c998797cb71c3f0b42)

History

#1 Updated by Berk Hess almost 2 years ago

  • Status changed from New to Feedback wanted

This should just work.
But you do not specify what actually goes wrong. Do you get incorrect results, an error or a crash?

#2 Updated by Mark Abraham almost 2 years ago

With a single rank, the tpr seems happy to run a thousand steps. But with several rank counts, including 2, 3 and 6, I get DD missing-interactions errors immediately. If I use -rdd, then I get that exactly 1 particle can't be communicated to the PME rank after about nstlist steps.

#3 Updated by Mark Abraham almost 2 years ago

  • Target version changed from 2016.3 to 2016.4

We're about to release 2016.3

#4 Updated by Szilárd Páll over 1 year ago

  • Status changed from Feedback wanted to Accepted
  • Affected version - extra info set to 5.1.1 & 5.1.4 & 2016.1

Added affected versions per the issue description; given Mark's tests, I'll switch this issue to Accepted.

#5 Updated by Mark Abraham over 1 year ago

  • Target version changed from 2016.4 to 2016.5

#6 Updated by Erik Lindahl about 1 year ago

Confirmed this is still present in 2018-rc1.

#7 Updated by Erik Lindahl about 1 year ago

  • Target version changed from 2016.5 to 2018

#8 Updated by Berk Hess about 1 year ago

  • Status changed from Accepted to In Progress
  • Assignee set to Berk Hess

I reproduced this. There is no illegal memory access, so this looks like a logic bug. I'll try to track this down.

#9 Updated by Berk Hess about 1 year ago

  • Priority changed from Normal to High

The communicated halo is missing a lot of atoms; the angle of the halo region is incorrect. This is a rather serious and, in principle, general issue, so I am increasing the priority to high. The box shape is rather uncommon, though: all off-diagonal elements are half the diagonal element, which leads to box angles smaller than 60 degrees. I guess I did not take such geometries into account when writing the halo communication code.
Note that the box in the tpr file is not consistent with the shape mentioned in this issue; the latter is a more usual setup.
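As a quick illustration of the claim about the angles (my own sketch, not from the tracker), the three box angles for a box whose off-diagonal elements are all half of a diagonal element can be computed directly; the b/c angle comes out below 60 degrees:

```python
# Sketch: box angles for a box whose off-diagonal elements are half the diagonal element.
import math

def angle_deg(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = lambda w: math.sqrt(sum(x * x for x in w))
    return math.degrees(math.acos(dot / (norm(u) * norm(v))))

d = 10.0
a = (d, 0.0, 0.0)
b = (d / 2, d, 0.0)
c = (d / 2, d / 2, d)
print(round(angle_deg(a, b), 1))   # ~63.4 degrees
print(round(angle_deg(a, c), 1))   # ~65.9 degrees
print(round(angle_deg(b, c), 1))   # ~56.8 degrees, i.e. smaller than 60
```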

#10 Updated by Berk Hess about 1 year ago

The unit cell here can be transformed to one that has all angles >= 60 degrees. So it looks like the solution is to impose more restrictions on the unit cell in GROMACS.
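To illustrate the kind of transformation meant here (a sketch of my own; the brute-force search over small integer shifts is not the GROMACS procedure), adding or subtracting integer multiples of the other box vectors describes the same periodic lattice but changes the box angles, and for the box above it can bring them all to at least 60 degrees:

```python
# Sketch: brute-force small integer shifts of b and c to find an equivalent box
# description of the same lattice with all angles >= 60 degrees.
import itertools
import math

def angle_deg(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = lambda w: math.sqrt(sum(x * x for x in w))
    return math.degrees(math.acos(dot / (norm(u) * norm(v))))

def shift(v, w, n):
    # v + n * w, a lattice-preserving change of the box description
    return tuple(vi + n * wi for vi, wi in zip(v, w))

d = 10.0
a = (d, 0.0, 0.0)
b = (d / 2, d, 0.0)
c = (d / 2, d / 2, d)

for m, n, k in itertools.product((-1, 0, 1), repeat=3):
    b2 = shift(b, a, m)                 # b' = b + m*a
    c2 = shift(shift(c, a, n), b2, k)   # c' = c + n*a + k*b'
    angles = (angle_deg(a, b2), angle_deg(a, c2), angle_deg(b2, c2))
    if min(angles) >= 60.0:
        print("equivalent box:", a, b2, c2, "angles:", [round(x, 1) for x in angles])
        break
```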

#11 Updated by Erik Lindahl about 1 year ago

Berk:

1) Is that only an issue at the start of a simulation, or could it ever be distorted to reach this shape during a run?

2) Can we do the conversion automatically for the user?

#12 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related patchset '1' for Issue #2125.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2016~I2109542292beca5be26eddc262e0974c4ae825ea
Gerrit URL: https://gerrit.gromacs.org/7438

#13 Updated by Berk Hess about 1 year ago

  • Status changed from In Progress to Fix uploaded
  • Target version changed from 2018 to 2016.5

This error is even more serious than I initially thought. All boxes with b[x]!=0 and (c[y]!=0 or c[z]!=0) are affected. Some might have too many atoms communicated, some too few, which could lead to crashes, as here, or worse: silent errors.
I am surprised we only found this bug after 12 years.
Our default rhombic dodecahedron is not affected. Our default truncated octahedron is correct, but communicates slightly too many atoms.
I uploaded a fix to release-2016. We might also want to fix this in release-5.1; we can cherry-pick the fix there.
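A trivial sketch of the condition stated above (the function name and layout are my own, not GROMACS code):

```python
# Sketch: flag a box matching the affected condition b[x] != 0 and (c[y] != 0 or c[z] != 0).
def matches_affected_condition(box):
    a, b, c = box
    return b[0] != 0.0 and (c[1] != 0.0 or c[2] != 0.0)

# The triclinic geometry discussed in this report matches (values illustrative):
print(matches_affected_condition([[10.0, 0.0, 0.0], [5.0, 10.0, 0.0], [5.0, 5.0, 10.0]]))  # True
# A rectangular box does not:
print(matches_affected_condition([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]))  # False
```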

#14 Updated by Berk Hess about 1 year ago

  • Status changed from Fix uploaded to Resolved

#15 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related patchset '1' for Issue #2125.
Uploader: Berk Hess ()
Change-Id: gromacs~release-5-1~I2109542292beca5be26eddc262e0974c4ae825ea
Gerrit URL: https://gerrit.gromacs.org/7442

#17 Updated by Mark Abraham about 1 year ago

  • Status changed from Resolved to Closed

#18 Updated by Mark Abraham about 1 year ago

For the record, this fix is also present on the release-5-1 branch, but there are no plans for further releases from that branch. If someone wants it, I recommend getting that content from a git clone.
