Bug #1661

Bug with free-energy + GPU + 2/3D domain decomposition

Added by Berk Hess over 2 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized

Description

With GPUs and 2D or 3D domain decomposition, the perturbed-atom flags for non-local i-atoms are set incorrectly. This can lead to subtle errors in energies, forces and dV/dlambda. This setup is unlikely to be used much, since it is inefficient (and with 1D DD there is no issue).

Associated revisions

Revision 268a6b09 (diff)
Added by Berk Hess over 2 years ago

Fix bug FE + GPU + 2/3D domain decomposition

Fixes #1661

Change-Id: Ia84f6c1219a2052df0ed1c5c4d7f66c37ed7f67b

History

#1 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '1' for Issue #1661.
Uploader: Berk Hess
Change-Id: Ia84f6c1219a2052df0ed1c5c4d7f66c37ed7f67b
Gerrit URL: https://gerrit.gromacs.org/4323

#2 Updated by Berk Hess over 2 years ago

  • Status changed from In Progress to Fix uploaded

#3 Updated by Szilárd Páll over 2 years ago

Why do you think that this setup is not used? There is no way to use multiple GPUs other than domain decomposition, and with more than four GPUs DD should most of the time switch to 2D decomposition. So, while it is not a very likely setup, I can still imagine ligand-binding runs that could have been affected.

Hence, I suggest bumping the priority.

We really need that extension to the regressiontests setup.

#4 Updated by Mark Abraham over 2 years ago

Szilárd Páll wrote:

Why do you think that this setup is not used? There is no way to use multiple GPUs other than domain decomposition, and with more than four GPUs DD should most of the time switch to 2D decomposition. So, while it is not a very likely setup, I can still imagine ligand-binding runs that could have been affected.

Hence, I suggest bumping the priority.

We really need that extension to the regressiontests setup.

What needs extending? bs_nix1204 runs nbnxn-free-energy with 2D DD. Replacing, sure...

#5 Updated by Szilárd Páll over 2 years ago

Mark Abraham wrote:

What needs extending? bs_nix1204 runs nbnxn-free-energy with 2D DD.

As I've mentioned before, we are missing testing for a lot of acceleration and parallelization code paths: other setups with 2D/3D decomposition, multi-threading, GPU setups, and the list continues.

Replacing, sure...

Getting off-topic here, but let me answer briefly this time because this is getting old. As far as I can tell, that's your personal opinion, which the developer community never agreed on. For instance, both Roland and I were of the opinion that there is currently nothing on the horizon that can take over the role that regression testing has (that is, integration tests).

#6 Updated by Mark Abraham over 2 years ago

Szilárd Páll wrote:

Mark Abraham wrote:

What needs extending? bs_nix1204 runs nbnxn-free-energy with 2D DD.

As I've mentioned before, we are missing testing for a lot of acceleration and parallelization code paths: other setups with 2D/3D decomposition, multi-threading, GPU setups, and the list continues.

None of that pertains to this issue. We had a test case and test setup that ran free-energy code with 2D DD and GPUs, and that design did not detect this issue with respect to the single-threaded reference run. I presume the discrepancy was too small for the floating-point tolerances we use with gmx check. So extending the scope of the parallelization code paths seems unlikely to help detect it. We'd need a tighter acceptable range of error (and someone to put in the time to test that range in all the proposed contexts), or a different kind of test. See my proposals at #1587.
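To illustrate the tolerance point, here is a minimal sketch. It is not how gmx check is implemented; the helper name, the energy value and the tolerances are all hypothetical, chosen only to show how a small systematic error can pass a loose relative tolerance and fail a tighter one.

```python
import math


def within_tolerance(reference: float, value: float, rel_tol: float) -> bool:
    """Return True when value matches reference within a relative tolerance.

    Hypothetical helper for illustration; not the gmx check comparison.
    """
    return math.isclose(reference, value, rel_tol=rel_tol, abs_tol=0.0)


# Hypothetical numbers: a reference energy and a parallel-run energy
# that carries a small systematic relative error of 2e-4.
reference_energy = -1234.5678
parallel_energy = reference_energy * (1.0 + 2e-4)

# A loose tolerance accepts the erroneous value, a tighter one rejects it.
loose_passes = within_tolerance(reference_energy, parallel_energy, rel_tol=1e-3)
tight_passes = within_tolerance(reference_energy, parallel_energy, rel_tol=1e-5)

print("loose tolerance passes:", loose_passes)
print("tight tolerance passes:", tight_passes)
```

A tighter tolerance would flag this case, but it would also have to be validated against the legitimate floating-point differences between run configurations, which is the cost mentioned above.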

Replacing, sure...

Getting off-topic here

Discussion continued at #1587.

#7 Updated by Mark Abraham over 2 years ago

  • Subject changed from Buf with free-energy + GPU + 2/3D domain decomposition to Bug with free-energy + GPU + 2/3D domain decomposition
  • Status changed from Fix uploaded to Resolved

The code issue is resolved, but how to test better is a discussion for elsewhere.

#8 Updated by Mark Abraham about 2 years ago

  • Status changed from Resolved to Closed
