Bug #2765

gmxapi MPI tests don't handle many-core systems

Added by Eric Irrgang 10 months ago. Updated 9 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
testing
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized

Description

When run on a machine with more than a few cores, the gmxapi MPI-enabled tests produce a fatal error when trying to set up parallelism resources.

Fatal error:
Your choice of number of MPI ranks and amount of resources results in using 16
OpenMP threads per rank, which is most likely inefficient. The optimum is
usually between 1 and 6 threads per rank. If you want to run with this setup,
specify the -ntomp option. But we suggest to change the number of MPI ranks.
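The arithmetic behind the error is straightforward: with few MPI ranks on a many-core node, mdrun divides all available cores among the ranks as OpenMP threads, overshooting the recommended 1-6 threads per rank. A small illustration (a hypothetical helper, not part of GROMACS, modelling an even split of cores over ranks):

```python
def threads_per_rank(cores: int, ranks: int) -> int:
    """OpenMP threads each MPI rank receives when cores are divided evenly."""
    return cores // ranks

# One rank on a 16-core node gets all 16 cores as OpenMP threads,
# well above the 1-6 threads/rank range the error message recommends.
assert threads_per_rank(16, 1) == 16

# Eight ranks bring the same node down to 2 threads per rank.
assert threads_per_rank(16, 8) == 2
```

This is why the error message suggests changing the number of MPI ranks: increasing ranks is the usual way to bring the per-rank thread count back into the efficient range without overriding the check via `-ntomp`.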

The proposed solution uses additional CMake configuration for the mdrun test fixture to configure the execution environment sensibly.

Associated revisions

Revision 3fd14b5f (diff)
Added by Paul Bauer 9 months ago

Disable gmxapi by default

Due to outstanding issues with the integration testing, and tests failing
with large numbers of ranks, the gmxapi default has been changed so that
it is not built. In Jenkins, all supported builds are still set to build
with GMXAPI enabled.

Refs #2765, #2722, #2756

Change-Id: I2cc42c461edc206aaa30be6cac3db0a52ccae991

Revision 0e7fba7c (diff)
Added by Eric Irrgang 9 months ago

Restrict number of OpenMP threads during testing.

The default testing configuration causes errors
when run on many-core systems, requiring additional user input to
guide resource allocation. This change borrows the solution used
in the mdrun tests, setting `OPENMP_THREADS 2` in the CMake macro
`gmx_register_gtest_test()`.

Refs #2765

Change-Id: Ia05ca6130b60b4b6ccdb4e13e3afe1e5c7419fd3
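A sketch of what the test registration might look like after this change (the exact signature of `gmx_register_gtest_test()` in the GROMACS build system may differ; the `OPENMP_THREADS 2` argument follows the commit message above, and the target names here are hypothetical):

```cmake
# Hypothetical registration of the gmxapi MPI tests, mirroring the
# mdrun tests: cap each test process at 2 OpenMP threads so the
# threads-per-rank heuristic cannot trip on many-core machines.
gmx_register_gtest_test(GmxapiMpiTests gmxapi-mpi-test
                        MPI_RANKS 2 OPENMP_THREADS 2)
```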

Revision f611c272 (diff)
Added by Eric Irrgang about 1 month ago

Add gmxapi_pytest custom target to CMake `check` target.

Automatically run pytest for the gmxapi Python package in the build tree
when GMX_PYTHON_PACKAGE=ON.

Refs #2765

Change-Id: I3f33b9903fc712d0c027810af6b6af605928f3ab
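Following the commit message, the wiring might look roughly like the sketch below (an assumption, not the actual GROMACS CMake code; the target and variable names other than `GMX_PYTHON_PACKAGE` and `gmxapi_pytest` are guesses):

```cmake
# Hypothetical: run pytest for the gmxapi Python package in the build
# tree as part of the `check` target when the package is being built.
if(GMX_PYTHON_PACKAGE)
    add_custom_target(gmxapi_pytest
        COMMAND ${PYTHON_EXECUTABLE} -m pytest
                ${CMAKE_CURRENT_SOURCE_DIR}/test
        USES_TERMINAL)
    add_dependencies(check gmxapi_pytest)
endif()
```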

History

#1 Updated by Gerrit Code Review Bot 10 months ago

Gerrit received a related patchset '1' for Issue #2765.
Uploader: M. Eric Irrgang ()
Change-Id: gromacs~release-2019~Ia05ca6130b60b4b6ccdb4e13e3afe1e5c7419fd3
Gerrit URL: https://gerrit.gromacs.org/8715

#2 Updated by Eric Irrgang 10 months ago

  • Description updated (diff)
  • Status changed from New to Fix uploaded

#3 Updated by Gerrit Code Review Bot 9 months ago

Gerrit received a related DRAFT patchset '1' for Issue #2765.
Uploader: Paul Bauer ()
Change-Id: gromacs~release-2019~I2cc42c461edc206aaa30be6cac3db0a52ccae991
Gerrit URL: https://gerrit.gromacs.org/8801

#4 Updated by Paul Bauer 9 months ago

  • Target version changed from 2019 to 2020

This is more likely to happen in 2020 now

#5 Updated by Eric Irrgang 9 months ago

Why would this be deferred to 2020 instead of a patch to 2019? This seems like a bug to me.

#6 Updated by Erik Lindahl 9 months ago

Only because there hasn't been that much activity on it during the beta cycle :-)

One (minor) caveat is that the iterative/trial-and-error approach of focusing on one failure at a time works fine during the development phase (or for anything that isn't enabled by default), but to re-enable it by default in a patch for the stable branch I think there has to be a clear summary of a wide range of multi-core / multi-node OpenMP and MPI setups where the testing has been confirmed to pass fine.

#7 Updated by Eric Irrgang 9 months ago

Erik Lindahl wrote:

Only because there hasn't been that much activity on it during the beta cycle :-)

Unfortunately, it seems to have slipped through the cracks. :-\

One (minor) caveat is that the iterative/trial-and-error approach of focusing on one failure at a time works fine during the development phase (or for anything that isn't enabled by default), but to re-enable it by default in a patch for the stable branch I think there has to be a clear summary of a wide range of multi-core / multi-node OpenMP and MPI setups where the testing has been confirmed to pass fine.

In this case, tests were modified such that an error was introduced, and I believe that error is now fixed. The tests were in place in both forms for the beta releases, I believe, but I haven't heard of a failure case other than mine. If there were other reports, can we ask those users to test the patch?

Otherwise, I should be able to build and run on tcblab with a bit of help.

#8 Updated by Paul Bauer 9 months ago

  • Target version changed from 2020 to 2019

#9 Updated by Paul Bauer 9 months ago

  • Status changed from Fix uploaded to Resolved

#10 Updated by Paul Bauer 9 months ago

  • Status changed from Resolved to Closed
