Bug #1624

complex/nbnxn_vsite regressiontest fails when doing CPU-only rerun

Added by Szilárd Páll almost 5 years ago. Updated almost 5 years ago.

Status:
Closed
Priority:
Normal
Category:
testing
Target version:
Affected version - extra info:
5.0.3-dev-20141001-a349e4b
Affected version:
Difficulty:
uncategorized

Description

The tests are executed with 8 ranks, which results in the complex/nbnxn_vsite test failing with a domain decomposition (DD) error.
This suggests that the set ranks hack likely does not work with MPI.
Correction: this happens both with MPI and tMPI; instead it looks like it's an issue with the CPU-only rerun.

Used the latest regressiontests auto-downloaded by cmake; ran with vanilla "make check".


Related issues

Related to GROMACS - Bug #1543: complex/nbnxn_vsite regressiontests fails with GPUs (Closed, 06/30/2014)

Associated revisions

Revision 70eafb6a (diff)
Added by Szilard Pall almost 5 years ago

Respect rank/thread maximum in CPU reruns

CPU-only reruns done after GPU-enabled tests didn't respect MPI rank
and OpenMP thread count limits.

Minor refactorings to re-use variables that contain the names of the
files that limit attempted parallelism, and to clarify how the
rerun sets up its directories.

Fixes #1624

Change-Id: I58210e37129d725ef2fe2e7b1ff3c2b5ecfb5d74
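
The behaviour this commit message describes can be pictured with a small sketch. The Python below is not the regressiontests harness, whose implementation is not shown in this issue. The file names `max-ranks` and `max-threads` and all function names are hypothetical stand-ins for "the files that limit attempted parallelism". It only illustrates the idea that a CPU-only rerun should apply the same per-test caps as the original run.

```python
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical names for the per-test limit files; the real names are not
# given in this issue.
RANK_LIMIT_FILE = "max-ranks"
THREAD_LIMIT_FILE = "max-threads"


def read_limit(test_dir: Path, limit_file: str) -> Optional[int]:
    """Return the integer stored in test_dir/limit_file, or None if absent."""
    path = test_dir / limit_file
    if not path.is_file():
        return None
    return int(path.read_text().strip())


def clamp_parallelism(requested: int, limit: Optional[int]) -> int:
    """Cap a requested rank or thread count at the test's limit, if any."""
    if limit is None:
        return requested
    return min(requested, limit)


def rerun_settings(test_dir: Path, ranks: int, threads: int) -> Tuple[int, int]:
    """Compute the (ranks, threads) a rerun should use for one test.

    The same limit-file names are reused for the first run and the rerun,
    so both code paths respect identical caps.
    """
    capped_ranks = clamp_parallelism(ranks, read_limit(test_dir, RANK_LIMIT_FILE))
    capped_threads = clamp_parallelism(threads, read_limit(test_dir, THREAD_LIMIT_FILE))
    return capped_ranks, capped_threads
```

With a test directory whose `max-ranks` file contains 4, a request for 8 ranks would be reduced to 4 in both the initial run and the rerun, which is the property the fix is meant to restore.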

History

#1 Updated by Szilárd Páll almost 5 years ago

  • Related to Bug #1543: complex/nbnxn_vsite regressiontests fails with GPUs added

#2 Updated by Szilárd Páll almost 5 years ago

  • Subject changed from complex/nbnxn_vsite regressiontest fails with MPI to complex/nbnxn_vsite regressiontest fails when doing CPU-only rerun
  • Description updated (diff)

#3 Updated by Szilárd Páll almost 5 years ago

Hopefully it's fixed. I'm still quite puzzled how this passed manual and automated testing. Or has this CPU rerun feature not been tested?

#4 Updated by Mark Abraham almost 5 years ago

Szilárd Páll wrote:

Hopefully it's fixed. I'm still quite puzzled how this passed manual and automated testing. Or has this CPU rerun feature not been tested?

The mechanism you used hard-codes -np 8 for GMX_LIB_MPI in tests/CMakeLists.txt. Probably, it no longer should?

Otherwise, if one tests manually on an Intel machine with a moderate number of cores, then mdrun chooses OpenMP over those cores and the problem you observed does not arise. Your observations seem likely to be possible only on a non-Intel machine, or on an Intel machine with more cores than our heuristic limits.

#5 Updated by Szilárd Páll almost 5 years ago

Mark Abraham wrote:

Szilárd Páll wrote:

Hopefully it's fixed. I'm still quite puzzled how this passed manual and automated testing. Or has this CPU rerun feature not been tested?

The mechanism you used hard-codes -np 8 for GMX_LIB_MPI in tests/CMakeLists.txt. Probably, it no longer should?

Not sure which mechanism I used. Do you mean running "make check" hardcodes the number of ranks? I don't think we should, because that ensures that a certain test is only run with a certain number of ranks.

#6 Updated by Mark Abraham almost 5 years ago

Szilárd Páll wrote:

Mark Abraham wrote:

Szilárd Páll wrote:

Hopefully it's fixed. I'm still quite puzzled how this passed manual and automated testing. Or has this CPU rerun feature not been tested?

The mechanism you used hard-codes -np 8 for GMX_LIB_MPI in tests/CMakeLists.txt. Probably, it no longer should?

Not sure which mechanism I used. Do you mean running "make check" hardcodes the number of ranks?

Yes, since your OP reported that you used "make check." That is implemented in the file I mentioned and does what I said it does :-)

I don't think we should, because that ensures that a certain test is only run with a certain number of ranks.

Indeed.

#7 Updated by Szilárd Páll almost 5 years ago

  • Status changed from New to In Progress
  • Assignee set to Szilárd Páll

#8 Updated by Szilárd Páll almost 5 years ago

Fixed in commit:70eafb6

#9 Updated by Mark Abraham almost 5 years ago

  • Status changed from In Progress to Closed
