Bug #1680

Multiple problems with Jenkins+GPU regressiontests

Added by Roland Schulz over 5 years ago. Updated over 4 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Affected version - extra info:
Affected version:
Difficulty:
uncategorized

Description

See http://jenkins.gromacs.org/job/Gromacs_Gerrit_master/8021/OPTIONS=Compiler=gcc%20CompilerVersion=4.7%20GMX_TEST_NPME=ON%20GMX_GPU=ON%20CUDA=5.5%20GMX_DOUBLE=OFF%20GMX_THREAD_MPI=ON%20GMX_SIMD=SSE4.1%20CMAKE_BUILD_TYPE=Release%20host=bs_nix1310,label=bs_nix1310/consoleFull

Problems:
- Something segfaults (see core file) in libcudart and no error is reported
- tests that should run fine are rerun with CPU only and no error is reported
- tests which don't have PME are run with 3 MPI ranks but only 2 GPUs, and are then rerun on the CPU


Related issues

Related to GROMACS - Task #1587: improve the configurability of regression tests (New)

Associated revisions

Revision 4ca48b56 (diff)
Added by Mark Abraham over 5 years ago

Fix logic for testing -npme

Renamed function to be a better description, removed redundant logic,
and fixed other logic.

Refs #1680

Change-Id: I85bedf3a6a8550a525a37d94185784d3c9054641

History

#1 Updated by Gerrit Code Review Bot over 5 years ago

Gerrit received a related patchset '1' for Issue #1680.
Uploader: Mark Abraham
Change-Id: I85bedf3a6a8550a525a37d94185784d3c9054641
Gerrit URL: https://gerrit.gromacs.org/4441

#2 Updated by Mark Abraham over 5 years ago

Roland Schulz wrote:

See http://jenkins.gromacs.org/job/Gromacs_Gerrit_master/8021/OPTIONS=Compiler=gcc%20CompilerVersion=4.7%20GMX_TEST_NPME=ON%20GMX_GPU=ON%20CUDA=5.5%20GMX_DOUBLE=OFF%20GMX_THREAD_MPI=ON%20GMX_SIMD=SSE4.1%20CMAKE_BUILD_TYPE=Release%20host=bs_nix1310,label=bs_nix1310/consoleFull

Problems:
- Something segfaults (see core file) in libcudart and no error is reported

Haven't looked at that yet

- tests that should run fine are rerun with CPU only and no error is reported

Re-running upon failure and re-running without the GPU are two different behaviours. I think the Jenkins-visible output is misleading because of the way we've set up the verbosity, rather than the behaviour being wrong. Those tests are failing, but we are not seeing output from (I presume) the successful GPU version that is tried next. Only then does re-running without the GPU come along.

- tests which don't have PME are run with 3 MPI ranks but only 2 GPUs, and are then rerun on the CPU

I uploaded a fix above for a related issue, but I'm not sure whether it is relevant here. Jenkins will tell us. I think the behaviour of the GMX_TEST_NPME Jenkins configs on non-PME tests can be anything we want that doesn't give errors, so failing and re-running on one rank is not horrible. We could modify gmxtest.pl to have a "run only test cases with PME" mode if we think that's something we need.

As above, I don't think re-running on the CPU is part of the problem.

#3 Updated by Mark Abraham over 5 years ago

Mark Abraham wrote:

- tests which don't have PME are run with 3 MPI ranks but only 2 GPUs, and are then rerun on the CPU

I uploaded a fix above for a related issue, but I'm not sure whether it is relevant here. Jenkins will tell us. I think the behaviour of the GMX_TEST_NPME Jenkins configs on non-PME tests can be anything we want that doesn't give errors, so failing and re-running on one rank is not horrible. We could modify gmxtest.pl to have a "run only test cases with PME" mode if we think that's something we need.

http://jenkins.gromacs.org/job/Gromacs_Gerrit_master/8029/OPTIONS=Compiler=gcc%20CompilerVersion=4.7%20GMX_TEST_NPME=ON%20GMX_GPU=ON%20CUDA=5.5%20GMX_DOUBLE=OFF%20GMX_THREAD_MPI=ON%20GMX_SIMD=SSE4.1%20CMAKE_BUILD_TYPE=Release%20host=bs_nix1310,label=bs_nix1310/console shows my change didn't fix anything relevant.

There are a few possible fixes:
  1. observe in gmxtest.pl that, for non-PME test cases, the length of -gpu_id should divide the number of ranks, and guess what to do about it (messy)
  2. observe in gmxtest.pl that this is the GMX_TEST_NPME Jenkins config and only test PME code paths (currently only possible if we react to -npme != -1, which would change the behaviour of any manual use of such settings, which is probably fine)
  3. observe in gmxtest.pl that we are re-running because of one of the failure cases that are handled by the script, and be verbose about the subsequent success, before possibly doing CPU-only reruns (perhaps helpful in other cases too)
  4. separate CPU-only Jenkins configs from require-GPU configs (not sure if it is feasible to preserve the build artefact to run tests in two modes, but this could have other uses)
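
For illustration only, the check in option 1 might look roughly like the sketch below. It is written in Python rather than in the Perl of gmxtest.pl, and the function names and arguments are hypothetical, not part of the script. It assumes that separate PME ranks do not use a GPU, so only the remaining ranks are compared against the number of ids passed to -gpu_id, and that a non-PME test case has no separate PME ranks regardless of the -npme setting.

```python
def pp_rank_count(total_ranks, npme, uses_pme):
    """Number of ranks that would need a GPU id (hypothetical helper).

    Assumption: separate PME ranks only exist when the test case uses PME
    and -npme is positive; -npme of -1 or 0 means no separate PME ranks here.
    """
    pme_ranks = npme if (uses_pme and npme > 0) else 0
    return total_ranks - pme_ranks


def ranks_match_gpu_ids(total_ranks, gpu_id, uses_pme, npme):
    """Return True if the number of GPU ids divides the GPU-using rank count."""
    n_ids = len(gpu_id)
    if n_ids == 0:
        # No -gpu_id string given, so there is nothing to check in this sketch.
        return True
    pp_ranks = pp_rank_count(total_ranks, npme, uses_pme)
    return pp_ranks > 0 and pp_ranks % n_ids == 0


# The situation from this report: 3 ranks, 2 GPU ids, -npme 1.
print(ranks_match_gpu_ids(3, "01", uses_pme=True, npme=1))   # PME case: 2 GPU-using ranks
print(ranks_match_gpu_ids(3, "01", uses_pme=False, npme=1))  # non-PME case: 3 GPU-using ranks
```

Under these assumptions the PME case passes the check and the non-PME case does not, which matches the failure seen on Jenkins. What the script should do after a failed check (reduce the rank count, skip the case, or something else) is the messy part and is not addressed here.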

#4 Updated by Mark Abraham over 4 years ago

  • Status changed from New to Rejected

Things seem to be working fine these days. If we're going to fix anything, it won't be in gmxtest.pl. We need to separate the idea of testing that default behaviour works from the idea of testing that specific behaviour is correct.

#5 Updated by Mark Abraham over 4 years ago

  • Related to Task #1587: improve the configurability of regression tests added
