Task #2986

Post submit failing in two configurations

Added by Paul Bauer 3 months ago. Updated about 1 month ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
core library
Difficulty:
uncategorized
Close

Description

Post submit has been failing for weeks now on two different configurations.

Since https://gerrit.gromacs.org/c/gromacs/+/10818 got submitted, the complex.nbnxn-ljpme-LB-geometric test has been failing every time on the

gcc-5 gpuhw=nvidia nranks=4 gpu_id=1 cuda-10.0 no-hwloc release-with-assert host=bs_nix1204

configuration.

Also, since https://gerrit.gromacs.org/c/gromacs/+/11197 the

gcc-5 simd=ARM_NEON no-hwloc release-with-assert host=bs_jetson_tk1

configuration has been timing out in the first complex test of the regressiontests.


Related issues

Related to GROMACS - Bug #2989: (thread-) MPI setup hanging on bs_jetson_tk1 (Closed)
Related to GROMACS - Bug #3063: Long-distance bonded interaction issue with CUDA and domain decomposition (Closed)

Associated revisions

Revision 19ceb466 (diff)
Added by Szilárd Páll 3 months ago

Refactor tracking of GPU short-range work/skipping

This change introduces a set of flags that record, for each interaction
locality, whether short-range interactions are computed, and
exposes a query in the nonbonded module's API.
This allows consistent checks for both when work has been done
and whether results need to be reduced.

Refs #2986

Change-Id: I15020d83f73a132d9b8e93d7339529176396089a
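
The idea described in the commit message above can be illustrated with a minimal sketch. It assumes two interaction localities (local and non-local) and one boolean flag per locality. All names here (InteractionLocality, GpuShortRangeWorkTracker, countLocalitiesToReduce) are hypothetical and are not taken from the GROMACS source; the actual change may be structured differently.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Hypothetical names throughout; this is not the actual GROMACS API.
enum class InteractionLocality : std::size_t
{
    Local    = 0,
    NonLocal = 1,
    Count    = 2 // number of localities, not a valid locality itself
};

// Holds one flag per interaction locality, set when work is scheduled.
class GpuShortRangeWorkTracker
{
public:
    // Record whether this locality has short-range work in the current step.
    void setHaveWork(InteractionLocality locality, bool haveWork)
    {
        assert(locality != InteractionLocality::Count);
        haveWork_[static_cast<std::size_t>(locality)] = haveWork;
    }

    // Single query intended for both checks: "was work done?" and
    // "do results need to be reduced?".
    bool haveWork(InteractionLocality locality) const
    {
        assert(locality != InteractionLocality::Count);
        return haveWork_[static_cast<std::size_t>(locality)];
    }

private:
    // Value-initialized, so every flag starts out false.
    std::array<bool, static_cast<std::size_t>(InteractionLocality::Count)> haveWork_{};
};

// Count the localities whose results would need reducing, using the same
// query that decides whether work was done.
inline int countLocalitiesToReduce(const GpuShortRangeWorkTracker& tracker)
{
    int count = 0;
    for (auto locality : { InteractionLocality::Local, InteractionLocality::NonLocal })
    {
        if (tracker.haveWork(locality))
        {
            ++count;
        }
    }
    return count;
}
```

Routing both decisions through one query means the launch path and the reduction path cannot disagree about whether a locality had work, which is the consistency the commit message refers to.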

History

#1 Updated by Szilárd Páll 3 months ago

I suggest we start by giving the slave a restart. We need to let the post-submit queue drain first.

#2 Updated by Szilárd Páll 3 months ago

Paul Bauer wrote:

Also, since https://gerrit.gromacs.org/c/gromacs/+/11197 the
[...]
configuration has been timing out in the first complex test of the regressiontests.

That's also the complex.nbnxn-ljpme-LB-geometric test failing, not a timeout:
http://jenkins.gromacs.org/job/Matrix_PostSubmit_master/1418/OPTIONS=gcc-5%20gpuhw=nvidia%20nranks=4%20gpu_id=1%20cuda-10.0%20no-hwloc%20release-with-assert%20host=bs_nix1204,label=bs_nix1204/consoleFull

#3 Updated by Szilárd Páll 3 months ago

PS: Looking at the post-submit history, I see no indication of timeouts. Have these resolved themselves?

#5 Updated by Paul Bauer 3 months ago

  • Related to Bug #2989: (thread-) MPI setup hanging on bs_jetson_tk1 added

#6 Updated by Mark Abraham 3 months ago

I've definitely seen timeouts, and the first commit that introduced them was definitely the one where I reorganized thread-MPI setup. I didn't think to try rebooting the slave when I had a chance, but I agree that we should try that first. I see Szilárd has taken tk1 offline. If that doesn't work, please revert my commit (though I will be on holiday and will not be able to do it myself).

#7 Updated by Szilárd Páll 3 months ago

  • Status changed from New to In Progress

Both issues are WIP. IMO this should not be a "bug", as it just links two failing tests that are two different issues; so perhaps convert it to a task (e.g. "investigate...") and link the relevant tasks.

#8 Updated by Mark Abraham about 1 month ago

  • Tracker changed from Bug to Task
  • Affected version deleted (git master)

#9 Updated by Mark Abraham about 1 month ago

  • Status changed from In Progress to Closed

I don't think these issues are still occurring.

#10 Updated by Szilárd Páll about 1 month ago

  • Related to Bug #3063: Long-distance bonded interaction issue with CUDA and domain decomposition added
