Project

General

Profile

Bug #2113

Google tests and execution order

Added by Aleksei Iupinov almost 3 years ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
testing
Target version:
-
Affected version - extra info:
Affected version:
Difficulty:
uncategorized
Close

Description

Google test binaries support --gtest_shuffle argument which randomizes execution order of tests within each test case.
Passing this argument to the current Gromacs tests breaks at least one test case (GenconfTest in GmxPreprocessTests) - see Jenkins results http://jenkins.gromacs.org/job/Gromacs_Gerrit_master_nrwpo/2526/.
Using this argument would facilitate pinning down flaky bugs such as #2093 and should be supported in all tests.

Associated revisions

Revision 2cce9cb6 (diff)
Added by David van der Spoel over 2 years ago

Fix non-reproducability of clustsize tests.

Due to the use of static variables in the old command-line
parsing codes, running the same test repeatedly would "inherit"
settings from the previous test. This is fixed here by
removing the static flags and updating some test results.

Part of #2113

Change-Id: Ie5bdf4aaefb8acfdb07a4ee65652a1c3c8ca84e6

Revision 583b76df (diff)
Added by David van der Spoel over 2 years ago

Fixed test-order dependency in gmxpreprocess tests.

Due to the use of static variable in genconf.cpp and
solvate.cpp, used for initializing command-line args
the result of the tests were dependent on the order of
execution. The reason is that static variables set in one
of the test runs are "inherited" by the next run. A
particular example is a randum number seed, which when 0
is set to a random seed, but just the first time around. This
is obviously not a good idea when aiming for reproducability.

Part of #2113.

Change-Id: I8bd8d36c75dcd20824e7f3db7268512387cd04ca

Revision 90b8d167 (diff)
Added by David van der Spoel over 2 years ago

Removed unnecessary static declarations for command-line parsing

For historical reasons command-line parsing needed static variables
when declaring the t_pargs structures, however modern compilers
do not need this anymore. Simultaneously, repeatedly calling a function
containing static variables may lead to irreproducible results.
The programs fixed here have no unit tests, however grompp and pdb2gmx
are tested in the regression tests, which pass.

Fixed uninitialized variable that turned up when not static anymore.

TODO: fix the same in analysis tools.

Part of #2113.

Change-Id: I1f67cc54362062289264b83078d3317f8e914faf

History

#1 Updated by Gerrit Code Review Bot almost 3 years ago

Gerrit received a related patchset '2' for Issue #2113.
Uploader: Aleksei Iupinov ()
Change-Id: gromacs~master~Ic92745a2e50f0f1a0194672a635b295a8caa3e7f
Gerrit URL: https://gerrit.gromacs.org/6448

#2 Updated by Mark Abraham almost 3 years ago

Agree that we want such things to work, but I think an appropriate way to manage it is to test periodically that it works, e.g a weekly job or something. Non-reproducibility in per-commit testing leads to problems where you don't know whether the updated patch worked because the issues is resolved, or because e.g. the changed execution order hides it.

#3 Updated by Aleksei Iupinov almost 3 years ago

I think that a random test order would not allow many newly-introduced bugs to slip by, especially if we also add --gtest_repeat=5 for multiple iterations.
If exhaustive testing is undesirable to be default behavior, then it can be a weekly/post-submit/manual target only.
It should be something that developers are aware and make use of, though.

#4 Updated by Aleksei Iupinov almost 3 years ago

You are implying that the test order is conceptually part of test reproducibility.
I don't see why would it be? You can run any individual test case and test separately using command line arguments.
This should already produce no failures if your test is correct.
Test sequence order is not part of the testing functionality, is it? :-)

#5 Updated by Mark Abraham almost 3 years ago

Aleksei Iupinov wrote:

I think that a random test order would not allow many newly-introduced bugs to slip by, especially if we also add --gtest_repeat=5 for multiple iterations.

Repeating by default is probably undesirable. We'd have to somehow make clear to developers which repeat failed - just the console log isn't great. It would also run longer, which ought not be a problem for well implemented tests, but quite a few GoogleTests currently rely on infrastructure that does write physical files, etc. (Those could be refactored, or shifted to less frequent testing.) But I don't see a benefit worth the work, yet.

If exhaustive testing is undesirable to be default behavior, then it can be a weekly/post-submit/manual target only.

First we'd have to do the work to get that working (probably not big), and split the current tests into those categories.

It should be something that developers are aware and make use of, though.

Have a look in docs/developer-guide, I don't know if we mention shuffle and repeat there.

#6 Updated by Mark Abraham almost 3 years ago

Aleksei Iupinov wrote:

You are implying that the test order is conceptually part of test reproducibility.
I don't see why would it be? You can run any individual test case and test separately using command line arguments.
This should already produce no failures if your test is correct.
Test sequence order is not part of the testing functionality, is it? :-)

In a perfect world, the order would not matter, but when there's a big black box people are working on top of, they like it to have predictable behaviour because that builds trust (even if not deserved...) Learning how to make it predictable is mental burden that should have a benefit, if it is to be accepted.

I think the most likely sources of bugs caught by shuffling are
  • that someone forgot to specify a random seed and the generation is somehow stateful (possibly that's the issue with the gmxpreprocess-test failures), and
  • use of uninitialized memory (for which we should add more of the Sanitizers and probably reinstate more valgrind).
    For those, an infrequent run whose reason is to implement shuffling, should suffice.

#7 Updated by Aleksei Iupinov almost 3 years ago

Mark Abraham wrote:

Have a look in docs/developer-guide, I don't know if we mention shuffle and repeat there.

It only mentions getting command line help and filtering the tests.

In a perfect world, the order would not matter, but when there's a big black box people are working on top of, they like it to have predictable behaviour because that builds trust (even if not deserved...) Learning how to make it predictable is mental burden that should have a benefit, if it is to be accepted.

I see your point.

Use of uninitialized memory (for which we should add more of the Sanitizers and probably reinstate more valgrind).

Is there more sanitizing available to add though?

For those, an infrequent run whose reason is to implement shuffling, should suffice.

Agreed.

#8 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '1' for Issue #2113.
Uploader: David van der Spoel ()
Change-Id: gromacs~master~Ie5bdf4aaefb8acfdb07a4ee65652a1c3c8ca84e6
Gerrit URL: https://gerrit.gromacs.org/6786

#9 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '1' for Issue #2113.
Uploader: David van der Spoel ()
Change-Id: gromacs~master~I8bd8d36c75dcd20824e7f3db7268512387cd04ca
Gerrit URL: https://gerrit.gromacs.org/6787

#10 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '1' for Issue #2113.
Uploader: David van der Spoel ()
Change-Id: gromacs~master~I1f67cc54362062289264b83078d3317f8e914faf
Gerrit URL: https://gerrit.gromacs.org/6788

#11 Updated by Mark Abraham over 1 year ago

Aleksei Iupinov wrote:

Mark Abraham wrote:

Use of uninitialized memory (for which we should add more of the Sanitizers and probably reinstate more valgrind).

Is there more sanitizing available to add though?

Our thread sanitizer usage doesn't cover relevant cases (particularly OpenMP, and CUDA). Leak Sanitizer is suppressed from AddressSanitizer. Undefined Behaviour Sanitizer isn't implemented.

#12 Updated by Mark Abraham over 1 year ago

Is there any further need of work here?

#13 Updated by Aleksei Iupinov over 1 year ago

Not unless someone wants to set up a nightly shuffle test. I don't :-)

Also available in: Atom PDF