Bug #1732

review and extend Jenkins test setups, coverage

Added by Szilárd Páll over 4 years ago. Updated over 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Jenkins
Target version: -
Affected version - extra info:
Affected version:

Description

[ RFC - sorry for the spam ]

The Jenkins setup has gotten to a point where, IMO, it has grown issues that warrant a serious review, and as I see it an extensive overhaul is also badly needed - both in terms of setups/configuration and in terms of the practices we need to improve.

Several issues have been noticed recently regarding configurations that went missing (CUDA 6.5) or were added in an inconsistent manner (AVX2 only in rel-5.0). Additionally, the verification configs are lacking many common/important use-cases (e.g. AVX_128_FMA, which even the hypervisor supports, GPU builds on anything but AVX + CUDA <=5.5, etc.). We can only avoid this if things get better documented and changes get recorded - and possibly even reviewed by others.

Additionally, I think we need to give up, as soon as possible, on trying to squeeze good coverage out of the verification matrix. This is inefficient (it leads to large build slave usage spikes), leaves many setups untested, and ultimately the verification setups are at the mercy of what the few maintainers of Jenkins consider important/good enough at the time of making tweaks (e.g. apparently Phi is more important than AVX2 or CUDA>5.5 on master :).
Instead, I think we should pick the most important and most widely used configs (in terms of platforms/compilers/lib dependencies etc.) for verification and run many more configs as nightly/weekly builds. I do realize that this comes with the inconvenience of some issues not being caught in verification, but in exchange we get wider coverage and better average build slave hardware utilization.

Last but not least, if/when we manage to redesign the configs, we should get better at communicating to devs (and users) what exactly we test. Two wiki pages should IMO be enough: one for users that tells them what and how often we test (as we discussed years ago, defining levels of "testedness"), and another that summarizes the exact setups of build slaves, configurations, etc., so devs can know what/how we are testing without having to log into a slave, ask, or guess.


Related issues

Related to Support Platforms - Feature #1601: use Git for Jenkins Config (New, 09/21/2014)

History

#1 Updated by Erik Lindahl over 4 years ago

I largely agree - there are a large number of recurring false errors, which in turn leads to an even higher load when we need to retrigger things.

Highly architecture-specific code is probably the least important to test, since it will normally have been developed and tested on the hardware in question (there is no way I would have been able to write e.g. VSX code without testing on a Power8). Conversely, the most important things to test are likely things that few of us work on, but where we have a long-term wish that we don't want to screw things up (say, Windows or a couple of non-standard compilers). It would be great to have a big-endian testbed for the same reason, but I'm not sure there is any easy way to achieve that without resorting to emulation.

Some random thoughts:

0) I agree that we need to start debugging Jenkins. What documentation do we have about the current setup, and are there any ways we can start to help even without prior experience?

1) Hardware is reasonably cheap, unless we need highly specialized hardware or multiple copies of hardware to guarantee access. Maybe it would be more cost-efficient to simply have a bunch of desktop-class servers that are more replaceable, so we can rapidly disable a few tests if server 7 fails? Compiles are disk-heavy, so I'm not sure how much slower things are due to the VM setup compared to physical servers.

2) In most cases only one job out of 25-30 fails. Is there a way we can allow retriggering just the failed server instead of all of them when we don't change the commit?

3) As a developer who tends to work in batches, it can initially be frustrating to understand what all the new tests in Jenkins are, why/how things have changed, and what the new requirements are. It would be great to have a webpage where we collect all changes and document behavior.

4) I would love a systematic naming of the hosts and some sort of login procedure so it is easy to log in to the server manually and debug when the error is more complex.

#2 Updated by Mark Abraham over 4 years ago

Sounds good. I wish there was a way to version the OPTIONS matrix, but it seems there is not. So at least the versioning of a wiki page would be good

Erik on email last night:

PS: To focus our efforts better, it is probably time to start thinking about a tiered system. Tier-0 would be everything that is in Gerrit. Tier-1 is everything we test before a release (which will be a bit in flux), and for Tier-2 systems we will fix things if we can (which might require access to hardware, that the bug is in Gromacs, or at least that the compiler workaround is trivial).

For Tier-0 we would then do what it takes to work around compiler bugs (or deprecate support if the compiler is unfixable), and for Tier-1 it’s best-effort. For Tier-2 we might tell people things are broken, but we won’t invest any resources to fix it.

After Gromacs-5.1 we should probably move both K-computer and BG/Q to Tier-2, IMHO.

So perhaps something like

  • Tier 0 = per-commit tested on Jenkins; we pick a set of versions and code paths from which we make a representative sample (including one or two special-case really-old configs for safety?); perhaps we run an all-against-all matrix before even patch releases; expectation is zero warnings and zero failed tests
  • Tier 1 = weekly tested on Jenkins; generally older versions of Tier 0 stuff; expect zero warnings and zero failed tests
  • Tier 2 = stuff we care about and will try to test before releases, but no promises (e.g. Power*, Cygwin, Mingw, K, BG/Q, anything for which we can't set up a Jenkins slave); compiler warnings are permitted (but people can fix them if they want to); test failures should go on the TODO list to fix
  • Tier 3 = stuff we expect should work, but we're unlikely to invest any time unless it's clearly our problem and there's hardware and logins readily available to get the job done (e.g. K-computer and at least BG/Q with xlc after 5.1)
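
To make the mapping from tiers to testing cadence concrete, here is a minimal sketch of the policy written down as data (the structure and field names are invented for illustration; nothing like this exists in releng today):

    # Hypothetical sketch of the tier policy as data; tier names and example
    # platforms come from the discussion above, the format itself is made up.
    TIERS = {
        0: {"when": "per-commit (Gerrit verification)",
            "expectation": "zero warnings, zero failed tests",
            "examples": ["representative compiler/SIMD/CUDA sample"]},
        1: {"when": "weekly on Jenkins",
            "expectation": "zero warnings, zero failed tests",
            "examples": ["older versions of Tier 0 setups"]},
        2: {"when": "best effort before releases",
            "expectation": "warnings allowed; failures go on the TODO list",
            "examples": ["Power*", "Cygwin", "Mingw", "K computer", "BG/Q"]},
        3: {"when": "only if hardware/logins are readily available",
            "expectation": "no promises",
            "examples": ["K computer and BG/Q with xlc after 5.1"]},
    }

    def describe(tier):
        """Return a one-line summary of a tier, e.g. for a wiki/docs generator."""
        info = TIERS[tier]
        return "Tier %d: tested %s; %s" % (tier, info["when"], info["expectation"])

    if __name__ == "__main__":
        for tier in sorted(TIERS):
            print(describe(tier))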

#3 Updated by Mark Abraham over 4 years ago

Erik Lindahl wrote:

I largely agree - there are a large number of recurring false errors, which in turn leads to an even higher load when we need to retrigger things.

Yes. I hope the fep_state fix just merged will quiet some of them.

Highly architecture-specific code is probably the least important to test, since it will normally have been developed and tested on the hardware in question (there is no way I would have been able to write e.g. VSX code without testing on a Power8). Conversely, the most important things to test are likely things that few of us work on, but where we have a long-term wish that we don't want to screw things up (say, Windows or a couple of non-standard compilers). It would be great with a big-endian testbed for the same reason, but I'm not sure there is any easy way to achieve that without resorting to emulation.

I have been considering trying http://www.appveyor.com/ as a replacement for our own Windows configs. Even if we have to pay for access to a few simultaneous builds, that might well be cost-effective.

Some random thoughts:

0) I agree that we need to start debugging Jenkins. What documentation do we have about the current setup, and are there any ways we can start to help even without prior experience?

That's hard. I complained about it myself when I started here. It's always in a state of flux and nobody has time to write docs. We use http://redmine.gromacs.org/projects/gmx-sysadmin/boards as a place to discuss things, and there are some docs there. Mostly Stefan has collected some docs here http://redmine.gromacs.org/projects/gmx-sysadmin/wiki/Index.

1) Hardware is reasonably cheap, unless we need highly specialized hardware or multiple copies of hardware to guarantee access. Maybe it would be more cost-efficient to simply have a bunch of desktops-class servers that are more replacable, so we rapidly disable a few tests if server 7 fails? Compiles are disk-heavy, so I'm not sure how much slower things are due to the VM setup compared to physical servers.

My impression is that the Windows builds disk usage sucks more than the others, but we are also horrible in the way we test (e.g. we do 42 grompp + energy minimization runs in order to test pdb2gmx). Otherwise, I think the hardware mix is OK. I'd still like some hardware to dedicate for performance regression testing, but we never seem to have money for that. There are perhaps some ways to use Intel or Allinea hardware-counter tools to observe e.g. unexpected changes in AVX instruction counts that would serve as partial proxies for real timing.

2) In most cases only one job out of 25-30 fails. Is there a way we can allow retriggering just the failed server instead of all of them when we don't change the commit?

Yes, the matrix-reloaded plugin is installed for that. See left hand bar when you're on a Jenkins job webpage. Sometimes it even updates the patch with the new result (but I'm not sure what the expected behaviour is there).

3) As a developer who tends to work in batches, it can initially be frustrating to understand what all the new tests in Jenkins are, why/how things have changed, and what the new requirements are. It would be great to have a webpage where we collect all changes and document behavior.

I think the GROMACS side of the equation (ie non-Jenkins) is largely covered here http://jenkins.gromacs.org/view/Gerrit%20master/job/Documentation_Gerrit_master/javadoc/ but people are not yet in the habit of knowing where to look. We could do more, there are plenty of TODOs noted, suggestions are welcome, etc.

I don't think a changelog of Jenkins configs is going to help such a developer much. What would you want to know that we haven't covered in GROMACS docs?

4) I would love a systematic naming of the hosts and some sort of login procedure so it is easy to log in to the server manually and debug when the error is more complex.

Enabling LDAP logins would seem to be the local solution. Roland has his own logins already, and he's basically the only one who needs/wants one.

#4 Updated by Roland Schulz over 4 years ago

#5 Updated by Szilárd Páll over 4 years ago

Mark Abraham wrote:

Some random thoughts:

0) I agree that we need to start debugging Jenkins. What documentation do we have about the current setup, and are there any ways we can start to help even without prior experience?

That's hard. I complained about it myself when I started here.

So sure, the documentation is lacking and was not a priority in the early times. Additionally, the initial setup was rather simple, far simpler than the current one which has grown a lot in complexity since I took my hands off of it (deployment, docs, 4 versions, etc. - I haven't even tried to keep track of it), but docs have still not been a priority, not even for the new stuff.

It's always in a state of flux and nobody has time to write docs. We use http://redmine.gromacs.org/projects/gmx-sysadmin/boards as a place to discuss things, and there are some docs there. Mostly Stefan has collected some docs here http://redmine.gromacs.org/projects/gmx-sysadmin/wiki/Index.

IMHO the problem is that the Jenkins setup is still maintained and extended in an ad-hoc, undocumented way, largely on a per-need, per-personal-wish/opinion basis. And plenty of additions have been made, but not even the tiny bit of user-/dev-facing docs (on the wiki) has been maintained.

1) Hardware is reasonably cheap, unless we need highly specialized hardware or multiple copies of hardware to guarantee access. Maybe it would be more cost-efficient to simply have a bunch of desktops-class servers that are more replacable, so we rapidly disable a few tests if server 7 fails? Compiles are disk-heavy, so I'm not sure how much slower things are due to the VM setup compared to physical servers.

My impression is that the Windows builds disk usage sucks more than the others, but we are also horrible in the way we test (e.g. we do 42 grompp + energy minimization runs in order to test pdb2gmx). Otherwise, I think the hardware mix is OK.

I agree the mix is OK. We have the hardware, but we don't even make use of it. E.g. we have AMD Bulldozer hypervisors (and KVM can pass CPUID flags to the VMs), but we don't even test such builds.

I'd still like some hardware to dedicate for performance regression testing, but we never seem to have money for that. There are perhaps some ways to use Intel or Allinea hardware-counter tools to observe e.g. unexpected changes in AVX instruction counts that would serve as partial proxies for real timing.

We have several dev boxes, e.g. the two new 8-core Haswell machines. I don't get what kind of specialized hardware is necessary to do at least the bare minimum: check SIMD parallelization and light threading/DD. Even single-threaded performance regression tests would have caught a number of regressions that I only noticed months after the breakage, simply because I compare with previously obtained logs - something that even a machine could do.

4) I would love a systematic naming of the hosts and some sort of login procedure so it is easy to log in to the server manually and debug when the error is more complex.

Enabling LDAP logins would seem to be the local solution. Roland has his own logins already, and he's basically the only one who needs/wants one.

LDAP could be enabled, but I don't think it's necessary (likely everybody who would ever need/want to log in knows our localadmin passwords) and it poses a security risk as long as restrictions can't be placed on who is and who is not allowed to log in with their LDAP account. We could collect some SSH keys and use those to allow developers to log in to the build slaves (or attempt setting up LDAP subset allow rules).

I don't think hostnames should matter much; the wiki page was aimed at documenting the build slave configs, but it has not been updated for nearly a year (for 2+ years if we don't count the "Logging into build slaves" section).

#6 Updated by Teemu Murtola over 4 years ago

Szilárd Páll wrote:

So sure, the documentation is lacking and was not a priority in the early times. Additionally, the initial setup was rather simple, far simpler than the current one which has grown a lot in complexity since I took my hands off of it (deployment, docs, 4 versions, etc. - I haven't even tried to keep track of it), but docs have still not been a priority, not even for the new stuff.

Maybe not directly related to actual Jenkins documentation, but there is ~13k words of documentation in docs/dev-manual/ in the main repository, and this gets automatically built (or would, if the Documentation_Nightly_master job worked; I now reconfigured it to be much simpler and triggered it manually, but we'll see whether the SCM poll is still broken...), and the output is linked from the wiki. And most of this documentation is focusing on the stuff that Jenkins does in the documentation and uncrustify jobs. It could still use collecting the Jenkins-related information and links to a single location. The part that is not documented is of course how Jenkins is configured related to this, but that documentation cannot really live in the main repository.

IMHO the problem is that the jenkins setup is still maintained and extended in an ad-hoc, un-documented, and largely on a per-need per-personal wish/opinion basis. And additions have been made plenty, but not even the tiny bit of user-/dev-facing docs (on the wiki) have been maintained.

Dismissing those 13k words of existing documentation as "unmaintained" because they are not on the wiki is, um, not very motivating for those people (mostly a single person) who has put a lot of effort there, with very little feedback at any point...


I will try to collect my thoughts on the actual Jenkins configuration issue some time soon, but here's a copy-paste from an exchange in Gerrit that is somewhat related:

Erik:

... long-term it's probably not worth polluting our makefiles with highly compiler-specific workarounds just to hide warnings that might really be the compilers' fault.

me:

I think we would really benefit from a clear statement of which compilers we want to support and to which level (e.g., which warnings we want to eliminate), on a level detailed enough for developers. It would help in discussions of this kind (as well as, e.g., the C++11 feature discussion that was over e-mail at some point), as well as provide very useful input for things like #1732. And it should consider how we support commercial compilers that not everyone has access to.
Since you seem to have quite strong opinions on this, I would suggest that you sketch such a statement (possibly after discussion within Stockholm). A very natural place would be somewhere in the docs/dev-manual/ structure; it will likely be quite static, and any changes will be tied to source code changes, so I don't think a wiki page works. Also, Gerrit will provide a natural place to discuss the policy, so that by the time it is merged (or any changes to it are merged), people have a common understanding of what is the plan going forward.

Mark:

Agree we should have and document policy, particularly with the various changes planned. To make sure I get back to this after 5.1, Teemu, can you post thoughts like the above at #1732, please?

#7 Updated by Erik Lindahl over 4 years ago

Hi,

I've created a separate task for C++11 (http://redmine.gromacs.org/issues/1745) where we can also discuss compilers.

#8 Updated by Teemu Murtola over 4 years ago

Some smaller thoughts:

If we want to avoid usage spikes in Jenkins for the per-commit verification and make the typical per-commit verification faster, it probably makes sense to configure the jobs such that whatever we run on a post-merge/nightly/weekly basis can also be triggered for a change waiting for review in Gerrit if needed. So that if a particular change is suspected to break something, it can be verified manually (ideally, we would also get Jenkins to post the results of such manually triggered verifications back to Gerrit). But if such extra verification is necessary for only one change out of 100, it might be better to not run it for every single patch set, unless it is cheap to do so. It just needs to be easy and documented how to trigger these manual verifications.

#9 Updated by Teemu Murtola over 4 years ago

In order to get somewhere with this, we should get concrete about what should be done, instead of discussing just on a general level that "we should have different levels of verification". And that requires everyone to pitch in with what they think is important, and how important it is relative to the cost it incurs.

I also think that we should think separately about compilation and other verification. Currently, everything in the verification matrix runs every test. But we could easily test different compilation options without such extensive testing, allowing us to extend the coverage in different parts separately. It makes sense to check that the binaries compiled from each setup work, but for example, I see no value in running all the tests for builds that only differ in, e.g., GMX_BUILD_OWN_FFTW (if we would test it, to avoid bugs like the one that slipped into 5.0.5), or in GMX_BUILD_MDRUN_ONLY.

If we need to prioritize what we test for each patch set, I would put emphasis on things that block development: the code should always compile warning-free on the environments used by various developers, and pass basic tests. Extensive testing of each GMX_SIMD option, for example, I would leave more to on-request/post-merge testing. Subtle issues in mdrun can also be difficult to catch, and would probably benefit from on-request testing. If/when the blocker for more complicated build setups (like multiple chained Jenkins jobs) has been the inability to report them back to Gerrit, this should also lift some of the limitations, as there could be fewer interacting plugins in the post-merge testing. We can still have Jenkins run most of the tests (except for the most expensive ones) more or less immediately after the merge, e-mail people who broke the tests, and provide visibility on the state of the branch.


On the technical side, the more complicated/varied testing we include in Jenkins, the more important it becomes that the interaction between Jenkins and the build system remains stable. To me, that ideally means that the contents of the tests (and also the builds) are mostly specified in the source repository, not in a separate releng repository or in the Jenkins job configuration, so that they can evolve in sync with the code that they test. In particular, adding new tests for newly added/changed functionality should not break the build for all the old changes still pending in Gerrit.

There is value in having the ability to add new build setups without changing the source repository, but ideally we would have also an ability to add new build setups in sync with changes in the source repository that require them. In a large number of cases (like gcc-5 recently), adding a new setup anyways requires some changes in the source repository. And the more we can move the detailed specifications to the source repo, the less need we will have for maintaining separate copies of the per-change builds for each branch, reducing the possibility that they go out of sync.


Similar to what I said at some point on gmx-developers in a discussion on regression testing, if the limiting factor is people to do the actual implementation (and not deciding what that implementation should do), I can volunteer to help with scripting required for the approach (if people think that this would be the most valuable contribution I can make). I cannot promise to deliver things in O(days), but if the alternative is not getting anything done, it should not be a big deal to wait a few weeks or months. And I cannot take any big responsibility in Jenkins maintenance, since I cannot dedicate or predict the amount of time I have at a given moment, and the infrastructure should remain running.

For discussing the general approach to verification, it would be very useful to have some documentation and experiences from the current setup; people with admin rights to Jenkins can reverse-engineer relatively easily what it currently does, but most valuable for this development would be the why part, i.e. what alternatives have been tried and why they haven't worked as well as the current setup.

#10 Updated by Erik Lindahl over 4 years ago

We've talked a bit about Jenkins, and I think we have a plan in place to improve the hardware side of things (but we'll wait until after 5.1 has been released to avoid screwing up things last-minute). We got to take over ~15 simple dual quad-core AMD nodes from the supercomputing center, and we will likely get even newer hardware from the next cluster they retire very soon.

The current plan is to move to using the Docker plugin for Jenkins. This way we could have a large pool of different hardware, but that is flagged by node capabilities (e.g. AVX2 or CUDA-GPU), and rather than specifying a static host for each test, the tests would have similar flags that are matched to the available hosts. I think this will remove at least some of the current bottlenecks, and it should both make it easier to add new hardware and avoid having single-points-of-failure. If any node is acting up, we just take it offline.

I very much like the idea of moving the test specifications to the source directory. Then we could simply have lists of combinations we should test e.g. for every commit, every night, or every month.

#11 Updated by Teemu Murtola over 4 years ago

Hardware and redundancy in the build slaves will help, but then you need to set up infrastructure that ensures that software on the machines is equivalent and/or add labels also for all software, which leads to an explosion of combinations. At least for per-patch set verification, I think it suffices to test a single Linux distribution (and Mac+Windows), and focus on our own build options. That should reduce the variability in the node configuration as well.

#12 Updated by Szilárd Páll over 4 years ago

Teemu Murtola wrote:

Szilárd Páll wrote:

So sure, the documentation is lacking and was not a priority in the early times. Additionally, the initial setup was rather simple, far simpler than the current one which has grown a lot in complexity since I took my hands off of it (deployment, docs, 4 versions, etc. - I haven't even tried to keep track of it), but docs have still not been a priority, not even for the new stuff.

Maybe not directly related to actual Jenkins documentation, but there is ~13k words of documentation in docs/dev-manual/ in the main repository, and this gets automatically built (or would, if the Documentation_Nightly_master job worked; I now reconfigured it to be much simpler and triggered it manually, but we'll see whether the SCM poll is still broken...), and the output is linked from the wiki.

No, it is not directly related and I have to admit, I am not very up to date in this respect. Have not had time to look into it, only recall reviewing part of these docs on relocatable binaries a while ago.

And most of this documentation is focusing on the stuff that Jenkins does in the documentation and uncrustify jobs. It could still use collecting the Jenkins-related information and links to a single location. The part that is not documented is of course how Jenkins is configured related to this, but that documentation cannot really live in the main repository.

I agree, and let me emphasize, I have not been criticizing how the specific tasks (like uncrustify and docs generator/checking) are implemented and documented, but rather the actual CI setup that carries out these and a bunch of other tasks.

IMHO the problem is that the jenkins setup is still maintained and extended in an ad-hoc, un-documented, and largely on a per-need per-personal wish/opinion basis. And additions have been made plenty, but not even the tiny bit of user-/dev-facing docs (on the wiki) have been maintained.

Dismissing those 13k words of existing documentation as "unmaintained" because they are not on the wiki is, um, not very motivating for those people (mostly a single person) who has put a lot of effort there, with very little feedback at any point...

Again, sorry if it sounded like I dismissed those docs, I did not mean to. It simply did not even cross my mind to consider those developer docs as part of the infrastructure documentation.

On the other hand, if feedback is lacking (which given the level of interaction on gmx-dev lately is not too surprising), could that be also because people don't know (enough) about it? My guess is that most devs are not even aware of the existence of this very valuable documentation and don't know that help would be needed in editing/reviewing. Perhaps dropping a brief mail to the gmx-dev list as an FYI/RFC would help.

And let me note that I have seen the 2-3 review requests you sent me, but I simply did not have time to honor them - yet.

#13 Updated by Szilárd Páll over 4 years ago

I agree with the need to reduce the non-critical components of the verification. Moving a lot of the current matrix verification cases to post-merge or nightly/weekly tests makes sense together with increasing the number of such tests substantially. I also agree that some of the verification does not need to run the full set of unit & regression tests and some could even avoid running code at all (to test cross-compilation or compilation for hardware that we may not have - e.g. Win+GPU).

A 'quick-check' target could be useful for the non cross-compilation cases, this could (initially?) have a dummy implementation that only verifies that the binary does work.

Teemu Murtola wrote:

On the technical side, the more complicated/varied testing we include in Jenkins, the more important it becomes that the interaction between Jenkins and the build system remains stable. To me, that ideally means that the contents of the tests (and also the builds) are mostly specified in the source repository, not in a separate releng repository or in the Jenkins job configuration, so that they can evolve in sync with the code that they test. In particular, adding new tests for newly added/changed functionality should not break the build for all the old changes still pending in Gerrit.

Good points about stability and the need to not break previous changes. However, keeping complete test configs in the source repository sounds both like overkill and hard to accomplish. There needs to be some external configuration that hardcodes the specifics of our infrastructure, I think.

Teemu Murtola wrote:

Hardware and redundancy in the build slaves will help, but then you need to set up infrastructure that ensures that software on the machines is equivalent and/or add labels also for all software, which leads to an explosion of combinations. At least for per-patch set verification, I think it suffices to test a single Linux distribution (and Mac+Windows), and focus on our own build options. That should reduce the variability in the node configuration as well.

More hardware means more work (at least short to mid-term) and potentially little benefit - compared to the redesign of the Jenkins setup, which I believe is what would really help.

The two types of workloads discussed here are quite different: making verification faster is a latency problem, while increasing the overall test coverage is a throughput problem. More hardware will hardly help with the former - at least not without wasting a lot of resources (on average) and investing quite some effort - and whether the latency impact of software like Docker is acceptable may or may not be a dealbreaker. For the latter case, however, it would certainly be great to increase the sheer number of nightly/weekly tests and with that the coverage of code and software+hardware combinations.

As you can see on the monitoring page, the average hardware utilization measured over the last year is 11%, and this is an overestimate because i) both the hypervisors and the VMs running on them are included in the measurements (hence double-counting busy cores), and ii) we've had rogue mdruns running in the background for days or weeks in the past. How much more over-provisioning would be needed to get the verification down to the desired 10-15 minutes without much redesign of the test setup?
So to conclude, of course more hardware is not useless, but it is likely not of primary importance, especially given the non-trivial amount of work required just to put new hardware to use, and the fact that this effort will have to be diverted from other tasks.

#14 Updated by Roland Schulz over 4 years ago

We can reduce the latency immediately if we 1) reduce the number of configurations a lot and move most of them to post-submit and 2) come up with a fast way to test Windows.

1) We just need to agree on which tests we want pre-submit.
I suggest these 5 configs:
- TSAN (using newest gcc)
- ASAN (using clang)
- MSVC (non-posix + non C90)
- Double (using oldest gcc)
- ICC

would be sufficient for pre-commit. And if we only run 1 job per host, then those jobs run in 10 min: http://jenkins.gromacs.org/job/Gromacs_Gerrit_master/8929/. Also, if multiple patches are uploaded at the same time, they wouldn't need to wait on each other.

2) This would leave fixing Windows/MSVC. And I think that is important, because it is relatively common that those fail because someone forgot about non-POSIX or non-C99. This can probably be fixed most easily by having a non-virtualized Windows machine (maybe with an SSD).

#15 Updated by Erik Lindahl over 4 years ago

I agree that Windows is important, not because we have any significant fraction of users there, but because the compilers (and OS environment) are very different. I've already planned to use non-VM hosts for Windows - I might even try to submit an order for the hardware tonight ;-)

#16 Updated by Teemu Murtola over 4 years ago

I agree that we probably cannot get all the config into the source repository, but I don't think we need to, either. Rather, we should put some effort into making the interface between Jenkins and the code in the source repository more stable, which means that this interface should not be used for anything else. Configuration directly related to Jenkins would still live in Jenkins (like which tests to run on which hosts/labels, etc.).

For example, it is counterproductive if most changes to the user-/dev-visible interface in the build system cause Jenkins build failures and rebasing needs. I think something like https://gerrit.gromacs.org/#/c/4648/ is the way to go; it will just be somewhat more complicated for the full compilation/verification/installation/testing setup, but certainly doable, and at least some of the effort will pay back in the future.

The essential question is whether we want to be able to add new verification setups without changing the source repository, i.e., such that they apply also retrospectively to older changes. If we do, then the configuration needs to live outside the source repo, but it could still be in releng instead of directly as a single string in Jenkins. Having the configuration as a script somewhere provides much more flexibility than the current single-line OPTIONS string. We can still introduce some type of label system that allows us to introduce new verification setups that only run if the source code specifies the corresponding label. We may not want such complexity initially, but it is good to keep this in mind when selecting the overall approach.
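
As an illustration of that label idea, a minimal sketch (the file name, label names, and config entries are hypothetical): the source repository declares which optional setups it supports, and the scheduling side only runs the optional configs whose label is declared.

    # Hypothetical sketch: gate optional verification configs on labels declared
    # in the source repository, so new setups only apply to changes that opt in.
    import os

    # All configs known on the Jenkins/releng side; "label" marks optional ones.
    ALL_CONFIGS = [
        {"name": "gcc-newest tsan",   "label": None},       # always runs
        {"name": "clang asan",        "label": None},
        {"name": "gcc-oldest double", "label": None},
        {"name": "clang-3.6 opencl",  "label": "opencl"},   # opt-in
        {"name": "icc mkl",           "label": "icc"},      # opt-in
    ]

    def read_labels(source_root):
        """Read opt-in labels from a (hypothetical) file in the source repo."""
        path = os.path.join(source_root, "admin", "verification-labels.txt")
        if not os.path.exists(path):
            return set()
        labels = set()
        with open(path) as fp:
            for line in fp:
                line = line.strip()
                if line and not line.startswith("#"):
                    labels.add(line)
        return labels

    def configs_to_run(source_root):
        """Return the names of configs to schedule for this source tree."""
        labels = read_labels(source_root)
        return [c["name"] for c in ALL_CONFIGS
                if c["label"] is None or c["label"] in labels]

    if __name__ == "__main__":
        print(configs_to_run("."))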


For the pre-submit verification, I think that we really need at least the following to keep the code in a compilable state (as a somewhat separate axis from the compilers, although there shouldn't be any big need to test more than one compiler with any of these):

  • GMX_DOUBLE on and off
  • GMX_MPI on and off
  • GMX_GPU on and off
  • With and without C++11 support (until we really decide that it is required)

Depending on how we think about their possible interactions, this means anything between two and ten configs.
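
To make the "two to ten" estimate concrete, a small sketch (the picked combinations below are illustrative only, not a proposal): the four on/off axes give 16 exhaustive combinations, but two configs already cover every option in both states, and a few more keep the number of options toggled at once small.

    # Illustrative sketch: covering each of the four on/off axes in both states
    # with far fewer than the 16 exhaustive combinations.
    import itertools

    AXES = ["GMX_DOUBLE", "GMX_MPI", "GMX_GPU", "CXX11"]

    # Two configs are the lower bound; a third reduces how many options are
    # toggled at once, which helps when diagnosing a failure.
    PICKED = [
        {"GMX_DOUBLE": "ON",  "GMX_MPI": "ON",  "GMX_GPU": "OFF", "CXX11": "OFF"},
        {"GMX_DOUBLE": "OFF", "GMX_MPI": "OFF", "GMX_GPU": "ON",  "CXX11": "ON"},
        {"GMX_DOUBLE": "OFF", "GMX_MPI": "ON",  "GMX_GPU": "ON",  "CXX11": "OFF"},
    ]

    def covered(configs):
        """Check that every axis appears with both ON and OFF across the configs."""
        return all({c[axis] for c in configs} == {"ON", "OFF"} for axis in AXES)

    if __name__ == "__main__":
        exhaustive = list(itertools.product(["ON", "OFF"], repeat=len(AXES)))
        print("exhaustive combinations:", len(exhaustive))                # 16
        print("picked:", len(PICKED), "covers both states:", covered(PICKED))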

Additionally, GMX_BUILD_MDRUN_ONLY=ON needs to be tested (it is not currently), but this just means that one of the configs should do a second build of this (or a separate config for this). This config doesn't really need to run any tests, except possibly some sanity checks.

And if we use ASAN for one or two (maybe both single and double), there should be no need to run valgrind, speeding up the tests significantly.

#17 Updated by Erik Lindahl over 4 years ago

I think the mdrun-only option is deprecated since everything lives in the gmx binary now - there is no separate mdrun binary.

It also makes sense to have an OS X config in Jenkins (we might have to update that hardware) since it's slightly different, but a very common platform. However, this could probably be combined with one of the other tests. With the new OpenCL support we likely want tests for both CUDA & OpenCL to make sure changes in one architecture don't break the other.

Hardware is cheaper than person-time, so if we can save work (but not swap it for sysadm work...) I don't mind getting more hosts, within reason.

#18 Updated by Szilárd Páll over 4 years ago

Roland Schulz wrote:

We can reduce the latency immediately if we 1) reduce the number of configurations a lot and move most to past submit and 2) come up with a fast way to test Windows.

I agree!

1) We just need to agree on which tests we want pre submit.
I suggest that 5 configs:
- TSAN (using newest gcc)
- ASAN (using clang)
- MSVC (non-posix + non C90)
- Double (using oldest gcc)
- ICC

If we omit testing with multiple versions of gcc and clang I don't think we should include icc either - it's not a very "hot" compiler, to be honest :).

Or rather, I would suggest keeping the current strategy of testing a decent (but perhaps reduced) set of compilers and important libraries. Otherwise, I'm afraid a lot of fixing of cosmetic stuff will be needed (discovered by post-merge tests). Also, the majority of developers do and will test only with their default setup, and we really want to avoid the hassle that comes with commits that accidentally introduce compiler warnings or break builds with some common dependency.

Hence, I suggest not omitting the latest 3 or so versions of gcc and clang, at least one icc setup, and some CUDA-enabled configs too.

And if we only run 1 job per host than those jobs run in 10min:

I mentioned 10-15 min because I remembered that somebody mentioned it as a preferred target. However, I should have made it clear: I do not think that reducing verification time to much below 30 min is so important. Does it really make such a big difference to get verification results 10-15 min earlier? I'm not sure it does, unless we want to cater for little to no offline testing and a "just push it up to test if it compiles" workflow. However, while convenient, encouraging the use of the Jenkins server and its verification feature for unverified, non-merge-ready code just to avoid running uncrustify, doxygen, or tests locally risks creating even larger workload spikes during merge rush periods. Already now such seemingly prematurely pushed and published commits create quite some noise and clutter on the Gerrit page.

2) This would leave fixing Windows/MSVC. And I think that is important, because it is relative common that those fail because someone forgot about non-posix or non-C99. Which probably can be fixed the easiest by having a non-virtualized Windows machine (maybe with SSD).

Do you have an idea what the actual limiter is? Last time I tested (admittedly a long time ago), synthetic random and sequential r/w tests did not show anomalously low IO performance. However, the compilation part seems to have always been very slow.

The trouble with more and more iron is that the more "skinny" machines we have with the OS running on bare metal, the larger the maintenance cost and the higher the risk of losing a critical verification config without an easy fallback (like migrating the VM).

#19 Updated by Erik Lindahl over 4 years ago

I've ordered hardware for a separate windows box, so we can forget about that for now - hopefully that will solve the extremely slow windows-VM builds.

And, when thinking about it I think we can get rid of the OS X slave too. Several developers use OS X laptops, so it's very unlikely that we'll completely miss bugs on OS X - and that will remove one very special configuration.

#20 Updated by Szilárd Páll over 4 years ago

Erik Lindahl wrote:

I think the mdrun-only option is deprecated since everything lives in the gmx binary now - there is no separate mdrun binary.

The mdrun-only build had a well-defined role with 5.0 and did not exist just for backward compatibility.

For the MPI build the mdrun-only target still does make sense, I think, and the idea of creating a separation between the very portable, performance-oriented reduced code-base that can be selected with an mdrun-only build is still very relevant, isn't it? As we discussed before, this should be a way to ensure that the core mdrun functionality compiles even with rough and edgy HPC compilers, even as the rest of the code-base goes all in with C++, extra dependencies, etc.

I've ordered hardware for a separate windows box, so we can forget about that for now - hopefully that will solve the extremely slow windows-VM builds.

There is a danger in having a single Win non-virtualized build slave as part of verification, I think. If it breaks, verification will go offline until the machine gets fixed. In contrast, when one of the hypervisors' PSU broke, the VMs were migrated and everything could continue.

And, when thinking about it I think we can get rid of the OS X slave too.

No need to drop it, OSX configs can be moved to post-merge testing.

#21 Updated by Roland Schulz over 4 years ago

Szilárd Páll wrote:

There is a danger in having a single Win non-virtualized build slave as part of verification, I think. If it breaks, verification will go offline until the machine gets fixed. In contrast, when one of the hypervisors' PSU broke, the VMs were migrated and everything could continue.

Unless we always have more than one server that can run any pre-submit job. E.g. if the pre-submit Windows/MSVC job is allowed to run on either the AMD or the Intel Windows host, then it would work with either slave being offline. And we could still test Intel and AMD on Windows in post-submit. And for post-submit it is far less critical if verification isn't working for a certain time. A disadvantage of running the same job on different hardware is that it makes it less reproducible. But if we have the same configuration for both AMD and Intel in post-submit, one would still know that any error showing up is caused by the commit and not by the different hardware (if the most recent post-submit is fine and the commit is based on master).

#22 Updated by Teemu Murtola about 4 years ago

Szilárd Páll wrote:

I mentioned 10-15 min because I remembered that somebody mentioned it as a preferred target. However, I should have made it clear: I do not think that reducing verification time to a lot below 30 min is so important. Does it really make such a big difference to get verification results 10-15 min eailerler? I'm not sure it does unless we want to cater for little to no offline testing and "just push it up to test if it compiles" workflow. However, while convenient, encouraging the use of the jenkins server and its verification feature for unverified, non-merge ready code just to avoid running uncrustify, doxygen, or tests locally risks creating even larger workload spikes during merge rush periods. Already now such seemingly prematurely pushed and published commits create quite some noise and clutter on the gerrit page.

I agree that 30 min by itself is not that much, but currently it quickly adds up; if one actually tries to be nice to the reviewers and splits changes into smaller pieces, it quickly means that it takes several hours to get the verification results for all the changes. There are also different levels of "just push it up to test if it compiles"; I don't think we want to encourage everyone to keep half a dozen build trees and to install CUDA, MPI, etc. on all their development environments, just so that they can test everything locally. We should just encourage people not to do unnecessary rebases and not to upload less-important stuff when there is a lot of activity, e.g., before a release.

As for the Windows builds being slow, it is just my general experience (from non-Gromacs stuff) that compiling things just appears to be slow at least on MSVC. I've heard from a colleague that setting up a ramdisk and storing the source code and intermediate build results there helped him somewhat, but it's not just disk-intensive.

#23 Updated by Teemu Murtola about 4 years ago

I would suggest this kind of general setup for the jobs:
  1. Per-patchset verification runs jobs like it currently does
    • The exact compiler setup and compilation options for the matrix job need to be decided, but preferably it is less than it currently is to avoid filling up all the executors with a single upload. Also, whether all of these run all the tests etc. is up for discussion.
    • I think we should also keep the other jobs, except for the coverage one in this set, unless they really start taking up too much resources. We could investigate avoiding some unnecessary builds (e.g., not running the matrix job if a commit only touches docs/), but there is not that much to gain there.
    • We could also invest in making the results more user-/developer-friendly. I did some of that for the documentation and uncrustify jobs at changes leading up to https://gerrit.gromacs.org/#/c/4695/; feedback is welcome. This could also include investigating (again) whether we could use, e.g., a chain of jobs to do some tasks, like running the tests as a separate Jenkins job.
  2. Verification that runs after merging.
    • Initially, this could just be the current matrix job, possibly with some tweaks.
    • I'm not sure whether an SCM poll or a Gerrit trigger on the merge event is a better option for this.
      • If the network or some other external environment is flaky, Gerrit trigger could miss some changes, but SCM poll would eventually trigger. On the other hand, SCM poll might not build each merged change individually.
      • At minimum the author of the change should be informed if things break, but Gerrit trigger would allow the results to be posted back.
      • There is some value in keeping the builds sequential, i.e., having the build numbering correspond to the order things are actually in the repo (I'm not sure whether Gerrit trigger would preserve this if multiple changes are merged with a single action).
  3. On-demand verification.
    • I think that for each job that we do not run for each patchset, there should be a separate job that can be triggered manually, so that risky changes can be tested before merging.
    • One potential option for this is to use Gerrit trigger to trigger on specifically formatted comments in Gerrit (e.g., commenting "[REQUEST] Coverage" would trigger the coverage build, and report results back once the build is done).
  4. If we have some really expensive testing (like heavy performance runs), we could run those more seldomly (nightly or weekly) using an SCM poll.

To get moving with this, we really should decide on the per-patchset verification setup so that it is as useful as possible; most other things can be done incrementally, and initially we can just do the current matrix post-merge to keep the current level of verification.
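
To illustrate the on-demand idea in item 3, here is a sketch of mapping a specially formatted Gerrit comment to a job (in practice the Gerrit Trigger plugin would do the matching with a comment-added regular expression; the comment syntax and job names below are made up):

    # Hypothetical sketch: map "[REQUEST] ..." Gerrit comments to on-demand
    # Jenkins jobs. The real matching would live in the Gerrit Trigger plugin
    # configuration; this just illustrates the mapping.
    import re

    ON_DEMAND_JOBS = {
        "coverage":    "Coverage_Gerrit_master",     # job names are invented
        "full-matrix": "Matrix_OnDemand_master",
        "valgrind":    "Valgrind_OnDemand_master",
    }

    REQUEST_RE = re.compile(r"^\[REQUEST\]\s+(\S+)", re.IGNORECASE | re.MULTILINE)

    def jobs_for_comment(comment_text):
        """Return the on-demand jobs requested in a Gerrit review comment."""
        requested = [m.group(1).lower() for m in REQUEST_RE.finditer(comment_text)]
        return [ON_DEMAND_JOBS[r] for r in requested if r in ON_DEMAND_JOBS]

    if __name__ == "__main__":
        print(jobs_for_comment("Looks risky for the GPU path.\n[REQUEST] Coverage"))
        # -> ['Coverage_Gerrit_master']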

#24 Updated by Teemu Murtola about 4 years ago

We should also have some policy about what we do with old jobs in Jenkins. Could we, for example, get rid of the 4.5-era definitions now, and 4.6 soon after? Also, there are several jobs that have a "test" in their name or otherwise do not seem to be part of the production system, and haven't been built in several months or years (I think it is fine to have some testing builds going on in Jenkins, but ideally they should be removed once no longer necessary, and the build either taken into use, or the lessons learned documented somewhere). Getting rid of unnecessary builds could help in understanding the setup.

#25 Updated by Teemu Murtola about 4 years ago

Now, let's try to get something actually done. https://gerrit.gromacs.org/5003 and https://gerrit.gromacs.org/4880 provide a new set of scripts that are not strictly necessary for the discussion here, but do (hopefully) solve some other issues with the Jenkins setup. And http://jenkins.gromacs.org/job/Gromacs_Gerrit_master-new-releng/ contains a test setup for testing a per-patchset matrix build using those scripts. As I outlined in https://gerrit.gromacs.org/5003, now would be the ideal time to decide on the contents of the per-patchset verification matrix. But so far, the discussion hasn't been very focused.

The current matrix contains eight configurations, mostly selected from the old matrix, and has builds for the following (in my mind relevant things):
  • Oldest supported gcc (was gcc-4.1 at the time I set up the matrix, but now it might be gcc-4.6).
  • Newest gcc that is available on Jenkins (5.1)
  • clang 3.4 (oldest from the old matrix) and 3.6 (similarly newest)
  • GPU and non-GPU builds, with oldest supported CUDA (4.0) and the newest (7.0).
  • Single- and double-precision builds.
  • MPI and thread-MPI builds, with OpenMP.
  • Builds using ASAN and TSAN.
  • Windows build using the oldest MSVC that we want to support.
  • One mdrun-only build.
What is missing (that I think should be there):
  • Consider whether we want to build GMX_SIMD=Reference or GMX_SIMD=None as the default, and which builds would make sense to build the other.
  • One build that does not use thread-MPI nor MPI.
  • X11 build (previous build was on bs_centos63; does any other host have the necessary libraries and headers installed?)
  • valgrind or some other memory leak checker at least for one of the configs
  • Consider which builds would make sense as release builds
  • Consider if some of the builds should use icc/mkl or atlas.
  • OpenCL? (wasn't there in the old matrix, either)

Currently, the verification takes ~10 minutes when Jenkins is idle, so there is easily space to add a few more configs, but around ten configs should probably be sufficient to cover this. One notable thing that is missing by intention is testing different SIMD setups and other platform-specific stuff; that easily makes the combinations explode, and I think the essential part here is trying to get as large coverage as possible for compilation of the codebase and generic issues. Separate, less frequent builds can be used for testing a wider combination of platform-specific things.

Please add if you think something important is missing, and I would welcome help in selecting the exact setup and choosing hosts that can actually run the different builds.
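
For reference, the matrix described above could be written down in a versioned, human-readable form along these lines (the syntax, option names, and compiler/MSVC versions are invented here; the real format would be whatever the new releng scripts end up defining):

    # Hypothetical sketch of a versioned per-patchset matrix, one config per
    # line, mirroring the configurations listed above. Syntax is illustrative.
    PRE_SUBMIT_MATRIX = """
    gcc-4.6   double    mpi
    gcc-5.1   tsan
    clang-3.4 cuda-4.0  gpu  thread-mpi
    clang-3.6 cuda-7.0  gpu  asan
    gcc-4.9   no-mpi    no-thread-mpi  simd=none
    msvc-2013
    gcc-4.8   mdrun-only
    """

    def parse_matrix(text):
        """Split the matrix text into per-config option lists, skipping blanks."""
        return [line.split() for line in text.splitlines() if line.strip()]

    if __name__ == "__main__":
        for options in parse_matrix(PRE_SUBMIT_MATRIX):
            print(options)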

#26 Updated by Erik Lindahl about 4 years ago

We might want to avoid GMX_SIMD=Reference for any builds that use high optimization - those builds can be quite slow since there are literally thousands of small loops the compiler is trying to understand in each kernel.

I like the idea to have one really barebones build where we disable everything we can.

Let's not bother with Atlas or MKL. GROMACS isn't a BLAS-intensive application, and we do have a built-in version for the simple stuff. While it's technically possible to link with another library, we should also decide where to invest our efforts, and I don't think looking into potential problems with Atlas makes our top-100 list.

When it comes to setting things up on specific hosts, be aware that we're working on getting an OpenCloud setup going, so shortly we hope to be able to run things e.g. through the Kubernetes plugin to jenkins, with flags for features rather than hard node names. Since I'm not the best expert on the current configuration, I'm not sure if we would prefer to wait until we have those hosts online?

(In particular, we didn't want to do anything that might f*sck with Jenkins before we had 5.1 out)

#27 Updated by Teemu Murtola about 4 years ago

I agree that testing ATLAS or MKL for correctness is not high on priority list. But avoiding the compilation of all the BLAS/LAPACK stuff for most of the setups would have some speed advantages. And if we test icc, we might as well test that linking against MKL works.

What is "shortly"? Will it be in production use in a week, a month, a few months? Will those plugins work together with the matrix build plugin, or do we need to completely restructure everything? I would love a more flexible setup where it would be possible to just specify stuff as "compile with this compiler using these options", and possibly assign a few tags for some exotic stuff that is necessary for that to work, instead of having to know which host I need to select to be able to use those options.

One thing that I forgot from my list is that we should explicitly have one setup in our per-patchset matrix that uses the minimum supported CMake version, and that should explicitly list the CMake version in the matrix (not like it is now, that the CMake version is implicitly determined from whatever happens to be installed on the host, and different hosts have different versions). We might also want one that has the newest CMake so that we catch violations of policies introduced in new versions.

#28 Updated by Erik Lindahl about 4 years ago

Can't say anything about timing, since we haven't done it before, but if openstack is too complicated we'll just use docker on a bunch of plain hosts, so think end-of-the-month. That doesn't mean I don't want anything changing before then, but it's just a warning so people don't get too pissed off if we need to modify the setup again pretty soon :-)

#29 Updated by Roland Schulz about 4 years ago

Does valgrind have any advantage over LeakSanitizer? According to the documentation LeakSanitizer should be on by default. But that seems to be true only starting with 3.5. This seems to be the reason our existing ASAN build doesn't fail (the tests fail on my machine with 3.5). We used to run valgrind on all unit tests. But now we have the mdrun tests and they have leaks. Should we run only the real unit tests with leak checker (either valgrind or LSAN)?

#30 Updated by Teemu Murtola about 4 years ago

Roland Schulz wrote:

Does valgrind have any advantage over LeakSanitizer? According to the documentation LeakSanitizer should be on by default. But that seems to be true only starting with 3.5. This seems to be the reason our existing ASAN build doesn't fail (the tests fail on my machine with 3.5). We used to run valgrind on all unit tests. But now we have the mdrun tests and they have leaks. Should we run only the real unit tests with leak checker (either valgrind or LSAN)?

I've never used LeakSanitizer, so I couldn't say. At least it probably doesn't integrate similarly with ctest, but that may not be a big deal. And it seems to be possible to only use it selectively for some executions, but not others, and to have suppressions, so for leak checking it's probably fine. Something that may require extra effort is getting reasonable error reports from Jenkins (and not just have the errors buried somewhere in the console log). This is/was currently at least somehow working with valgrind, so any replacement should also have the same property.

We didn't run it for all the tests, only for those that had a specific CTest label. We could just fix those labels to actually only include tests that we want (and with the new releng script, the labels could easily be changed to something more meaningful without breaking the builds, since renaming the labels only requires changes in one repo). And valgrind had an unselective suppression for everything leaked from mdrun.

#31 Updated by Mark Abraham about 4 years ago

Roland Schulz wrote:

Does valgrind have any advantage over LeakSanitizer?

In combination, the Sanitizers seem to outperform valgrind on all fronts e.g. https://code.google.com/p/address-sanitizer/wiki/ComparisonOfMemoryTools. In particular, some STL implementations don't work well with the valgrind memory leak checking, which is hard-coded by CTest to be always on. CTest also has some support for Sanitizer use, and I think overall the direction of life is Sanitizers >> valgrind. We do have an XSL in releng that converts valgrind's XML to JUnit XML. Not sure who wrote it, but as Teemu notes, it would be much better if we had some way to get Sanitizer output useful in Jenkins without users reading the console log. I'm prepared to have a go at that.

I can buy keeping a single hassle-free valgrind config working, but I think it should be on x86, support no external dependencies, use GMX_SIMD=none (valgrind lags behind hardware in SIMD support in released versions), and we ignore at least STL memory leak checking. I know valgrind is Berk's go-to tool for debugging stuff, so making sure it continues to be useful is valuable there.

According to the documentation LeakSanitizer should be on by default. But that seems to be true only starting with 3.5. This seems to be the reason our existing ASAN build doesn't fail (the tests fail on my machine with 3.5).

Sorry, too much indirection for me to understand. You're saying the ASAN build requires clang < 3.5 so that LeakSanitizer is not implicitly used? In the long term, I think we should want to manage the issues so that LeakSanitizer can be run. There is CTest support for ASAN_OPTIONS env var which I imagine can be used to turn off LeakSanitizer as an interim measure.
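
As a sketch of that interim measure: detect_leaks=0 is the standard AddressSanitizer runtime option for disabling LeakSanitizer, and it can be set just for the test run (the build directory and test label below are only examples):

    # Sketch: run ctest on an ASAN build with LeakSanitizer disabled, as an
    # interim measure until the known leaks are fixed. detect_leaks=0 is a
    # standard AddressSanitizer runtime option; paths/labels are examples.
    import os
    import subprocess

    def run_tests_without_leak_check(build_dir, label="GTest"):
        env = dict(os.environ)
        env["ASAN_OPTIONS"] = "detect_leaks=0"   # disable LSan, keep ASan checks
        return subprocess.call(
            ["ctest", "--output-on-failure", "-L", label],
            cwd=build_dir, env=env)

    if __name__ == "__main__":
        raise SystemExit(run_tests_without_leak_check("build-asan"))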

We used to run valgrind on all unit tests.

Subject to a significant number of filters in releng...

But now we have the mdrun tests and they have leaks. Should we run only the real unit tests with leak checker (either valgrind or LSAN)?

The various integration test machinery may well have leaks, but the mdrun code they test is even more leaky. They've never run with valgrind, because (as Teemu notes) the integration-test registry function didn't get the ctest label "GTest" in src/testutils/TestMacros.cmake. In my ideal world, we move ~all regressiontests functionality to this machinery and make all the sanitizers work, but a first step is fixing issues in the existing machinery and I will do/help with that.

Supporting older LLVM tooling versions is of course not required - their purpose is to find bugs and we want the latest and greatest for that. I think we should keep an old LLVM build config (e.g. the first clang version that supported C++11), but its purpose is to check that the code compiles and runs correctly; working sanitizer support there is neither expected nor required.

#32 Updated by Teemu Murtola about 4 years ago

Mark Abraham wrote:

Roland Schulz wrote:

Does valgrind have any advantage over LeakSanitizer?

In combination, the Sanitizers seem to outperform valgrind on all fronts e.g. https://code.google.com/p/address-sanitizer/wiki/ComparisonOfMemoryTools. In particular, some STL implementations don't work well with the valgrind memory leak checking, which is hard-coded by CTest to be always on. CTest also has some support for Sanitizer use, and I think overall the direction of life is Sanitizers >> valgrind. We do have an XSL in releng that converts valgrind's XML to JUnit XML. Not sure who wrote it, but as Teemu notes, it would be much better if we had some way to get Sanitizer output useful in Jenkins without users reading the console log. I'm prepared to have a go at that.

Yes, the sanitizers are probably the way to go, but unfortunately it isn't exactly trivial to use them, either. valgrind has its issues, but on platforms where it is supported (which unfortunately doesn't include the newest OS X), it's relatively straightforward to use. On OS X, though, I've never gotten clang to work well enough to actually try out the sanitizers. The Apple clang doesn't support any of them, even though it is based on 3.6. I again tried to install clang-3.6 from MacPorts, but it fails to link some boost code correctly, resulting in missing symbols that prevent linking Gromacs... If not for that, I might even be interested in helping with this (preferably in combination with the new releng script, not by hacking something more into the existing script).

#33 Updated by Teemu Murtola about 4 years ago

And in response to Erik's comments, technical changes in the underlying platform should not (significantly) affect the set of configurations we would like to test for each patch set. So even if we do not fully implement things before the system changes, there is nothing preventing us from selecting the configs we want, and from implementing as much of that as is reasonable using the existing installations on the hosts.

#34 Updated by Erik Lindahl about 4 years ago

One more configuration we might want to test: The reference build, since that's what we rely on for all the other tests. That might actually be a good candidate for the minimal setup.

#35 Updated by Mark Abraham about 4 years ago

Update: I have built Jenkins jobs that take admin/some-config-file.txt from the source repo (http://jenkins.gromacs.org/job/Test%20putting%20config%20matrix%20for%20Gromacs_Gerrit_master%20into%20source%20repo/) and use the contents to parameterize a matrix build using both the new releng scripts and the dynamic axis plugin (http://jenkins.gromacs.org/job/Gromacs_Gerrit_master-nrwpo/).
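
Purely to illustrate the idea (the real contents live in the Gerrit draft; these tokens are hypothetical), each line of admin/some-config-file.txt describes one matrix row as keywords that releng maps onto hosts and CMake options:

  # hypothetical admin/some-config-file.txt
  gcc-4.9   openmp   release
  clang-3.6 asan     no-openmp
  gcc-4.8   cuda-7.0 openmp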

While tweaking things, I sometimes hit issues where Jenkins needed to create a temporary file whose name included the job name and the options matrix; that file name got too long, so this is an issue we may need to manage at some point.

Stefan has already built a Jenkins slave in a docker instance, so I will work on deploying that so that the matrix build can call it for one job. Then probably I need to work out how to provision multiple dynamic docker instances so that more than one matrix line can be run simultaneously on multiple docker slaves.

Other TODO items
  • add CUDA to a docker image and see if it can be made to work (a rough sketch of one approach follows this list) - check with Iman for tips here
  • work out how to use the feature tags to choose which build slave will get sent each matrix line (rather than hard-coding a host)
  • (later) work out how to build (say) a CUDA build on a docker instance, save artifacts somehow, and run the tests in a downstream Jenkins job on a slave with a real GPU (may be tricky to do in combination with needing a matrix job?)
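
A rough sketch of the first TODO item, assuming the CUDA toolkit is baked into the image and the host's NVIDIA device nodes are passed through (image name and paths are placeholders; matching the container's driver libraries against the host kernel module is exactly the part that needs checking):

  docker run --rm \
      --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 \
      gromacs-cuda-build-env \
      bash -c 'cmake /ws/gromacs -DGMX_GPU=ON && make -j4 && ctest'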

#36 Updated by Teemu Murtola about 4 years ago

Some discussion on the technicalities of the approach using a file in the source repo for the matrix configuration is in Mark's current draft of that in Gerrit.

For docker, how do you manage getting the git repositories into the docker instance? If you use a static image that always spins up at the same state, it means that every build will need to do a full clone of all the repositories, which is going to be an issue. And even if you prepopulate it for some build workspaces, this issue hits again whenever someone creates a new build job that uses a different workspace.

#37 Updated by Erik Lindahl about 4 years ago

Docker isn't full encapsulation (like virtualization), but it uses "union file systems", so the docker container can still see the entire state of the underlying filesystem and potential other containers we have exposed to it.

Thus, for Jenkins we would likely have one container with the entire development environment we need, and then a second container with an up-to-date git repo. Docker images are aggressively cached on nodes, so we could likely just update this image once a day or so, or we can make a habit of creating a normal git repo in /tmp on all nodes.

#38 Updated by Teemu Murtola about 4 years ago

Erik Lindahl wrote:

Docker isn't full encapsulation (like virtualization), but it uses "union file systems", so the docker container can still see the entire state of the underlying filesystem and potential other containers we have exposed to it.

Thus, for Jenkins we would likely have one container with the entire development environment we need, and then a second container with an up-to-date git repo. Docker images are aggressively cached on nodes, so we could likely just update this image once a day or so, or we can make a habit of creating a normal git repo in /tmp on all nodes.

Something like that will likely work, but probably only in combination with something like git clone --reference, or possibly git clone --depth (which apparently both can be configured in Jenkins). With anything else, we easily end up copying the whole repository to every development environment container on every build, since we still need an independent working copy in each.
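
For concreteness, a sketch of the two options mentioned above (mirror path and repository URL are placeholders):

  # Option 1: borrow objects from a host-local mirror, so only new objects travel
  git clone --reference /srv/git/gromacs.git <gerrit-repo-url> workspace

  # Option 2: shallow clone, fetching only recent history
  git clone --depth 1 <gerrit-repo-url> workspace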

#39 Updated by Mark Abraham about 4 years ago

Teemu Murtola wrote:

Some discussion on the technicalities of the approach using a file in the source repo for the matrix configuration is in Mark's current draft of that in Gerrit.

For docker, how do you manage getting the git repositories into the docker instance? If you use a static image that always spins up at the same state, it means that every build will need to do a full clone of all the repositories, which is going to be an issue. And even if you prepopulate it for some build workspaces, this issue hits again whenever someone creates a new build job that uses a different workspace.

We've had some discussion already on email, and I think we expect we can have a host-local git clone that gets updated reasonably frequently (if needed). The docker image should be able to mount that (at least read only). git is able to do various kinds of shallow clone, so probably the build script can find something useful to do. Maybe git fetch on the host git repo (if write permissions work well; this would be efficient if some other docker instance will do the same git fetch shortly; also much of the git blob content will still be around when someone updates the repo under test during the next round of verification), then a shallow git clone into docker-instance temporary file space (effectively just a file copy). There's a balance between total network traffic and inter-docker-instance contention for updating the host-local git clone that we'll have to evaluate when something is working, but I think the above will work at least reasonably.
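
A sketch of that flow, with all paths as placeholders:

  # Keep the host-local clone fresh (cheap if another instance fetched recently)
  git -C /srv/git/gromacs.git fetch --all

  # Inside the docker instance: shallow clone from the bind-mounted host repo
  # into temporary workspace; locally this is effectively just a file copy.
  git clone --depth 1 file:///srv/git/gromacs.git /tmp/workspace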

Edit: comment written in parallel, but we're all more-or-less on the same page

#40 Updated by Erik Lindahl about 4 years ago

Another option is to place it on shared storage. Stefan & I have been reading up, and the experimental support for shared file storage in Ceph seems quite stable. If that works it might be a good option. In contrast to NFS you just keep increasing bandwidth with more servers (and we have a dozen or so), and we would avoid the bottleneck of the single jenkins server.

#41 Updated by Mark Abraham almost 4 years ago

One thing that concerns me with the prospect of post-submit verification is timeliness of fixes.

Once we have issues identified post-commit, and not yet fixed+reviewed+submitted, then every new post-submit verify will fail (unless somehow we can separate the "known" issues from the ones newly introduced). This might mean people stop looking at post-submit verifies because they assume someone else caused the problem. Then people will get annoyed by having other people's coding timelines in their head space.

When some post-submit-issue fix is submitted, in principle the post-submit verifies on everything submitted in the meantime have to be re-run, because we don't generally know whether issues were being masked.

One feature we could add to reduce exposure to these kinds of problems would be to make it easy to run (some of the) post-submit verifies as pre-submit, and have Jenkins post the result on the review page. Given that someone has chosen to add such verifies, I guess those verifies should vote (and ideally get re-run automatically for future patch sets?). One way to implement that is to use git notes: a releng script can fetch the git notes, parse them for tags, and run the corresponding post-submit jobs. This way, when we have a fix for the Mac build system, we can have the Mac slave post-submit verifies voting on it.
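
A sketch of the git-notes idea (the notes ref and tag format here are invented purely for illustration):

  # Tag the commit with the post-submit jobs that should vote on it
  git notes --ref=jenkins add -m 'post-submit: mac-build' HEAD

  # A releng script could later fetch and inspect those notes
  git fetch origin refs/notes/jenkins:refs/notes/jenkins
  git notes --ref=jenkins show HEAD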

Otherwise, or perhaps in addition, I think we need a policy that if someone hasn't posted a plausible fix for an issue identified in post-submit verification of their commit within whatever period of time we agree (2 weeks?), then anybody may git revert the offending patch and submit that revert as soon as it passes pre-submit verification.

Regardless, I still do not like that these workflows place the onus on people to do work handling problems created by other developers. Where do we even discuss post-submit problems? By default, the now-submitted review discussion doesn't show up in gerrit. I get scores of emails from gerrit every day, so they go into a folder I never read. So someone (else) has to open a redmine issue? We could automate that, perhaps, but that doesn't scale easily over concurrent unrelated post-submit issues.

#42 Updated by Teemu Murtola almost 4 years ago

I think we should put some trust in the developers actually trying to keep the code running... Or is that too optimistic to ask? And most of the time, the amount of changes merged per unit time is not that big, so there will be time for people to fix things without it affecting a mass of changes... If/when we configure Jenkins to send e-mails to people who broke the post-submit builds, it will also create some social pressure, since this will probably work as in other CI systems: the person who initially breaks the build will get an e-mail for every subsequent failing build, and the recipient list will grow with the author of each subsequent change.

And as you say, we should make it possible to run at least some (if not all) of the post-submit jobs on-demand. But I would never run them automatically, as the whole point of them is that they can be expensive. It doesn't make sense to run them every time someone updates the commit message or a comment on a big change... People will hopefully be responsible and have some idea of the impact of their changes, and can then request the necessary verification for those patch sets where it makes sense. Probably the simplest way to set this up is to have jobs that trigger on Gerrit comments formatted according to some pattern.

But you cannot have everything: extensive test coverage with some ensemble tests etc. and a large number of configurations, everything verified per patch set, and still reasonably efficient usage of Jenkins hardware. And if we are going to have some tier-1 configurations anyway, we will have to deal with these issues regardless.

#43 Updated by Gerrit Code Review Bot almost 4 years ago

Gerrit received a related patchset '1' for Issue #1732.
Uploader: Mark Abraham ()
Change-Id: I9e53a30eefcbc056c765e0a5e02fb04c8be1369d
Gerrit URL: https://gerrit.gromacs.org/5038

#44 Updated by Szilárd Páll almost 4 years ago

Teemu Murtola wrote:

The current matrix contains eight configurations, mostly selected from the old matrix, and has builds for the following (in my mind relevant things):

Is the drastically reduced selection a temporary thing or a switch planned for production? It seems to me that such a reduction in testing fragile components of the code (e.g. the SIMD module, GPU arch-specific code paths) can lead to issues when related modules are changed; e.g. in the same way that AVX_128_FMA was simply disabled in parts of the code and went unnoticed for months because nobody tested it, broken SSE4.1 with gcc <=4.8 could go unnoticed for quite some time too.

What is missing (that I think should be there):
  • Consider whether we want to build GMX_SIMD=Reference or GMX_SIMD=None as the default, and which builds would make sense to build the other.

Why not SIMD=Auto as long as build and run stages of the verification are not decoupled?

  • One build that does not use thread-MPI nor MPI.

IMO enough as nightly.

  • X11 build (previous build was on bs_centos63; does any other host have the necessary libraries and headers installed?)

IMO enough as nightly.

  • valgrind or some other memory leak checker at least for one of the configs

Would be good.

  • Consider if some of the builds should use icc/mkl or atlas.

I'd test MKL as nightly.

  • OpenCL? (wasn't there in the old matrix, either)

IMO it should've been in 5.1 too. Or is OpenCL not considered tier-1?

One notable thing that is missing by intention is testing different SIMD setups and other platform-specific stuff; that easily makes the combinations explode, and I think the essential part here is trying to get as large coverage as possible for compilation of the codebase and generic issues.

As I see it, it would be better either to avoid defaulting to SIMD=None (for the reasons outlined above) or to explicitly add a few SIMD tests. Having matrix configs execute on random nodes with different SIMD capabilities could make things hard to reproduce, so I acknowledge that SIMD=auto may not be advantageous.

However, I think it's important that the tier-1 supported SIMD and accelerator architectures get explicitly tested in verification. The alternative is to do such tests as part of nightly builds, but given the strong and frequent focus of the GROMACS project and its devs on performance engineering I think it's more beneficial to plan on catching e.g. SIMD+compiler combination issues early rather than having broken code in the repository because the responsible developer is happy with the merge and does not consider fixing the issue urgent anymore.

#45 Updated by Teemu Murtola almost 4 years ago

Szilárd Páll wrote:

Teemu Murtola wrote:

The current matrix contains eight configurations, mostly selected from the old matrix, and has builds for the following (in my mind relevant things):

Is the drastically reduced selection a temporary thing or a switch planned for production? It seems to me that such a reduction in testing fragile components of the code (e.g. the SIMD module, GPU arch-specific code paths) can lead to issues when related modules are changed; e.g. in the same way that AVX_128_FMA was simply disabled in parts of the code and went unnoticed for months because nobody tested it, broken SSE4.1 with gcc <=4.8 could go unnoticed for quite some time too.

However, I think it's important that the tier-1 supported SIMD and accelerator architectures get explicitly tested in verification. The alternative is to do such tests as part of nightly builds, but given the strong and frequent focus of the GROMACS project and its devs on performance engineering I think it's more beneficial to plan on catching e.g. SIMD+compiler combination issues early rather than having broken code in the repository because the responsible developer is happy with the merge and does not consider fixing the issue urgent anymore.

You explicitly asked for reduction in usage spikes in the description. And just take a look at the history of changes in Gerrit; how many of those would benefit from exhaustive testing of all SIMD architectures? It's more like <10% than "strong and frequent focus". You cannot have such exhaustive testing for every patch set in every change and avoid those usage spikes at the same time. What I proposed (and still would do) is to have a post-submit verification that runs such more exhaustive testing for various platform-specific things; this is also in line with the original request and with Erik's suggestions. And I also proposed (and still would do) that those can be triggered for relevant changes on demand before submitting them, if there is any reason to think that such verification would be useful.

And I also already said that if we cannot trust people to fix their own stuff in reasonable time when notified by Jenkins, what can we do? If nothing else helps, we just revert the stuff.

What is missing (that I think should be there):
  • Consider whether we want to build GMX_SIMD=Reference or GMX_SIMD=None as the default, and which builds would make sense to build the other.

Why not SIMD=Auto as long as build and run stages of the verification are not decoupled?

Because that makes it close to impossible to see what is actually covered, which I would assume is a big contributor to missing configurations from the old matrix in the first place. It also leaves the whole setup vulnerable to someone accidentally changing the underlying VM setup such that the auto-selection picks something incorrect. And it also leaves the test coverage vulnerable to possible issues with the auto-selection logic if the only way to notice that something is wrong is to now and then check from the console log what actually got selected.

  • One build that does not use thread-MPI nor MPI.

IMO enough as nightly.

Either is fine for me.

  • X11 build (previous build was on bs_centos63; does any other host have the necessary libraries and headers installed?)

IMO enough as nightly.

Well, in the past few weeks there have already been several patches that would have broken X11 compilation (and at least twice things got broken before it was added back to the matrix). Given that this adds a negligible overhead on per-patchset basis, I don't really see why we wouldn't have it there.

#46 Updated by Mark Abraham almost 4 years ago

Szilárd Páll wrote:

Teemu Murtola wrote:

The current matrix contains eight configurations, mostly selected from the old matrix, and has builds for the following (in my mind relevant things):

Is the drastically reduced selection a temporary thing or a switch planned for production?

Temporary, as I said in my email to gmx-developers 13 days ago. Changing releng, debating and changing matrix coverage, putting the matrix(es) into a repo, and re-working things with Docker cannot possibly be done well in one stage. Currently we're stuck on how to implement in-repo matrices, preferably without hard-coded build-slave assignments, that trigger a matrix-like build that people will be happy to use. Discussion at http://redmine.gromacs.org/issues/1815. Once the matrices are in the repo, then the debate on what test configurations should go where can hopefully be reasonably straightforward, and we can't "lose" configs.

  • OpenCL? (wasn't there in the old matrix, either)

IMO it should've been in 5.1 too. Or is OpenCL not considered tier-1?

As you've doubtless noticed, I haven't had time to use your new CUDA installs on the build slaves either. :-)

#47 Updated by Szilárd Páll almost 4 years ago

Teemu Murtola wrote:

Szilárd Páll wrote:

Teemu Murtola wrote:

The current matrix contains eight configurations, mostly selected from the old matrix, and has builds for the following (in my mind relevant things):

Is the drastically reduced selection a temporary thing or a switch planned for production? It seems to me that such a reduction in testing fragile components of the code (e.g. the SIMD module, GPU arch-specific code paths) can lead to issues when related modules are changed; e.g. in the same way that AVX_128_FMA was simply disabled in parts of the code and went unnoticed for months because nobody tested it, broken SSE4.1 with gcc <=4.8 could go unnoticed for quite some time too.

However, I think it's important that the tier-1 supported SIMD and accelerator architectures get explicitly tested in verification. The alternative is to do such tests as part of nightly builds, but given the strong and frequent focus of the GROMACS project and its devs on performance engineering I think it's more beneficial to plan on catching e.g. SIMD+compiler combination issues early rather than having broken code in the repository because the responsible developer is happy with the merge and does not consider fixing the issue urgent anymore.

You explicitly asked for reduction in usage spikes in the description.

Indeed, I did, but I don't think we defined "spikes" very clearly, and I'm not sure how far it is worth reducing the set of parameters the verification covers just to cater for the rare peak times (when the verification time can increase well above 30 min). Also, we now have much more hardware, which is the only other way to reduce verification time, so at least the build phase should not be a problem, especially if the planned build/test decoupling can be implemented efficiently.

And just take a look at the history of changes in Gerrit; how many of those would benefit from exhaustive testing of all SIMD architectures? It's more like <10% than "strong and frequent focus".

Sure, but would you not agree that not all gerrit changes carry the same weight when it comes to how important it is to catch issues with them early? Even though the proportion of arch/hw-related changes is most probably <10%, these often fall into the category that is best fixed early because they can be painful and annoying later.

You cannot have such exhaustive testing for every patch set in every change and avoid those usage spikes at the same time.

I still think that, instead of ~12 configs (assuming there'll be a few more) with only a single one of them SIMD-enabled, having many or most of them SIMD- (and/or GPU-acceleration-)enabled would suffice to cover the mainstream architectures.

And I also already said that if we cannot trust people to fix their own stuff in reasonable time when notified by Jenkins, what can we do? If nothing else helps, we just revert the stuff.

It seems to me that as long as a change is hot in gerrit and needs attention so it can pass review, things tend to move swiftly, but as soon as it is merged issues can quickly get postponed and become demoted to low priority. I see this as a necessary consequence of busy volunteer-contributors.

Hence, keeping a change "hot" in gerrit while important issues are fixed seems like a good reason to explicitly test those technical aspects of the code (e.g. acceleration, parallelization, GPU offload) that are error-prone and reasonably easy/lightweight to test.

What is missing (that I think should be there):
  • Consider whether we want to build GMX_SIMD=Reference or GMX_SIMD=None as the default, and which builds would make sense to build the other.

Why not SIMD=Auto as long as build and run stages of the verification are not decoupled?

Because that makes it close to impossible to see what is actually covered, which I would assume is a big contributor to missing configurations from the old matrix in the first place. It also leaves the whole setup vulnerable to someone accidentally changing the underlying VM setup such that the auto-selection picks something incorrect. And it also leaves the test coverage vulnerable to possible issues with the auto-selection logic if the only way to notice that something is wrong is to now and then check from the console log what actually got selected.

Yeah, as I wrote later I did realize the potential issues with this.

  • One build that does not use thread-MPI nor MPI.

IMO enough as nightly.

Either is fine for me.

On second thought, doesn't the reference build turn off both thread-MPI and MPI?

  • X11 build (previous build was on bs_centos63; does any other host have the necessary libraries and headers installed?)

IMO enough as nightly.

Well, in the past few weeks there have already been several patches that would have broken X11 compilation (and at least twice things got broken before it was added back to the matrix). Given that this adds a negligible overhead on per-patchset basis, I don't really see why we wouldn't have it there.

Sure, I did not realize it was error-prone code. I'd still suggest a pragmatic approach: just because testing something is nearly free, it may still be worth omitting it from the verification if it's not of high priority - if nothing else to keep the verification job specs simple.

#48 Updated by Szilárd Páll almost 4 years ago

Mark Abraham wrote:

Szilárd Páll wrote:

Teemu Murtola wrote:

The current matrix contains eight configurations, mostly selected from the old matrix, and has builds for the following (in my mind relevant things):

Is the drastically reduced selection a temporary thing or a switch planned for production?

Temporary, as I said in my email to gmx-developers 13 days ago.

OK, I should re-read that mail.

As you've doubtless noticed, I haven't had time to use your new CUDA installs on the build slaves either. :-)

No, admittedly I have not, I did not get that far yet.

#49 Updated by Teemu Murtola almost 4 years ago

Szilárd Páll wrote:

And just take a look at the history of changes in Gerrit; how many of those would benefit from exhaustive testing of all SIMD architectures? It's more like <10% than "strong and frequent focus".

Sure, but would you not agree that not all gerrit changes carry the same weight when it comes to how important it is to catch issues with them early? Even though the proportion of arch/hw-related changes is most probably <10%, these often fall into the category that is best fixed early because they can be painful and annoying later.

Sure, all changes can have the same weight, but the cost of that verification also needs to be considered if it runs for every single change, not just for those that would benefit from it.

You cannot have such exhaustive testing for every patch set in every change and avoid those usage spikes at the same time.

I still think that, instead of ~12 configs (assuming there'll be a few more) with only a single one of them SIMD-enabled, having many or most of them SIMD- (and/or GPU-acceleration-)enabled would suffice to cover the mainstream architectures.

Fine. But just saying that does not make it happen. My initial proposal (that I already said had a lot of rough edges at the time I published it) was based on Erik's initial comment on this issue that "testing architecture-specific code is probably of the lowest priority" (that no one had ever refuted until now). And my initial proposal has been the only concrete proposal in this whole discussion for the verification matrix, despite numerous requests from me for other people to also contribute. I don't actually care that much about the contents of the matrix, but I was just trying to advance this issue. Perhaps I shouldn't have...

Hence, keeping a change "hot" in gerrit while important issues are fixed seems like a good reason to explicitly test those technical aspects of the code (e.g. acceleration, parallelization, GPU offload) that are error-prone and reasonably easy/lightweight to test.

Sure, lightweight testing is fine. But trying to detect all possible SIMD+compiler combination issues would require quite extensive testing which isn't exactly cheap...

On second thought, doesn't the reference build turn off both thread-MPI and MPI?

Yes, it does.
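
For reference, roughly what that means on the CMake command line (GMX_SIMD, GMX_THREAD_MPI and GMX_MPI are the relevant options; the exact set used by the Jenkins reference config may differ):

  # A "reference" build: reference SIMD kernels, no thread-MPI, no MPI
  cmake .. -DGMX_SIMD=Reference -DGMX_THREAD_MPI=OFF -DGMX_MPI=OFF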

Sure, I did not realize it was error-prone code. I'd still suggest a pragmatic approach: just because testing something is nearly free, it may still be worth omitting it from the verification if it's not of high priority - if nothing else to keep the verification job specs simple.

If we don't even want to keep some options compilable all the time, why would we want to support them at all? Without X11 in the per-patchset verification matrix, it will be difficult for most people to even realize that it gets broken, or to fix it (I, for one, would prefer not to have to figure out how to set up everything for an X11 compilation on my OS X just so that I could test/fix every single refactoring change locally).

#50 Updated by Mark Abraham over 3 years ago

I think the pre-submit matrix we have at https://gerrit.gromacs.org/#/c/5461/18/admin/builds/pre-submit-matrix.txt plus the WIP comments at https://docs.google.com/document/d/1doH9KClDFcqVrHJqDdgRpd9cwSs3l5bwqCIsZ-vcJfA/edit is a reasonable reflection of people's thoughts above. The role summary there is also good.

That an alternative form of one or other config is flaky for reasons unknown shouldn't stop us from doing something useful :-) Preferably fix the problem, but if that can't be done somehow, then we're not paid for perfection.

#51 Updated by Teemu Murtola over 3 years ago

Continuing the discussion from https://gerrit.gromacs.org/5461:

Szilard Pall wrote:

Actually, both Mark and I proposed changes and did work on getting those actually working. That's not "no one", but it is indeed pathetic compared to the amount of people who might care if they were more informed/involved. However, that's not something I can fix and it would be best discussed elsewhere -- or nowhere ;)

What I see is that you worked on getting the OpenCL stuff working, and proposed changes to exactly those configurations. But I saw very little concrete action on this before I pushed up the change that tried to add OpenCL to the verification matrix. The whole idea of me pushing up a change that for sure didn't work was to bully someone into taking action on the complaining that OpenCL is constantly broken...

Spot on. But we have moved on to experimenting with advanced software tools like private cloud/IaaS/containerization even before making sure that we have a new infrastructure and that devs during the v2016 development have their butts covered. At least the latter will hopefully get sim-done before a beta.
However, the same applies as above: I can't do much about all this.

Why not? Who do you expect to do something about any of this? Who do you expect to react to your complaining about poor coverage of the verification matrix?

How can I give better feedback? I thought I did make my suggestions clear regarding pre- and post-submit configs.

Who do you expect to guess which exact changes to the pre- and post-submit matrix would make you happy and would work together with the existing infrastructure? Who do you expect to consolidate all the conflicting wishes from different people? I don't think "feedback" is a good word for this, since there is clearly no one to act on that feedback. What we need is ownership of the verification matrices, and sorry, no, I will decline this responsibility. I have tried to help with the technicalities of getting things better documented and more easily extendable and usable, but that's about as far as I can go.

I suggest you make a concrete change proposal to pre-submit-matrix.txt, including the exact configs you want to change/add, and the exact configs you want to have there. And explain the rationale for the changes in the commit message, including how they relate to the discussion earlier in this issue. Then you can invite reviewers whose views might be different based on the discussion earlier, and we can discuss things in that concrete context. And you need to be prepared to follow up on the discussion and take those different views into account.
