Task #1815

implement and execute plan for new releng machinery

Added by Mark Abraham over 4 years ago. Updated about 3 years ago.

Status:
In Progress
Priority:
Normal
Assignee:
-
Category:
Jenkins
Target version:
-

Description

Current status:
  • new-releng jobs seem to be running OK, with some already replaced with a workflow job (with a -workflow suffix). Old configurations still exist, but are inactive. When can we delete them?
  • Coverage build script and basic support for it exist in new-releng, and a post-submit coverage job is running. A per-patchset build can be triggered by commenting "[JENKINS] Coverage" on any change, and the results get posted back once the build completes (they do not affect the voting). For new patch sets, the build can be requested again if needed. Old per-patchset builds exist, but are disabled.
  • Matrix build uses a matrix from the gromacs repo. Host mapping is automated using the releng script, with labels for each host specified in the Python code. It is possible to specify new build options by changes in gromacs.py only, as long as they do not influence the host mapping (and do not need to access node-specific locations etc). The matrix build is triggered through a workflow launcher job, which avoids using an extra executor slot for the launcher job.
  • Using a workflow build for the multi-configuration builds is ruled out for now, but can be revisited once the plugin has fewer issues and some of the most critical missing features are available. The code to run this still lives as a draft in Gerrit, and has TODOs for the various issues observed. Rebasing would probably simplify it significantly, since most of the machinery now exists for the other workflow builds.
  • Release_workflow_master has a reasonable replacement for Deploy_* jobs for creating source and regression tests packages, and testing them and building the documentation for publishing. Test coverage is not yet as good as it was previously, but otherwise all functionality should be covered. This job also demonstrates what currently is possible for a multi-config build using a workflow job, although some additional reporting would likely be possible.
  • Unit tests for releng Python scripts run in Releng_Gerrit_master job for changes to releng.
  • AddressSanitizer and LeakSanitizer are now running for master in Jenkins, with reasonable reporting of any errors as test failures in Jenkins (not just in console output).
  • Releng documentation is built as part of the documentation build and included in the developer manual.
Next steps/TODO (not necessarily in order):
  • Get the long-term verification matrices in shape (both pre- and post-submit). Probably requires some extra work on the script to assign hosts, but mostly this is about deciding what the matrices should contain. Related: https://gerrit.gromacs.org/5711, https://gerrit.gromacs.org/5461
  • Extend test coverage for the release workflow builds. After https://gerrit.gromacs.org/5823, the main things that are missing are verification of the MD5 sum for the regression tests download and compiling the template.
  • Start using Docker for something (preferably at least "normal" builds that do not require specific hardware, and possibly also the other special builds like uncrustify, documentation etc.).

Associated revisions

Revision c27506bb (diff)
Added by Mark Abraham about 4 years ago

Add Jenkins options matrix to source repo

This version dumps Gromacs_Gerrit_master-new-releng and
Gromacs_Gerrit_5_1 matrix configs respectively into pre- and
post-submit matrices. Later, we hope to get Jenkins to
trigger verification from such matrices.

Refs #1815

Change-Id: I10966e8bf4ae22b7a511cbd3f89000c8eca9f692

History

#1 Updated by Mark Abraham over 4 years ago

  • Description updated (diff)

#2 Updated by Mark Abraham over 4 years ago

Mark Abraham wrote:

At https://gerrit.gromacs.org/#/c/5003/ Teemu proposed

So, an actual plan for merging this:
1. Finalize the per-patchset verification matrix in the new-releng job (http://jenkins.gromacs.org/job/Gromacs_Gerrit_master-new-releng/). I think it is better to think this through from the ground up at this point, since the previous matrix is in a somewhat messy state after X11 and all valgrind configs got accidentally dropped, and now all old compilers were removed as well.

Currently it is

"gcc-4.1 no-openmp host=bs_nix1004" 
"gcc-4.4 gpu cuda-4.0 mpi openmp host=bs_nix1204" 
"gcc-4.8 gpu cuda-7.0 openmp release host=bs_nix1310" 
"gcc-4.9 openmp tsan fftpack host=bs_nix1310" 
"gcc-5.1 double mpi host=bs_nix1404" 
"clang-3.4 double no-openmp fftpack asan host=bs_centos63" 
"clang-3.6 double no-openmp fftpack mdrun-only host=bs_nix1204" 
"msvc-2010 openmp release host=bs_Win2008_64" 

That seems fine to me. I don't want to re-think the pre-patchset verification coverage while all the build-script machinery is in flux.

2. Ensure that all the currently set up Jenkins jobs work together with https://gerrit.gromacs.org/4880, and submit 4880.

Sure, all the current Jenkins *-new-releng jobs should be checked to make sure they work.

3. Do final cleanup in this change, ensure that it works, and submit it.

Fine

4. Copy the current matrix job into a post-merge job (it can be then improved separately).

You want to convert Gromacs_Gerrit_master into (the basis for) a post-submit job? Seems OK.

5. Convert the build jobs to use this one by one, and test them at the same time, including some failure scenarios. Fixes to get things running hopefully only need to change releng, not the actual build scripts.

I don't follow. We'd already have a pre- and post-submit matrix. What would be converted? Does https://gerrit.gromacs.org/#/c/5003/ or https://gerrit.gromacs.org/#/c/4880/ have legacy-support scripts that we'd plan to phase out?

6. Change all the jobs that vote to Gerrit to use this approach. Do we need to keep the builds for the old jobs? I.e., should we manually change the configuration for each of the existing jobs, or can we just delete the old ones and rename the new ones to the old names?
7. Continue development with any additional features (e.g., include the coverage build here).

Not sure about 6 because not sure about 5.

Later, we should move the matrices into a repo. The source repo seems most appropriate to me, but however we do it, it seems like we need a lightweight Jenkins job to check out the repo, read the configuration file, and populate a matrix. I've tried to do this in a single job, but I can't do all of the below in that order.

a. git checkout on host
b. read file from workspace
c. populate matrix from file contents
d. launch matrix configurations

So we'd want to use the Parameterized Trigger Plugin to do a, b, and c, and have it trigger a job that inherits everything and does d. I hope the subsequent Gerrit voting works, but that remains to be seen.

#3 Updated by Teemu Murtola over 4 years ago

Mark Abraham wrote:

That seems fine to me. I don't want to re-think the pre-patchset verification coverage while all the build-script machinery is in flux.

That matrix still has several issues:
  • it was made before the C++11 change, and makes it impossible to move forward with that if we take it as-is.
  • it leaves significant parts of the code outside all verification (e.g., gmx view and all SIMD-specific code)
  • it was made by selecting a subset of the existing matrix configurations, which severely limited the ability to cover significant things.

But since all my attempts to even start discussion on this (in #1732, or in gerrit) don't seem to lead anywhere, I don't really care any longer...

5. Convert the build jobs to use this one by one, and test them at the same time, including some failure scenarios. Fixes to get things running hopefully only need to change releng, not the actual build scripts.

I don't follow. We'd already have a pre- and post-submit matrix. What would be converted? Does https://gerrit.gromacs.org/#/c/5003/ or https://gerrit.gromacs.org/#/c/4880/ have legacy-support scripts that we'd plan to phase out?

Just merging those two changes will not change anything before the Jenkins configuration is changed to actually use the scripts in the jobs that vote. And the testing for those jobs hasn't been very thorough. But if we are happy with temporarily unstable verification and fixing things later, then we can leave out any additional testing at this point.

#4 Updated by Mark Abraham over 4 years ago

Teemu Murtola wrote:

Mark Abraham wrote:

That seems fine to me. I don't want to re-think the pre-patchset verification coverage while all the build-script machinery is in flux.

That matrix still has several issues:
  • it was made before the C++11 change, and makes it impossible to move forward with that if we take it as-is.
  • it leaves significant parts of the code outside all verification (e.g., gmx view and all SIMD-specific code)
  • it was made by selecting a subset of the existing matrix configurations, which severely limited the ability to cover significant things.

Good, we can add an X11 and at least one real-SIMD setting. (I'd suggest moving X11 to post-submit testing in future, but it's not important either way.)

But otherwise this just illustrates the co-dependency problem. Simultaneous consideration of
  • GROMACS versions present, current and future,
  • build scripts old and new,
  • Jenkins infrastructure old and new,
  • testing coverage changes,
  • software requirements old and new

makes for lots of complex, inter-linked problems. Brainstorming over a large range of things we want to change is great. But just like a change to software code, when we go to implement things, I think we should do things in chunks that affect as few parts as possible. Considering all of them at once just leads to stagnation - we can't make a final decision on the C++11 subset until we have tested C++11 compiler support on drafts of some real code, but it doesn't make sense to get Jenkins to do that until the new releng machinery is in place, etc.

5. Convert the build jobs to use this one by one, and test them at the same time, including some failure scenarios. Fixes to get things running hopefully only need to change releng, not the actual build scripts.

I don't follow. We'd already have a pre- and post-submit matrix. What would be converted? Does https://gerrit.gromacs.org/#/c/5003/ or https://gerrit.gromacs.org/#/c/4880/ have legacy-support scripts that we'd plan to phase out?

Just merging those two changes will not change anything before the Jenkins configuration is changed to actually use the scripts in the jobs that vote. And the testing for those jobs hasn't been very thorough. But if we are happy with temporarily unstable verification and fixing things later, then we can leave out any additional testing at this point.

I've been assuming that the *-new-releng jobs use the new machinery in those two patches. If so, then we make sure they all pass before we submit the patches. We tell gmx-developers they need to rebase (in moderation), and should expect some rough edges. Then we give the *-new-releng jobs voting powers. Rebase some jobs and see how things go for a few days, be conservative with actually submitting patches (e.g. I wouldn't submit Erik's SIMD patches). AFAICS the old Jenkins configs will still work, and the *-new-releng ones will get some battle testing. When after a few days we haven't got any known issues with the new configs, we stop the old configs voting and do some renaming of configs.

There will be coverage gaps, but there are already coverage gaps (e.g. X11 in Gromacs_Gerrit_master got lost at some point, I think) and one of the objectives is to reach a point where we have something like self-documenting coverage. Changing coverage now, before we submit the patches, means we have to fix any existing bugs in the coverage gaps before the resulting configs are clearly useful. That delays downstream projects (moving testing matrices to repos, unresolved C++11 questions, kernel generation infrastructure decisions, Docker Jenkins implementation) for no real gain (any bugs not found because of the coverage gaps don't really matter until we start submitting patches after the old configs stop voting and before we implement the new pre- and post-submit matrices that might find them).

#5 Updated by Teemu Murtola over 4 years ago

I agree with your points, but most of the mentioned issues weren't even on the table when I started with the script, and it wasn't just my choice to have it sit idle for several months in Gerrit while the world changed around it. There was just #1732 to address.

And my choice was between trying to preserve the existing matrix, or writing the script with the new matrix from #1732 in mind. And I saw no point in spending a lot of effort in replicating the old matrix, when it was already identified to have issues; effort spent there would be much better used to actually first finalize something in #1732. But apparently I overestimated the enthusiasm to do something there.

So now, if we are anyway not going to decide on the matrix, I would simply follow your approach and not change anything. We can still merge the changes and make the other jobs use these scripts. We can return to the matrix job(s) when the time is right, which might mean waiting for some additional infra work to happen.

As for testing, an essential part is also testing that the jobs detect the issues they are supposed to detect, not just that they pass. And so far, testing has been quite non-existent for some of the job types.

#6 Updated by Mark Abraham over 4 years ago

Enthusiasm is one thing, available time is another :)

I agree that we should have better coverage (which is one of the main points of #1732). I see clear advantages to getting releng stable, so we can explore moving the matrix into a repo, so that hopefully we can have people review a concrete proposal for coverage on the various tiers and test that proposal alongside the review discussion.

#7 Updated by Teemu Murtola over 4 years ago

From the original plan, there are still a few things remaining:
  • Agree on the per-patchset verification matrix (probably postponed until some additional infra work is done).
  • Remove the old jobs once sufficient testing of the new jobs has been done. A somewhat open question here is still whether we can just delete the old jobs, with their history, or whether we should do something else. Also, when we do this step, we should also check the retention settings for the various new-releng builds; currently they are set to keep only very few builds, to avoid the documentation build in particular filling up disk (since, not so long ago, there were several Jenkins outages because of full disks).
Additional future developments on the table (I might have missed a few):
  • Make Jenkins more robust. There is some discussion on gmx-developers. It is unclear whether it is just an unhappy coincidence that it is now unstable, or whether it is related to the temporarily increased load. Nonetheless, the current setup seems to break far too often because of technical reasons (e.g., Jenkins throws an exception with "Channel is already closed" at some point during the build). Today, it has at least gotten somewhat better...
  • Make most of the builds use Docker instances instead of the current hard-coded hosts. Some work has been done, but it is not clear where we are.
  • Move the matrix configuration into the source repo. This can further be split into a few things:
    • Script to preprocess the matrix (and potentially also assign the hosts automatically).
    • Figure out the best Jenkins configuration combination for the Gerrit Trigger and the Parameterized Trigger plugins, and the two chained builds that this requires.
    • Agree on the contents of the matrices (#1732 and essentially the same as the first point from the original plan).
  • Cleanup of the builds (e.g., run clang analysis again also for test code, simplify coverage builds if someone would like to run them locally).
  • Refactoring of the releng scripts to make testing script modifications easier.
  • Documentation nightly build to use releng (already done).
  • Coverage build to use releng (pending some changes to support triggering it on demand and review in Gerrit, and agreement on how it should be triggered).

#8 Updated by Teemu Murtola over 4 years ago

Additional things to my previous list:
  • Use LeakSanitizer (or valgrind, or something) for (at least) one of the per-patchset builds.
  • Make documentation from releng end up in the developer guide, and improve the combined coverage of it.

#9 Updated by Teemu Murtola over 4 years ago

And one more item to the list:
  • Splitting compilation/building and testing to different jobs in Jenkins.
What have people already done, and what would be the highest priority on the agenda here? All that I have concretely done/WIP is now in Gerrit, but here are some additional thoughts on other items on the list:
  • Cleaning up the existing builds should be straightforward, and I can do something there if no one else has time. Should also be quite independent of other stuff (except that cleaning up the coverage build first would need the supporting changes submitted).
  • I can also write at least the baseline script for preprocessing the matrix from the source repo for use in Jenkins. Mark also expressed interest in doing this, so it would be good to synchronize.
  • For investigating different Jenkins configurations to use with build configurations specified in the source repo, we should try out different alternatives and see how these integrate with Gerrit Trigger and how they work with Jenkins load:
    • Use a post-build parameterized trigger, like it is now.
    • Use a build step parameterized trigger, and possibly make the step blocking (which could make it possible to not have Gerrit Trigger configuration in the parent).
      • If we want to get rid of the matrix build, we could also use the facility to trigger multiple builds from files matching a pattern: the releng script could generate one file for each configuration to build, and the trigger would act on these, instead of a file that contains the full matrix.
    • A somewhat separate question is whether we could just have one, generic build that does the actual build, and just different triggering jobs for different branches. This might simplify maintenance.
    • The plan to also separate compilation and testing may have an impact on which of the patterns here can also support that case.
  • Refactoring of the releng scripts for local testing I can also do off to the side, when there is nothing more urgent to get done. I have some thoughts on how to make most of the contents testable with Python unittest, making it easier to develop some stuff outside Jenkins.
  • I also have some thoughts on how to combine the documentation from the two repos; it is in principle straightforward, since everything is already available in the Jenkins workspace during the documentation build. This I can also do at some point when there is nothing more urgent.
  • Some sort of leak checking would be very nice to reinstate, but since LeakSanitizer doesn't work at all on OS X, there is little I can easily do to help here (the latest released valgrind doesn't work with OS X Yosemite, either).

#10 Updated by Mark Abraham over 4 years ago

Teemu Murtola wrote:

From the original plan, there are still a few things remaining:
  • Agree on the per-patchset verification matrix (probably postponed until some additional infra work is done).

I have made my own sketch of suggestions for things to test at the various times, but I haven't yet found and integrated all the suggestions from others.

  • Remove the old jobs once sufficient testing of the new jobs has been done. An somewhat open question here still is whether we can just delete the old jobs, with their history, or whether we should do something else? Also, when we do this step, we also should check the retention settings for the various new-releng builds; currently they are set to only keep very few builds, to avoid the documentation build in particular filling up disk (since not-so-long-ago, there were several Jenkins outages because of full disks).

I think we've let enough time pass that we can stop the old jobs from running ~now. I suggest we leave the configs in place for a few weeks, and then dump them (perhaps caching their config.xml files). I don't think there's significant value in keeping results from the old configs, so we can reclaim that disk space and expand the settings on the new-releng jobs. I suggest we keep the *-new-releng jobs with their names for now, and plan to re-instate the un-suffixed versions with the new parameterized trigger versions discussed below.

Additional future developments on the table (I might have missed a few):
  • Make Jenkins more robust. There is some discussion on gmx-developers. It is unclear whether it is just an unhappy coincidence that it is now unstable, or whether it is related to the temporarily increased load. Nonetheless, the current setup seems to break far too often because of technical reasons (e.g., Jenkins throws an exception with "Channel is already closed" at some point during the build). Today, it has at least gotten somewhat better...

Yes, I know things are painful, but it's hard to diagnose some kinds of problems. Stefan has replaced the GPUs (no effect) and now the memory (seems good now), but finding out the real problem is challenging. Eventually taking compilation off these machines should help, I think.

  • Make most of the builds use Docker instances instead of the current hard-coded hosts. Some work has been done, but it is not clear where we are.
  • Move the matrix configuration into the source repo. This can further be split into a few things:
    • Script to preprocess the matrix (and potentially also assign the hosts automatically).
    • Figure out the best Jenkins configuration combination for the Gerrit Trigger and the Parameterized Trigger plugins, and the two chained builds that this requires.
    • Agree on the contents of the matrices (#1732 and essentially the same as the first point from the original plan).

Needing to bump both repos and Jenkins is painful. I think the path of least pain is

  1. Move the existing Gromacs_Gerrit-new-releng matrix into a source-repo per-patchset-verification file with one matrix config per line. Add another matrix for post-patchset verification that is approximately the config for 5.1, but don't plan to use it yet - just trying to minimize repo bumps. These matrices will have static hosts/labels (for now). I'll do this today.
  2. Have a Parameterized Trigger job check out the source repo, get the pre-submit matrix into a suitable form to pass to the parameterized Matrix job. I think that will require a temporary script somewhere, because the option for Parameterized Trigger to read a file to set a parameter that sets the OPTIONS matrix only works if the file contains a single line per parameter set. I'll play with this and/or Teemu's suggestions in comment 9 today.
  3. Write a preprocessing script in the releng repo that takes the matrix from the source repo, re-assigns any hosts/labels, and arranges to set OPTIONS for the downstream Matrix job. Sketch at https://gerrit.gromacs.org/#/c/5031/3, but it can be done better and re-use the releng option-handling machinery. Teemu, you mentioned you had ideas here; do you want to have a go at this?
  4. Extend the application of that script so that we have the post-submit verification matrix above working via a separate Parameterized Trigger (perhaps we can manage to make it run the same Matrix job? If that's still a thing.)
  5. Move as many testing configs as possible to Docker-based slaves (to enhance overall stability)
  6. Propose and test the long-term pre- and post-verification matrices (clean out static hosts/labels from source repo at this time)
  7. Split compilation and testing phases with some kind of Docker-based magic (we have Docker images that will do a CUDA compilation already, but how to schedule life in Jenkins is unclear)
  8. Propose and test nightly/weekly matrices

I haven't considered any particular needs of the docs/coverage/uncrustify jobs. We could either leave them as they are for the moment, or we could target some to use Docker slaves in the short term. That might be a useful learning experience and help get load off the GPU slaves?

  • Cleanup of the builds (e.g., run clang analysis again also for test code, simplify coverage builds if someone would like to run them locally).
  • Refactoring of the releng scripts to make testing script modifications easier.
  • Documentation nightly build to use releng (already done).
  • Coverage build to use releng (pending some changes to support triggering it on demand and review in Gerrit, and agreement on how it should be triggered).
  • Use LeakSanitizer (or valgrind, or something) for (at least) one of the per-patchset builds.

LeakSanitizer is on by default in AddressSanitizer, so I imagine we have it already. But I agree we want something like it working pre-submit.

  • Make documentation from releng end up in the developer guide, and improve the combined coverage of it.
  • Splitting compilation/building and testing to different jobs in Jenkins.

I agree on all the other things. I think the priorities are as in my numbered list above.

#11 Updated by Roland Schulz over 4 years ago

I think it wouldn't be too hard for me to add something to a Jenkins plugin. Let me know if you need anything which would be much easier if it is done by a plugin rather than a python script.

#12 Updated by Teemu Murtola over 4 years ago

Roland Schulz wrote:

I think it wouldn't be too hard for me to add something to a Jenkins plugin. Let me know if you need anything which would be much easier if it is done by a plugin rather than a python script.

What could help (a small thing, but still) would be if, instead of https://gerrit.gromacs.org/5069, Gerrit Trigger provided a configuration option to not vote, but still post the results back to Gerrit. Now there are three options: a build can vote, it can be marked to skip voting, or it can be made silent. If a build is skipped, then it does not participate in voting, but if an event only triggers skipped builds, then Jenkins still votes zero for that event. You can see the undesired behavior in https://gerrit.gromacs.org/5062 (patch set 7), where the triggered coverage build removes the original Jenkins vote, even though it should be skipped. And the desired behavior is like in patch set 8, where you still get the build link back, but it does not invalidate the old vote.

There probably still are use cases where the approach in https://gerrit.gromacs.org/5069 is useful, but Gerrit Trigger could be more robust (since it is also some trouble to get the build result into the message from https://gerrit.gromacs.org/5069, and in making it a post-build step so that it works even if the build times out or otherwise crashes).

#13 Updated by Gerrit Code Review Bot over 4 years ago

Gerrit received a related patchset '1' for Issue #1815.
Uploader: Mark Abraham ()
Change-Id: I10966e8bf4ae22b7a511cbd3f89000c8eca9f692
Gerrit URL: https://gerrit.gromacs.org/5073

#14 Updated by Roland Schulz about 4 years ago

I changed the Gromacs_Gerrit_master-new-releng configurations to accommodate the required versions after the C++11 patch.
I removed

gcc-4.1 no-openmp host=bs_nix1004

and changed

gcc-4.4 gpu cuda-4.0 mpi openmp x11 host=bs_nix1204
msvc-2010 openmp release host=bs_Win2008_64

to

gcc-4.6 gpu cuda-5.0 mpi openmp x11 host=bs_nix1204
msvc-2013 openmp release host=bs-win2012r2

GCC 4.6 and VS 2013 are the minimum compiler versions. GCC 4.6 requires CUDA 5.0, and VS 2013 is only on bs-win2012r2. I didn't see anything extra that the bs_nix1004 config provided.

#15 Updated by Teemu Murtola about 4 years ago

The gcc-4.1 configuration might have been the only one that was implicitly running CMake 2.8.8.

#16 Updated by Mark Abraham about 4 years ago

@Roland OK, I will update patch 5073 accordingly. I was planning for the C++11 patch to be able to handle its own matrix updates in due course, but if there's desire for the matrix to match sooner than that, we can do that also.

#17 Updated by Teemu Murtola about 4 years ago

LeakSanitizer is on by default only in clang-3.5 and above, but our asan build is using clang-3.4. And in order to take it into use, we probably need to work on some infrastructure to suppress issues we don't care about at the moment. And preferably also on infra to show errors clearly in Jenkins.

#18 Updated by Mark Abraham about 4 years ago

Teemu Murtola wrote:

LeakSanitizer is on by default only in clang-3.5 and above, but our asan build is using clang-3.4. And in order to take it into use, we probably need to work on some infrastructure to suppress issues we don't care about at the moment.

I see. We will move the "clang-3.4 asan no-leaksan" config to a Docker slave at some point, and shift the clang version to 3.7. I think at that time it makes sense to have a commit where the matrix changes to add LeakSan and deal with the issues arising at that time. Installing a new compiler on a VM slave before then doesn't seem to make great sense to me.

And preferably also on infra to show errors clearly in Jenkins.

Right

#19 Updated by Roland Schulz about 4 years ago

Teemu Murtola wrote:

There probably still are use cases where the approach in https://gerrit.gromacs.org/5069 is useful, but Gerrit Trigger could be more robust (since it is also some trouble to get the build result into the message from https://gerrit.gromacs.org/5069, and in making it a post-build step so that it works even if the build times out or otherwise crashes).

According to http://stackoverflow.com/questions/25047537/running-a-post-build-script-when-a-jenkins-job-is-aborted it is possible to run post-build even for aborted. What is the problem of getting the build result into the message?

#20 Updated by Teemu Murtola about 4 years ago

Roland Schulz wrote:

According to http://stackoverflow.com/questions/25047537/running-a-post-build-script-when-a-jenkins-job-is-aborted it is possible to run post-build even for aborted. What is the problem of getting the build result into the message?

Yes, it's possible to do this with a post-build step. But as far as I can tell, the build status is not easily accessible in any typical post-build steps (e.g., in an environment variable). And it's not possible to directly run a Python script as a post-build step, either. The mentioned Groovy plugin does provide the build status (and some other potentially useful stuff, at least if we don't care about its security restrictions), but it probably requires us to copy another potentially complicated script to many different jobs.

#21 Updated by Roland Schulz about 4 years ago

Just saw https://github.com/jenkinsci/workflow-plugin/blob/master/TUTORIAL.md . This might be better than the plan with the parameterized trigger.

#22 Updated by Roland Schulz about 4 years ago

Tested config with CMake 2.8.8 here: http://jenkins.gromacs.org/job/Gromacs_Gerrit_master-new-releng-test/2/ . Requires https://gerrit.gromacs.org/#/c/5081/ (or the conflicting one). Is there already a consensus on what minimum CMake version we want for 2016? The Google doc suggested 2.8.12. Should we bump the version now?

#23 Updated by Teemu Murtola about 4 years ago

Roland Schulz wrote:

Just saw https://github.com/jenkinsci/workflow-plugin/blob/master/TUTORIAL.md . This might be better than the plan with the parameterized trigger.

That has potential, although it forces yet another scripting language. But probably we could keep the Groovy script relatively stable and keep as much of the build logic in Python as possible. And this could enable moving more of the configuration into releng, which would reduce maintenance cost. It could be worth testing, although integration with Gerrit Trigger and our multi-repository setup might require some non-trivial tweaking. I think we should now focus on a proof-of-concept of the whole thing with different approaches, and then select the approach we are going to use to actually implement it. And that POC should really encompass all the elements we want to have: Gerrit integration plus the possibility of manually triggering cross-verification, easily accessible build results for developers, easily maintainable build scripts and job configuration, configurations from the source repo, extensibility to splitting the jobs for compilation and testing, etc.

#24 Updated by Roland Schulz about 4 years ago

Currently voting works as follows:
  • It finds the most negative vote of all jobs triggered together
  • While doing so, it skips those marked as such
  • It posts the result

If all jobs are marked as "skip", it votes 0. Is it sufficient to change this so that if all jobs are marked as "skip", it doesn't vote? Or do we need the current behavior of "skip" too, and thus need an additional option?

#25 Updated by Roland Schulz about 4 years ago

Having thought about it a bit more, I see no reason to have the option of both behaviors, and I put this on the Jenkins tracker: https://issues.jenkins-ci.org/browse/JENKINS-30393 . It would be nice to not have to maintain our own version, so it would be good if we do something they accept upstream.

#26 Updated by Teemu Murtola about 4 years ago

Roland Schulz wrote:

Having thought about it a bit more, I see no reason to have the option of both behaviors, and I put this on the Jenkins tracker: https://issues.jenkins-ci.org/browse/JENKINS-30393 . It would be nice to not have to maintain our own version, so it would be good if we do something they accept upstream.

I agree that we probably at least do not need both behaviors, and probably for most uses it should be fine to just have the complete skipping. And yes, we should pick a solution that is acceptable upstream. An ideal solution that would also allow our on-demand jobs to vote could be for the plugin to aggregate the votes from all builds that have been done for a patch set (across different trigger events), taking the last vote from each job that has been triggered at some point, but that doesn't fit very well with the current design of the plugin (which is completely centered around the trigger events).

#27 Updated by Roland Schulz about 4 years ago

Yeah, adding aggregation across both triggered and on-demand would be a bit more work inside the plugin. I think we could create a 2nd Jenkins user (e.g. jenkins-ondemand) and a 2nd Gerrit server in Jenkins (using the 2nd user). Then the on-demand jobs would use the 2nd user (through the 2nd server config), and Gerrit would aggregate the verify votes from both Jenkins users. This might add a bit of unnecessary load on Jenkins/Gerrit (Jenkins would probably read the stream events twice, but that is probably negligible). If we need aggregation over more than one on-demand job, that would probably work too, as long as the user uses one comment to demand both. That way gerrit-trigger would aggregate over those.

#28 Updated by Teemu Murtola about 4 years ago

Roland Schulz wrote:

Yeah, adding aggregation across both triggered and on-demand would be a bit more work inside the plugin. I think we could create a 2nd Jenkins user (e.g. jenkins-ondemand) and a 2nd Gerrit server in Jenkins (using the 2nd user). Then the on-demand jobs would use the 2nd user (through the 2nd server config), and Gerrit would aggregate the verify votes from both Jenkins users. This might add a bit of unnecessary load on Jenkins/Gerrit (Jenkins would probably read the stream events twice, but that is probably negligible). If we need aggregation over more than one on-demand job, that would probably work too, as long as the user uses one comment to demand both. That way gerrit-trigger would aggregate over those.

That sounds more complicated than it's worth, in particular since it still doesn't scale very well. But we could use a similar, simpler approach to actually implement the non-voting behavior: if we configure a second server, with the same user, but with different voting commands, this would probably work. But there's still the overhead from multiple stream event listeners, and we need to ensure that no one creates a job with "any server", because it would probably then trigger twice each time...

#29 Updated by Roland Schulz about 4 years ago

I created a proof of concept workflow job: http://jenkins.gromacs.org/job/Gromacs_Gerrit_master-workflow. It runs 2 configs (would work with any number) with the releng scripts. Currently the configs (=matrix) are still in the script but the script could be in git. There are many TODOs left. Most should be doable. The main thing which could be a deal breaker: not sure yet if all our post-build plugins work with workflow.
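
For readers unfamiliar with the workflow plugin, this is roughly the shape such a job takes; a minimal sketch, not the actual job contents (the node label, option strings, and releng entry point are assumptions):

def configs = ['gcc-4.9 openmp', 'clang-3.4 asan']     // hypothetical option strings
def branches = [:]
// C-style loop, because the CPS interpreter has trouble with for-each loops.
for (int i = 0; i < configs.size(); i++) {
    def opts = configs[i]                              // local copy captured by the closure
    branches[opts] = {
        node('bs_nix1310') {                           // hypothetical label; real configs map options to hosts
            sh "python releng/run_build.py '${opts}'"  // hypothetical releng entry point
        }
    }
}
parallel branches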

#30 Updated by Roland Schulz about 4 years ago

Most seem to be supported. AFAIK Text-finder is currently not. But we could either do that by putting it in the python script or by fixing Text-finder (probably not hard).

#31 Updated by Teemu Murtola about 4 years ago

Roland Schulz wrote:

I created a proof of concept workflow job: http://jenkins.gromacs.org/job/Gromacs_Gerrit_master-workflow. It runs 2 configs (would work with any number) with the releng scripts. Currently the configs (=matrix) are still in the script but the script could be in git. There are many TODOs left. Most should be doable. The main thing which could be a deal breaker: not sure yet if all our post-build plugins work with workflow.

There are a lot of issues still:
  • It is difficult to see from the build what it actually built, without following the link back to Gerrit. Most builds just show "No changes. No changes. No changes. No changes." The thing that makes this work somehow for our normal builds is the "Gerrit Trigger" choosing strategy for the git checkout (ideally, it would show the changes from all the repos, not just the one that triggers the build).
  • The git build status output can get really messy and dominating if we get two entries for every different configuration we build.
  • It's not possible to easily see what got built or which configurations failed (unless the workflow plugin creates some easy-to-understand information on the summary page for failures). The flow graph will be way too messy for our real build to quickly scan through.

These are probably solvable (at least to a better extent than is currently done), but probably this would need to be done before we can decide whether this really is the thing for us.

Most seem to be supported. AFAIK Text-finder is currently not. But we could either do that by putting it in the python script or by fixing Text-finder (probably not hard).

The only reason we really have this is to be able to set the build status to unstable. There are probably quite a few different alternatives for doing that, which could also be easier to understand than the current text search.

#32 Updated by Teemu Murtola about 4 years ago

Another thing that is not clear to me is whether it is possible to create real post-build steps with the workflow engine. By real I mean steps that always run, even if some of the preceding steps failed. Currently, we probably do not have such steps, though (but we might want to have some, if we want to get more information out of failed builds).

Also, the console output is impossible to read (except when accessing it step by step through the Running Steps page).

#33 Updated by Roland Schulz about 4 years ago

Teemu Murtola wrote:

There are a lot of issues still:
  • It is difficult to see from the build what it actually built, without following the link back to Gerrit. Most builds just show "No changes. No changes. No changes. No changes." The thing that makes this work somehow for our normal builds is the "Gerrit Trigger" choosing strategy for the git checkout (ideally, it would show the changes from all the repos, not just the one that triggers the build).

Looks right to me. Not sure what you mean.

  • The git build status output can get really messy and dominating if we get two entries for every different configuration we build.

Yes, that isn't nice. I don't see anything configurable, so one would either need to fix it in the plugin or wait for a feature request to be answered.

  • It's not possible to easily see what got built or which configurations failed (unless the workflow plugin creates some easy-to-understand information on the summary page for failures). The flow graph will be way too messy for our real build to quickly scan through.

If the build fails with a compiler warning or a unit test failure, those links might be useful. If one needs the console output, one would need to use the flow graph. Are you sure the flow graph would not work with more configurations, given that the failing ones would be highlighted?

These are probably solvable (at least to a better extent than is currently done), but probably this would need to be done before we can decide whether this really is the thing for us.

Most seem to be supported. AFAIK Text-finder is currently not. But we could either do that by putting it in the python script or by fixing Text-finder (probably not hard).

The only reason we really have this is to be able to set the build status to unstable. There are probably quite a few different alternatives for doing that, which could also be easier to understand than the current text search.

OK. It is possible to set it to unstable based on the python script exit code:

try {
    sh ...
} catch (Exception e) {
    // The message of a failed sh step ends with the exit code (as a string),
    // so compare against '1' to mark exit code 1 as unstable.
    if (e.getMessage().split()[-1] == '1') {
        currentBuild.result = 'UNSTABLE'
    }
}

Using the try..catch should also make it possible to use real post build steps.
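
For example, a try/finally wrapper gives a step that always runs, which is the closest workflow equivalent of a post-build action; a sketch where the node label, script path, and result pattern are illustrative assumptions:

node('bs_nix1310') {
    try {
        sh 'python releng/run_build.py'                // hypothetical build entry point
    } finally {
        // Runs even if the build step above failed, like a post-build action.
        step([$class: 'JUnitResultArchiver', testResults: '**/Testing/**/*.xml'])
    }
}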

#34 Updated by Teemu Murtola about 4 years ago

Roland Schulz wrote:

Teemu Murtola wrote:

There are a lot of issues still:
  • It is difficult to see from the build what it actually built, without following the link back to Gerrit. Most builds just show "No changes. No changes. No changes. No changes." The thing that makes this work somehow for our normal builds is the "Gerrit Trigger" choosing strategy for the git checkout (ideally, it would show the changes from all the repos, not just the one that triggers the build).

Looks right to me. Not sure what you mean.

The intended behavior with Gerrit Trigger is that the build summary page (and the Changes link) shows the title of the change that triggered the build, for each build. This currently does not work: most builds triggered by Gerrit Trigger just show "No changes". It is partially caused by the checkout not yet using the correct refspec, but the only way to make the Changes link work as expected is to use the special choosing strategy.

  • The git build status output can get really messy and dominating if we get two entries for every different configuration we build.

Yes that isn't nice. Don't see anything configurable, so one would either need to fix it in the plugin or wait on feature request to be answered.

I think this is mentioned (as a side remark) in the workflow tutorial. And it is caused by using the git plugin to do all the checkouts. The solution in the tutorial is to just do a single checkout, and then use archive/unarchive to move around the sources. For us, what would probably work better is to only use this approach for the releng scripts (or not even that), and do all the other checkouts from within those scripts, without the plugin. But we should still do one checkout for all the repos using the plugin, to get the Changes part (and also the other git build data, although that is likely less important) working.

  • It's not possible to easily see what got built or which configurations failed (unless the workflow plugin creates some easy-to-understand information on the summary page for failures). The flow graph will be way too messy for our real build to quickly scan through.

If the test fails with warning/unit test those links might be useful. If one needs the console one would need to use the flow graph. Are you sure the flow graph would not work with more configurations, given that the failing ones would be highlighted?

As it is now, there is nearly one screenful of links for each configuration in the flow graph. Scrolling through 20 screenfuls of links is not ideal, even though I agree that it is probably easy to spot anything that hasn't succeeded. Things might improve if we minimize the number of steps in each configuration, but we probably still need a few. It would help a lot if the branches below a node were collapsible, and collapsed by default.

OK. It is possible to set it to unstable based on the python script exit code:
[...]

Using the try..catch should also make it possible to use real post build steps.

Something like that probably works and can replace the text finder, although parsing the exit code from an error message does not seem ideal.

#35 Updated by Roland Schulz about 4 years ago

I think workflow still has a lot of rough edges. Besides the issues mentioned, the script interpreter also doesn't handle all of Groovy correctly (Java 5-style for loops, binary methods accepting a Closure). So it might be better to wait. On the other hand, it seems all the main Jenkins developers are behind this approach, and it has a lot of momentum, with many plugins having already added support. The matrix project type has several problems too, and it might be less well supported in the future as more people move to workflow. The disadvantage of not adopting workflow immediately is that we need to invest time in implementing a solution to move the matrix into git now, and then more time to implement workflow later.

#37 Updated by Teemu Murtola about 4 years ago

Roland Schulz wrote:

I think workflow still has a lot of rough edges. Besides the issues mentioned, the script interpreter also doesn't handle all of Groovy correctly (Java 5-style for loops, binary methods accepting a Closure). So it might be better to wait. On the other hand, it seems all the main Jenkins developers are behind this approach, and it has a lot of momentum, with many plugins having already added support. The matrix project type has several problems too, and it might be less well supported in the future as more people move to workflow. The disadvantage of not adopting workflow immediately is that we need to invest time in implementing a solution to move the matrix into git now, and then more time to implement workflow later.

(accidentally made these comments private the first time)

I think the main question in deciding whether to take it into use now is whether we can get acceptable behavior (about as good as for the current matrix build). In order to test that, it would be nice to test that the compiler warnings and JUnit test results plugins work (which also allows us to test how unstable and failing builds work in practice). The warnings plugin is actually also an example of a post-build plugin that we would like to run also for failed builds. And then we should check that we can get acceptable reporting from the git plugin about the checkouts: check that we can specify the choosing strategy, and that things work nicely if we don't use the plugin for more than a single checkout.

If these things work OK, I would assume that we can get by even with the current workflow plugin. We can then only be positively surprised if we can do more stuff. At least in principle, most of the more advanced scenarios would be much simpler with the workflow plugin, since there is a lot we can do even without that many additional plugins.

#38 Updated by Teemu Murtola about 4 years ago

The magic checkout invocation that we likely want to use/adapt is

checkout changelog: false, scm:
[$class: 'GitSCM', branches: [[name: '$CHECKOUT_REFSPEC']], doGenerateSubmoduleConfigurations: false,
extensions: [[$class: 'RelativeTargetDirectory', relativeTargetDir: '$CHECKOUT_REFSPEC'],
[$class: 'CleanCheckout'],
[$class: 'BuildChooserSetting', buildChooser: [$class: 'GerritTriggerBuildChooser']]],
submoduleCfg: [],
userRemoteConfigs: [[refspec: '$CHECKOUT_REFSPEC', url: 'ssh://jenkins@gerrit.gromacs.org/$CHECKOUT_PROJECT.git']]]

We can use changelog: false to get rid of the unnecessary entries in the Changes section, but we can still consider using a separate script if that allows us to reduce the unnecessary build steps that are shown in Running Steps.

#39 Updated by Teemu Murtola about 4 years ago

In order to test the real behavior for the workflow, we should upgrade the warnings plugin (and analysis-core), so that those can be used from a workflow script.

Given that it is possible to specify all the properties for the git checkout, and to exclude unwanted checkouts from the changelog, I'm quite confident that we can make the checkouts work as we want, so the only reason to do anything about them for now is to reduce clutter on the Running Steps page, in case that matters for evaluating how easy it is to figure out why a build failed.

Another thing that might be nice would be to keep the timestamps in console output, but that's just nice-to-have.

#40 Updated by Mark Abraham about 4 years ago

Teemu Murtola wrote:

In order to test the real behavior for the workflow, we should upgrade the warnings plugin (and analysis-core), so that those can be used from a workflow script.

Not sure what "analysis-core" is, but I updated Warnings, Static Analysis Utilities and Parameterized Trigger plug-ins. And JUnit.

#41 Updated by Teemu Murtola about 4 years ago

Mark Abraham wrote:

Not sure what "analysis-core" is, but I updated Warnings, Static Analysis Utilities and Parameterized Trigger plug-ins. And JUnit.

analysis-core = "Static Analysis Utilities"

I configured the workflow job to be a bit closer to what we would like it to be, and now builds other than the Windows build should ~work. But I haven't yet had time to test how it actually works with compiler warnings/errors or other reasons for build instability (some of which certainly do not yet work at all). We might still need to go to a shell script that does the git checkouts, since changelog: false does not remove the multiple useless "Git Build Data" links (which all point to the same page anyway...).

We may also want to only trigger this on a particular topic, to avoid duplicate verification of every single change, but I did not yet change the trigger configuration.

#42 Updated by Mark Abraham about 4 years ago

We can e.g. trigger on verify-matrix topic, which might anyway become relevant later.

I am cautiously optimistic that we can make workflow useful enough in the short term.

I have had a look at the Gerrit trigger plugin source, and I'm not keen on the thought of trying to make it listen for results from a downstream matrix job.

I am open to some solution that has each config in the matrix post its own link to Gerrit, but I don't yet see any way to implement something that lets the Gerrit Trigger Plugin know that it is the x-th entry from the matrix (whether in the source repo or not).

#43 Updated by Roland Schulz about 4 years ago

The voting with downstream (e.g. parameterized->matrix) is OK, right? You can make the one wait on the result of the other. But what doesn't easily work is making the link to the actual job show up in Gerrit, correct (so that one doesn't need to click twice to get to it)? Just making sure I understand correctly what does and doesn't work with the parameterized job proposal.

#44 Updated by Teemu Murtola about 4 years ago

The warnings plugin might not really be up to the job yet. We can probably improve things somewhat from how it currently works, but as with the git plugin, it generates a lot of duplicate links that all point to the same HTML page (and none of those links really work). And a lot of entries to the build summary, where the links from there also all lead to the same page (e.g., http://jenkins.gromacs.org/job/Gromacs_Gerrit_master-workflow/113/ has a link that says 9 warnings, but when you click on it, you actually see 18, since that is what some other build happened to generate. And all links from the sidebar lead to a page that says zero warnings).

#45 Updated by Roland Schulz about 4 years ago

Have you tried to run the warning plugin step after the parallel?

#46 Updated by Mark Abraham about 4 years ago

Roland Schulz wrote:

The voting with downstream (e.g. parameterized->matrix) is OK, right? You can make the one wait on the result of the other. But what doesn't easily work is making the link to the actual job show up in Gerrit, correct (so that one doesn't need to click twice to get to it)? Just making sure I understand correctly what does and doesn't work with the parameterized job proposal.

Yes, I think the voting is accurate when the dispatcher job correctly waits on the matrix job, but having to click through a second link is something I think we'd like to avoid. I think we might accept that if it's otherwise the best solution. Maybe we can make a post-build stage of the dispatcher job that would call a Gerrit command-line function to post as Jenkins Buildbot the link to the matrix job when it's UNSTABLE or FAILED?
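
A hedged sketch of that idea, using the standard "gerrit review" SSH command (the account, port, and MATRIX_BUILD_URL variable are assumptions; in the dispatcher job this would be an ordinary shell post-build step rather than a workflow step):

// GERRIT_CHANGE_NUMBER and GERRIT_PATCHSET_NUMBER come from Gerrit Trigger;
// MATRIX_BUILD_URL is a hypothetical variable holding the downstream matrix build's URL.
sh "ssh -p 29418 jenkins@gerrit.gromacs.org gerrit review " +
   "--message '\"Matrix build: ${env.MATRIX_BUILD_URL}\"' " +
   "${env.GERRIT_CHANGE_NUMBER},${env.GERRIT_PATCHSET_NUMBER}"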

#47 Updated by Teemu Murtola about 4 years ago

That might work (although making paths relative probably doesn't); I haven't tried. But I'm a bit sceptical that it wouldn't be confused by mixed output from different nodes in the same log. It's also a bit backwards that we need to allocate a separate executor just to run a post-build plugin...

#48 Updated by Roland Schulz about 4 years ago

Parsing the log shouldn't take much CPU so we could just run this on master. But allocating an executor for a short amount of time should also not matter.

#49 Updated by Teemu Murtola about 4 years ago

Roland Schulz wrote:

Parsing the log shouldn't take much CPU so we could just run this on master. But allocating an executor for a short amount of time should also not matter.

Probably not. It's not clear whether you can run a step() outside of a node, though, so allocating an executor might be the only choice. But neither of these solves the problem that the console log is a mess at that point, and it's not clear whether there is any guarantee of any kind of continuity in the log. So a multi-line warning might well get mixed with output from some other slave executing in parallel. There are some identifying tags at the beginning of the lines, but I would be very positively surprised if the warnings plugin actually used these to split the log (or were able to parse it in parts from the individual flow steps).

#50 Updated by Teemu Murtola about 4 years ago

Roland Schulz wrote:

Have you tried to run the warning plugin step after the parallel?

I now tried, and it is better (at least, no duplicate links), but
  • It does not handle warnings from different workspaces uniformly; some warnings have the workspace path in the file location, some do not.
  • It is impossible to associate the warnings with the configurations that generated them.

Also, without a lot of testing (or reviewing the plugin code), it's hard to say whether it works reliably in this stage (because of the reasons I mentioned earlier).

#51 Updated by Teemu Murtola about 4 years ago

Now the workflow job is about as good as I think we can easily make it. It does not build the correct change when triggered through Gerrit Trigger, but that should be easy to fix if we go this path. Building with parameters should allow testing it for different changes in Gerrit.

Now it's about trying to figure out whether it is good enough, or whether the matrix still provides better usability. I haven't tested how the build works if there are compiler warnings or test failures, which would probably be also good to test. There is no logic (yet) to set the build unstable based on something that happens within the Python script; that may also be slightly tricky, at least if we want the reason for this happening to show up somewhere.
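
One possible (untested) approach for the last point would be for the Python script to write a small status file that the workflow script inspects after the build step; a sketch with hypothetical paths and label, assuming the script always writes the file:

node('bs_nix1310') {                                    // hypothetical label
    sh 'python releng/run_build.py'                     // hypothetical releng entry point
    // The script could write e.g. "UNSTABLE" to this file to downgrade the result
    // without failing the sh step outright.
    def status = readFile('logs/build-status.txt').trim()
    if (status == 'UNSTABLE') {
        currentBuild.result = 'UNSTABLE'
    }
}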

#52 Updated by Teemu Murtola about 4 years ago

Mark Abraham wrote:

Yes, I think the voting is accurate when the dispatcher job correctly waits on the matrix job, but having to click through a second link is something I think we'd like to avoid. I think we might accept that if it's otherwise the best solution. Maybe we can make a post-build stage of the dispatcher job that would call a Gerrit command-line function to post as Jenkins Buildbot the link to the matrix job when it's UNSTABLE or FAILED?

Have you tried changing the "URL to post" configuration value in the dispatcher job? If environment variables set by the blocking Parameterized Trigger build step are available there, you could possibly construct the correct URL there, and get it posted instead of the dispatcher job URL. Another alternative could be to use the "Unsuccessful message file", and create a post-build step that puts the URL into a file, and then use that option to have Gerrit Trigger to read the file and post the contents.

Another thing that might or might not work to get both results posted back would be to make the dispatcher job trigger the matrix job from a post-build step, have both jobs triggered by Gerrit Trigger, and additionally specify "Other jobs on which this job depends" on the matrix build.

#53 Updated by Mark Abraham about 4 years ago

Side issue: are we happy enough with the current *-new-releng configs that we can disable the old Jenkins configs for master? I think things are working fine, but haven't had time to look at the proposals for releng.

#54 Updated by Mark Abraham about 4 years ago

Teemu Murtola wrote:

Mark Abraham wrote:

Yes, I think the voting is accurate when the dispatcher job correctly waits on the matrix job, but having to click through a second link is something I think we'd like to avoid. I think we might accept that if it's otherwise the best solution. Maybe we can make a post-build stage of the dispatcher job that would call a Gerrit command-line function to post as Jenkins Buildbot the link to the matrix job when it's UNSTABLE or FAILED?

Have you tried changing the "URL to post" configuration value in the dispatcher job? If the environment variables set by the blocking Parameterized Trigger build step are available there, you could possibly construct the correct URL and get it posted instead of the dispatcher job URL. Another alternative could be to use the "Unsuccessful message file": create a post-build step that puts the URL into a file, and then use that option to have Gerrit Trigger read the file and post its contents.

Another thing that might or might not work to get both results posted back would be to make the dispatcher job trigger the matrix job from a post-build step, have both jobs triggered by Gerrit Trigger, and additionally specify "Other jobs on which this job depends" on the matrix build.

Good suggestions, thanks! I haven't tried those.

#55 Updated by Teemu Murtola about 4 years ago

Mark Abraham wrote:

Side issue: are we happy enough with the current *-new-releng configs that we can disable the old Jenkins configs for master? I think things are working fine, but haven't had time to look at the proposals for releng.

The new-releng coverage job is still a bit in flux (and requires decisions on how to proceed), but the others should be fine. Stuff in Gerrit for releng is either refactoring, or support for future developments (the coverage job and the things discussed here). All fixes for issues I've noticed have already been merged, but I haven't followed the builds very carefully.

#56 Updated by Mark Abraham about 4 years ago

Teemu Murtola wrote:

Now the workflow job is about as good as I think we can easily make it. It does not build the correct change when triggered through Gerrit Trigger, but that should be easy to fix if we go down this path. Building with parameters should allow testing it for different changes in Gerrit.

Now it's about trying to figure out whether it is good enough, or whether the matrix still provides better usability. I haven't tested how the build behaves if there are compiler warnings or test failures, which would probably also be good to test. There is no logic (yet) to set the build unstable based on something that happens within the Python script; that may also be slightly tricky, at least if we want the reason for it to show up somewhere.

I tried http://jenkins.gromacs.org/job/Gromacs_Gerrit_master-workflow/119/console on https://gerrit.gromacs.org/5046 (because the latter has plenty of issues) and was reasonably happy.

None of these are deal breakers for me. Some are probably known issues, or existing TODO items, which is fine.

I'll try out Teemu's suggestions in comment 52, but not before this evening.

#57 Updated by Teemu Murtola about 4 years ago

Mark Abraham wrote:

I tried http://jenkins.gromacs.org/job/Gromacs_Gerrit_master-workflow/119/console on https://gerrit.gromacs.org/5046 (because the latter has plenty of issues) and was reasonably happy.

None of these are deal breakers for me. Some are probably known issues, or existing TODO items, which is fine.

I was also aware of most of the issues you noticed, and I don't think there is an easy solution for any of them; some comments below. I have a feeling that at this point this will be somewhat of a usability regression from the old matrix build, but I can't really judge how soon the issues could get fixed, or whether other people think the regression is acceptable given that in the future this likely gives us a lot more flexibility. I can help where I can, but I don't want to take the heat for the regression.

That could be possible, but would require changing gmxtest.pl and possibly also the Google Test XML output (and/or applying an XSLT transformation and some other processing to the XML files after they are produced). There is a field in JUnit XML that describes the test environment that we could probably use for this, and that Jenkins already shows; we would just need to populate it.

This is an issue in the warnings plugin. I can see a few alternatives:

  • Fix the warnings plugin to do something sensible within a workflow. Probably the reasonable behavior would be that it would only detect warnings produced by its enclosing node {} block, and that it would accumulate the warnings from multiple steps within the workflow, like the JUnit plugin does.
  • Make our own local hack: create duplicates of all the warnings parsers that we use, make the regexes recognize the possible absolute paths to workspaces, and convert those to the corresponding absolute paths in the workspace where the plugin is running.
  • Change our Jenkins configuration such that every single job in the workflow executes with the same absolute path to the workspace. But this is really difficult, since that would also mean that we would need to prevent concurrent execution on the same slave.

The first would be ideal, but can take some time (the whole workflow support in the warnings plugin is only a few weeks old, I think).

  • It would be nice to redirect most of what goes into the per-config console logs to files (saved as artefacts?), and perhaps only dump their contents to the main console log if there's a problem. That might get the main console log down to 100-200 lines for a fully successful build.

If we did that, the console logs could not be accessed at all. And they could not be used for tracking the progress of the build, either, nor for investigating how long different things take (in case we could make the timestamper plugin work within the workflow). The main console log is just a combination of the individual console logs; there is no separate control over what goes there.

  • It would be nice if we could figure out a way to identify the configs that showed issues already on http://jenkins.gromacs.org/job/Gromacs_Gerrit_master-workflow/119/. 'Duh, of course real MPI configs will break with what I wrote' seems like an example of a frequently occurring scenario that might require no further drilling into error messages.

Yes, but this is probably impossible. The only way with the current plugins would probably be to abuse artifacts for this purpose (e.g., generate an artifact for each failing build). What would be really nice (for this and a lot of other things) would be to have the addSummary() functionality from Groovy Post-Build plugin, but that is not currently available in a workflow...
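
A rough sketch of the artifact abuse (assuming the standard writeFile and archive workflow steps; the label, command, and file name are illustrative):

    node('gcc-4.9') {                                  // hypothetical config label
        try {
            sh './build.sh'                            // hypothetical build command
        } catch (err) {
            // Abuse an artifact as a per-configuration failure marker that is at
            // least visible from the build front page.
            writeFile file: 'failed-gcc-4.9.txt', text: err.toString()
            archive 'failed-gcc-4.9.txt'
            throw err
        }
    }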

#58 Updated by Teemu Murtola about 4 years ago

Teemu Murtola wrote:

Mark Abraham wrote:

  • It would be nice if we could figure out a way to identify the configs that showed issues already on http://jenkins.gromacs.org/job/Gromacs_Gerrit_master-workflow/119/. 'Duh, of course real MPI configs will break with what I wrote' seems like an example of a frequently occurring scenario that might require no further drilling into error messages.

Yes, but this is probably impossible. The only way with the current plugins would probably be to abuse artifacts for this purpose (e.g., generate an artifact for each failing build). What would be really nice (for this and a lot of other things) would be to have the addSummary() functionality from Groovy Post-Build plugin, but that is not currently available in a workflow...

Actually, the code that does this in the Groovy Post-Build plugin is close to trivial, so it might not be too difficult to make it work somehow. As far as I can tell, currentBuild.getRawBuild().getActions().add() could be used to create those build summary entries, provided that we could create suitable instances of a subclass of hudson.model.Action (similar to GroovyPostBuildSummaryAction). But I'm not an expert on the internals of Jenkins and the workflow plugin, so I can't tell how that could be done. Probably the simplest way would still be a plugin that provides a separate step for the workflow plugin, and writing one from scratch can be a bit tedious (although the amount of code would be very small; mostly it is just wiring existing code together and probably providing some HTML content for the configuration pages). This is essentially https://issues.jenkins-ci.org/browse/JENKINS-26918, though.
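
A speculative sketch of that idea (untested; SummaryNoteAction is a hypothetical class defined here only for illustration, getRawBuild() would need script-security approval, and the entry would still lack the summary rendering that a proper plugin provides):

    import hudson.model.Action

    // Trivial hudson.model.Action subclass; on its own Jenkins has no view for it,
    // which is exactly the part that would need a small plugin.
    class SummaryNoteAction implements Action {
        private final String text
        SummaryNoteAction(String text) { this.text = text }
        String getIconFileName() { return null }
        String getDisplayName() { return 'Summary note' }
        String getUrlName() { return null }
        String getText() { return text }
    }

    currentBuild.getRawBuild().getActions().add(
        new SummaryNoteAction('Failed configurations: gcc-4.9, msvc-2015'))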

#59 Updated by Roland Schulz about 4 years ago

On the Jenkins issue it is pointed out that currentBuild.description is already available. Is that sufficient for what you want or do we need more than that?

#61 Updated by Teemu Murtola about 4 years ago

Roland Schulz wrote:

On the Jenkins issue it is pointed out that currentBuild.description is already available. Is that sufficient for what you want or do we need more than that?

Indeed, changing the build description can probably be used as a workaround (I think it should be possible to insert arbitrary HTML there).

If we go down this path, we might also want to consider our security approach for Jenkins. It is already possible for anyone who can create an account on Gerrit (which I think is anyone) to run arbitrary code on the build slaves, but this would make it possible for anyone who can push changes to releng to run arbitrary code on the Jenkins master (within the Jenkins process) and to inject arbitrary HTML at least into the build web pages.

#62 Updated by Teemu Murtola about 4 years ago

Roland Schulz wrote:

I reported the warnings issue: https://issues.jenkins-ci.org/browse/JENKINS-30551

I think there is one additional catch that you didn't yet mention there: even if you put the warnings plugin inside each node, you will get warnings from multiple builds in the generated output, so I guess the plugin actually scans the whole top-level console log up to the point where it runs and just aggregates everything. So the plugin should also be changed such that it only scans the console log produced by the node {} within which it runs. This might be tricky unless the workflow plugin already provides a method for doing this.
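
To make the placement concrete, running the publisher inside each node would look roughly like this (a minimal sketch; labels, commands, and the parser name are illustrative), but with the current plugin both invocations would apparently still scan the whole top-level log:

    def branches = [:]
    branches['gcc-4.9'] = {
        node('gcc-4.9') {                              // hypothetical config label
            sh './build.sh'                            // hypothetical build command
            // Ideally this would only see output from the enclosing node {} block;
            // today it seems to aggregate the whole console log instead.
            step([$class: 'WarningsPublisher',
                  consoleParsers: [[parserName: 'GNU Make + GNU C Compiler (gcc)']]])
        }
    }
    branches['gcc-5.1'] = { /* same pattern for the other configurations */ }
    parallel branches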

#63 Updated by Teemu Murtola about 4 years ago

http://jenkins.gromacs.org/job/Gromacs_Gerrit_master-workflow/120/ shows an example of what we can currently do for failed configurations. We probably cannot get information about which configurations produced warnings, but for other reasons for instability we should be able to list the configurations there as well. And the workflow is flexible enough that, with some extra effort, we can provide much more information than in the matrix build: we can, e.g., list for each configuration which target actually failed to build or which (types of) tests failed. This is only limited by how much effort we want to put into extracting this kind of information.
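
For the record, one way to get this kind of listing looks roughly like the following (a minimal sketch; the configuration labels and build command are hypothetical, and the actual job's script is not reproduced here):

    def configs = ['gcc-4.9', 'clang-3.7']              // hypothetical configuration labels
    def failed = []
    def branches = [:]
    for (int i = 0; i < configs.size(); i++) {
        def config = configs[i]
        branches[config] = {
            node(config) {
                try {
                    sh './build.sh'                      // hypothetical build command
                } catch (err) {
                    failed.add(config)                   // record the failing configuration
                }
            }
        }
    }
    parallel branches
    if (!failed.isEmpty()) {
        // Surface the failing configurations on the build front page.
        currentBuild.description = 'Failed configurations: ' + failed.join(', ')
        error('Build failed in: ' + failed.join(', '))
    }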

#64 Updated by Roland Schulz about 4 years ago

I think it would be nice to make those into links to the console log of the failed config.

#65 Updated by Teemu Murtola about 4 years ago

Roland Schulz wrote:

I think it would be nice to make those into links to the console log of the failed config.

Yes, but that can be hard without changes to the workflow plugin (or finding some undocumented feature in it). The console logs are found at .../execution/node/NN/log, where NN appears to be allocated dynamically depending on the order in which the steps start. So it is not possible to find out what the number is for the relevant shell execution steps without somehow getting it from the workflow plugin.

Using the build description to show the message has the (very slight) downside that the message now also shows in the build listing (at http://jenkins.gromacs.org/job/Gromacs_Gerrit_master-workflow/), which is likely not the intended use for this information.

#66 Updated by Teemu Murtola about 4 years ago

I put some additional effort into the workflow build. Here is a brief comparison of the workflow build vs. the old matrix build, on aspects that we probably cannot make work the same any time soon, to help in deciding what to do next:

  • Both have the built configurations and pass/fail status on the build front page (single click from Gerrit).
    • In the matrix build, additional information about failing configurations is one more click away; there is no such functionality in the workflow build. The only way to access information organized by configuration is through Running Steps, and finding the relevant console log in the long list of steps.
    • However, with extra effort, we can produce some more details on the build front page for the workflow build, but this can be very tricky for a matrix build.
  • Matrix build reliably reports ABORTED as its status when it is manually aborted, or if some part times out. The workflow build is FAILED most of the time even on timeouts, and it can be tedious to improve this (something can probably be done, though).
  • Timeouts work differently: the matrix build times out if there is no activity at all for some time (currently 15 minutes); the workflow build times out if a single build takes longer than the given time (a sketch follows this list). The timeout in a workflow build does not work on Windows (the Windows build happily runs to the end even if it exceeds the timeout), which can be a big annoyance if there are deadlocks in the code.
  • Currently, the console log in the workflow build does not provide any timing information that would help in diagnosing why something timed out.
  • In a workflow build, compiler warnings are only accessible as one list over all configurations, without any information on where they originated from. Links to source code work sporadically. In a matrix build, compiler warnings listing identifies which build produced the warning, and all source code linking works. It is also possible to access compiler warnings produced by a single configuration by clicking on that configuration in the matrix.
  • In a workflow build, failed tests can only be listed as one big list. The only way to see in which configuration a test failed is to click on the test (the configuration is currently shown under "Standard Output" heading). In a matrix build, clicking on the unstable ball in the matrix allows one to see just the tests that failed in that configuration.
  • In a workflow build, the list of builds contains some not-really-useful text intermixed inside the list (see http://jenkins.gromacs.org/job/Gromacs_Gerrit_master-workflow/). This is a side effect from using the build description to contain information about the built configurations.
  • The workflow build behavior will not change significantly if we move the configuration listing to the source repo (I can implement this as well, if we decide to go down this path). It is unclear how this will affect the matrix build.
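
To make the timeout difference above concrete, the per-build timeout in the workflow corresponds to something like the following (a minimal sketch; the label, command, and limit are illustrative), so the limit applies to the whole branch rather than to inactivity, and on Windows the enclosed step is currently not interrupted:

    node('bs_nix64') {                                   // hypothetical slave label
        // Aborts this branch if it runs longer than the limit; unlike the matrix
        // build's inactivity timeout, steady output does not keep it alive.
        timeout(time: 90, unit: 'MINUTES') {
            sh './build.sh'                              // hypothetical build command
        }
    }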

#67 Updated by Teemu Murtola about 4 years ago

Additional notes:

In a workflow build, compiler warnings are only accessible as one list over all configurations, without any information on where they originated from. Links to source code work sporadically. In a matrix build, compiler warnings listing identifies which build produced the warning, and all source code linking works. It is also possible to access compiler warnings produced by a single configuration by clicking on that configuration in the matrix.

Additionally, the success/unstable/failed status for individual configurations shown on the workflow build's front page is not unstable if there are only compiler warnings, so the whole workflow can be unstable even though all the individual configurations show success. To fix this, we would need to 1) be able to run the warnings plugin separately for each configuration, and 2) somehow get the information from the warnings plugin on whether there were warnings. For the latter, either the step should return the status to the workflow Groovy script, or we would need a way to access the console log from the Groovy script so we could extract the information that the warnings plugin writes there.

In a workflow build, failed tests can only be listed as one big list. The only way to see in which configuration a test failed is to click on the test (the configuration is currently shown under "Standard Output" heading). In a matrix build, clicking on the unstable ball in the matrix allows one to see just the tests that failed in that configuration.

I just now noticed that in a matrix build, you can also click on the failing tests (either the full list or individual tests), and in each case the web page already clearly identifies which configurations failed, and you can also access the tests grouped by configuration from there. There is no equivalent in the workflow build.

#68 Updated by Mark Abraham about 4 years ago

Thanks for the summary!

I had a dig through the various code this afternoon, and I think Workflow isn't ready enough for us now. Hopefully it will be ready in a few months' time and we can reap the benefits of the work you guys have put in. Are there other issues with Workflow (or other plugins' interaction with it) that are worth reporting upstream?

#69 Updated by Teemu Murtola about 4 years ago

With the workflow plugin ruled out for now, I played a bit with the parameterized trigger approach.

The first finding is that the "Other jobs on which this job depends" Gerrit Trigger option does not appear to work together with Parameterized Trigger: you get two builds triggered for the downstream job, one by Gerrit Trigger without the OPTIONS parameter set, and another by Parameterized Trigger. So it is probably impossible to trigger the matrix build from a post-build step and still get reasonable interoperability with Gerrit Trigger.

So the only alternative is to use a blocking build step, whose major disadvantage is that the launcher build then occupies a second executor slot for the duration of the whole build even though it is not doing anything. We need to think about what to do with these executors so that they don't impact load distribution on Jenkins too much.

I think I managed to get the essential pieces in place for Gromacs_Gerrit_master-matrix-from-repo to actually work together with Gerrit Trigger, and to post the URL of the downstream build to Gerrit instead of the launcher URL. It should also gracefully fall back to posting the launcher job URL if there is actually a problem in the launcher job. The only limitation of the current approach is that the actual matrix job can only have alphanumeric characters and underscores in its name, but if we name it Gromacs_Gerrit_master, that should not be a big issue.

Is this the way we are going to take for now?

#70 Updated by Teemu Murtola about 4 years ago

The main disadvantage (on top of the extra executor overhead) over the current matrix build is that in order to retrigger the matrix build, you need one more click (to navigate to the launcher build) after you've determined that retriggering could solve the problem.

#71 Updated by Mark Abraham about 4 years ago

Teemu Murtola wrote:

With the workflow plugin ruled out for now, I played a bit with the parameterized trigger approach.

So the only alternative is to use a blocking build step, whose major disadvantage is that the launcher build then occupies a second executor slot for the duration of the whole build even though it is not doing anything. We need to think about what to do with these executors so that they don't impact load distribution on Jenkins too much.

OK. How about I arrange to dedicate some slave (e.g. a new VM slave with a few cores) to running the launcher builds? We give it whatever (large-ish) number of executor slots we think works well enough. That's not ideal, but as we move to Docker-based builds (running on real hardware that is currently unused), our need for the existing VM-based slaves will drop.

I think I managed to get the essential pieces in place for Gromacs_Gerrit_master-matrix-from-repo to actually work together with Gerrit Trigger, and to post the URL of the downstream build to Gerrit instead of the launcher URL. It should also gracefully fall back to posting the launcher job URL if there is actually a problem in the launcher job. The only limitation of the current approach is that the actual matrix job can only have alphanumeric characters and underscores in its name, but if we name it Gromacs_Gerrit_master, that should not be a big issue.

Is this the way we are going to take for now?

Yes. In particular, we want to re-extend SIMD coverage so we can get Erik's SIMD patches decently tested before we submit them. Then he plans to work on new Verlet-scheme kernel generation, while others of us get the rest of the infrastructure sorted out, so that we might rip out the group scheme, hopefully by Christmas.

I'll have a think about how matrix gaps should be covered, now that they're in-repo.

#72 Updated by Teemu Murtola about 4 years ago

  • Project changed from GROMACS to Support Platforms
  • Description updated (diff)
  • Category set to Jenkins
  • Status changed from New to In Progress

Updated the description to collect the current status from the long discussion. Please update if I missed something.

#73 Updated by Mark Abraham about 4 years ago

Update looks good. I've asked Stefan to organize us a VM we can use for the launcher builds. Will look at label/host/capability issues now.

#74 Updated by Mark Abraham about 4 years ago

We now have bs_nix-matrix_master, which is currently a VM backed by 2 real cores, running 10 executor slots. I made Gromacs_Gerrit_master-matrix-from-repo (and the matrix parent part of Gromacs_Gerrit_master_nrwpo) always run on this new slave. It now works for a test build, and should continue to work fine under real load. The expected workload is lightweight, but we can play with those parameters as the need arises.

#75 Updated by Mark Abraham about 4 years ago

  • Description updated (diff)

Updated the description to reflect progress.

#76 Updated by Roland Schulz about 4 years ago

The option to comment in Gerrit without voting has been implemented: https://issues.jenkins-ci.org/browse/JENKINS-30393. I can compile the git version if you want to use it.

#77 Updated by Teemu Murtola about 4 years ago

Roland Schulz wrote:

The option to comment in Gerrit without voting has been implemented: https://issues.jenkins-ci.org/browse/JENKINS-30393. I can compile the git version if you want to use it.

There have already been a few official releases with the issue fixed, so it should be sufficient to upgrade to the latest. There's so much happening right now that I'm not sure whether anyone has time to put in much effort to use it, but I can try to do something with the coverage job.

#78 Updated by Teemu Murtola almost 4 years ago

  • Description updated (diff)

I now disabled the old per-patchset coverage builds (they were not working anyway...), and added a new nightly coverage job that runs with the new releng script.

#79 Updated by Teemu Murtola almost 4 years ago

  • Description updated (diff)

Someone upgraded Jenkins and the Gerrit Trigger plugin, so the on-demand coverage job now works.

#80 Updated by Teemu Murtola almost 4 years ago

  • Description updated (diff)

Updated the description for the current status on replacing packaging and Deploy_* jobs for releasing.

Also updated other status.

#81 Updated by Mark Abraham almost 4 years ago

I updated/installed Pipeline (formerly workflow), HTML publisher, Groovy postbuild, and timestamper

#82 Updated by Teemu Murtola almost 4 years ago

Mark Abraham wrote:

I updated/installed Pipeline (formerly workflow), HTML publisher, Groovy postbuild, and timestamper

Thanks, now the Release_workflow_master job can give a lot more useful information on the build front page (the last successful build shows the current status). There's still work to be done there, but a lot of this will be reusable if we start using workflow builds for something else in the future as well.

#83 Updated by Teemu Murtola almost 4 years ago

I implemented the necessary changes to Gerrit Trigger to make it possible for us to replace the matrix launcher job with a workflow job: https://github.com/jenkinsci/gerrit-trigger-plugin/pull/274

If someone has extra time, we could compile a custom build of the plugin and test it, or we can wait until the pull request (hopefully) gets merged. But I cannot easily set up an environment to test my changes with a real Gerrit instance...

#84 Updated by Teemu Murtola over 3 years ago

  • Description updated (diff)

Updated the description to reflect the current status. The list of things to do is probably starting to shrink, so perhaps we can close this issue when the remaining TODOs mentioned in the description are done.

#85 Updated by Teemu Murtola over 3 years ago

  • Description updated (diff)

Updated status for AddressSanitizer/LeakSanitizer.

#86 Updated by Roland Schulz over 3 years ago

Do we know which of the pipeline issues we had are still unresolved in Jenkins? I think it would be very valuable both to us (to get them fixed and to know the status) and to the Jenkins devs if we filed an issue for each one that is still unresolved.

#87 Updated by Teemu Murtola over 3 years ago

I think at least these issues are among those that were blocking us (and these cover nearly all the TODOs in the draft change):

There are so many open issues that it's a bit difficult to tell whether other relevant ones are there. One thing that likely has a lot of related issues already reported is our desire to get access to the URL of the console output of an individual step, even if that exact issue might not be reported yet.

#88 Updated by Teemu Murtola over 3 years ago

One more relevant issue for the workflow replacing the matrix build is this:

#89 Updated by Mark Abraham over 3 years ago

Found this talk from DockerCon 2016 that has pretty much already implemented the kind of thing I was going to work on shortly. It has links embedded, etc. https://youtu.be/YViFZBoKqjg

#90 Updated by Roland Schulz over 3 years ago

The youtube link doesn't work.

#91 Updated by Mark Abraham over 3 years ago

Roland Schulz wrote:

The youtube link doesn't work.

Sorry, I watched it on my phone and clearly didn't get a good link. Fixed.

#92 Updated by Teemu Murtola about 3 years ago

  • Description updated (diff)

Updated the status for several items that are now done. The on-demand [JENKINS] mechanism has also been greatly extended with a workflow build, although that wasn't mentioned in the description here.

If people have other things that they would like to improve in the Jenkins builds, I suggest creating separate issue(s) to record those.
