Thursday, 25 July 2013

Jenkins comes to NAG

We build a lot of Fortran and C code as part of our Libraries and Compiler, and we have over 40 years experience in software engineering and in numerical analysis and algorithms. However, not being a large organization means that it can be hard to keep up with the vanguard of new SE methodologies and frameworks. There are some approaches though that seem a no brainer to have as part of the workflow in a modern software company. Convergent evolution can lead to their development and adoption without any prompting from the greater community.

For decades we've followed and benefited from many of the principles of continuous integration (for example, version control and build automation---but no longer with SCCS or shell scripts!) and we've been running a self-hosted and self-maintained manager for automatic builds since the early 2000s. But we began to find the system showing its age and straining at the seams. The build manager uses dumb ssh-ing to launch each build on a slave, so there is no communication channel between a build and the driver. The build indexer has become quite unmaintainable and is, shall we say, rustic: it resembles a web page from 1995.

We don't have the resources to pump any significant improvements into these systems, so a couple of years ago we had a look around for replacements.

Buildbot is nice. We liked the visualizations it comes with out of the box, and some of us know Python so we felt confident we'd be able to set up the software. There's no way of configuring by a GUI though, which restricts for us the audience of potential maintainers.

Hudson seemed too Java focused to fit easily into our model.

Then we lost some personnel and so the investigation stalled. It remained 1995 for a few more years.

In 2011 we audited and codified many of our processes; in December 2011 we were awarded ISO 9001 certification. There was a renewed interest in enhancing the reporting that we use for measuring code health. At around this time we also made major changes to our release schedules. We needed the greatest amount of automation, the smoothest development process, the most informative reporting of health that we could manage.

In the time since we looked at Buildbot and Hudson we'd heard good things about Jenkins, a fork of Hudson (but no relation to Leeroy). I tried Jenkins out on some small projects and I saw good things. The GUI makes it a doddle to experiment with a set up. There are lots of nice, relatively-mature and well-maintained plugins for email notifications, tracking violations of user-defined coding standards, reporting on code coverage, Chuck Norris, ... So we set up a small group to prototype a new configuration for use by the whole company, and then, all going well, to port over the old system.

The NAG Jenkins is getting off the ground at the moment. Here are some things we've implemented and discovered (i.e., been burned by).

• We want as many of the slave workspaces as possible to be on an NFS disk. In practice this means for all Unix-like slaves. With this configuration we can easily, at a low level, peek at the builds if necessary - using find, or whatever. Initially we went as far as using a single workspace for every such node: something like /users/jenkins/workspace/. This is bad! The Remote FS root for a node must be unique to that node. Otherwise every node that launches will unpack the Jenkins .jar files into the same directory. If there are different Java runtimes accessing these same files then sporadic End of File errors will occur (presumably because of the different runtime states becoming corrupted or confused).

• As a fun exercise in getting to know more about how Jenkins plugins work and how to develop them, we wrote a warnings parser for the NAG Fortran Compiler and added it to the Jenkins warnings plugin.

• Some of our projects are pretty monolithic. We were hoping to use the HTML Publisher plugin to publish our build reports for easy access from a job's page. As far as we could see from its source code, this plugin does recursive directory traversals and accumulates its findings in memory. On some of our older slaves, with the workspace on an NFS disk, this was just taking an intolerable amount of time. As a compromise a post-build Archive the artifacts is good enough for us instead.

• There were resource problems when trying to report code coverage using gcov, gcovr and Coberatura as described by SEMIPOL. Our gcov-instrumented build generates 20,000 *.gcda files and 23,000 *.gcno files. This is too much data for gcovr to aggregate, which it tries to do by building an XML DOM tree in memory. In the end we just use lcov to scan the gcov files, and then we use the HTML Publisher to publish the report index.

• Mac slaves need to run headlessly.

• Controlling Windows slaves as a service requires you to modify the user rights settings (although, as Jenkins tells us on the configure page for such a node, we probably deserve grief if we use this launch method). The Windows user launching the slave needs to be assigned the right to Log on as a service: start secpol and add the user under Local Policies -> User Rights Assignments -> Log on as a service.

• We have a version-controlled $JENKINS_HOME, which Jenkins itself backs up as a periodic job (although I understand that there's a plugin which does a similar thing). Inspired by Keeping your configuration and data in Subversion we use a parametrized job that runs #!/bin/tcshcd$JENKINS_HOME
svn add -q --parents *.xml jobs/*/config.xml users/*/config.xml userContent/* >& /dev/null
svn pd -q svn:mime-type *.xml jobs/*/config.xml users/*/config.xml userContent/*.xml
echo "warnlog\n*.log\n*.tmp\n*.old\n*.bak\n*.jar\n*.json\n*.lck\n.owner" > myignores
echo "identity.key\njenkins\njenkins.security*\nlog*\nplugins*\nsecret*\nupdates" >> myignores
svn ps -q svn:ignore -F myignores .
rm myignores
echo "builds\nlast*\nnext*\n*.txt\n*.log\nworkspace*" > myjobignores
svn ps -q svn:ignore -F myjobignores jobs/*
rm myjobignores
svn status | grep '!' | awk '{print $2;}' | xargs -r svn rm svn ci --non-interactive --username=jenkins -m "$1"if (\$status != 0) then
exit 1
endif
svn st


• The facility for multi-configuration projects seemed attractive for one of our jobs, which parametrizes especially easily over the source-code branch and target architecture. We need to have all builds in the matrix share the same workspace and to build in a shared checkout. By default each build in a matrix project runs in its own workspace, but as with a free-style project this is easy to configure: select Use custom workspace in the Advance Project Options of the matrix job, which will then uncover a Directory for sub-builds field. Also by default the master job performs the designated source-code step (e.g., Emulate clean checkout) but then so does each child job in the matrix! So all of the jobs in our matrix clash as they try to manipulate the checkout at the same time. There doesn't appear to be a way to customize this short of doing the correct work you want using explicit job commands. So for the time being we're going to maintain individual and extremely similar jobs for this matrix. With some strong command-line fu we can easily batch modify all the configurations for these jobs when necessary.

Currently we have a roster of 24 Library builds in different configurations, plus a further 20 or so jobs for additional tests of our development infrastructure and for bookkeeping. There are a few next steps we'd like to try out.

• We want to see how far we can go towards configuring a full implementation of a Library, from a clean checkout all the way through to a final tarball to send for review and QA. The nature of some of the algorithms we are testing makes it difficult to enable entirely-automatic verification, but clearly but we should be able to use Jenkins to create a workable first approximation.

• Our canonical sanity-checking Library build is made using the NAG Fortran Compiler. All the current Library builds in Jenkins run nightly, so we need to look at making the checking build more continuous, after each commit (modulo some quiet period).

• Other than the main job page for our projects we don't have any clever notifications enabled for tracking or visualizing the health of the jobs. Hopefully we can sort out some nice graphics to show on a TV.

We're pretty excited about what we can do with Jenkins and what we can get from it. For automatic builds at NAG the future is looking bright; the future is looking blue (well, green).