Delay Analysis (5.5.0)
The 5.5.0 release was scheduled for publication on Sept 5, 2018, but it was delayed until Sept 11, 2018. This issue aims to report back on the delay -- and to discuss improvements and mitigations going forward.
Key details
- Publication is heavily dependent on automation -- in a typical release cycle, we rely on automated systems to prepare draft tarballs; to run the main test suite in multiple environments; to run combinatorial tests with those environments and various extensions; to prepare demo sites where we can try out the installation routine from a user-level perspective; to categorize pending PRs; to test PRs; etc. These systems provide important signals that we always check during publication. Automating them makes the release process faster. In the happiest case, automation means that we don't work as hard on publication -- but often it just means we have capacity left to investigate false negatives and other late-stage issues (e.g. regressions reported during the RC period).
- Most of the automated tasks were run through a distributed job manager (Jenkins), but they were coupled to particular servers (see the sketch after this list). For example:
  - Nightly tests of Civi ran against a minimal environment (e.g. PHP 5.5, MySQL 5.5) on the VM `test-ubu1204-1` (on a physical server-cluster named `ganeti`).
  - Nightly tests of Civi ran against a newer environment (e.g. PHP 7.0, MySQL 5.7) on the VM `test-ubu1604-1` (on a physical server named `padthai`).
  - PR tests and tarball builds ran on an in-between environment (e.g. PHP 5.5 and MySQL 5.7) on the VM `test-ubu1204-5` (on a physical server named `padthai`).
  - Bots dealing with Git/Github administration ran on the VM `botdylan` (on a physical host named `padthai`).
- Initially, the servers were scoped to be minimalist instances of the stock Ubuntu configuration. These particular VMs were generally designed to be ephemeral/replaceable, and they generate a tremendous amount of short-term data, so very little backup was provided. Some important parts of the configuration were put into automated form (ansible/bash), and some parts were handled manually. Of course, you'll be unsurprised to hear that (over time) the untracked manual parts grew. The conceptual goal of tracking stock was abandoned; OSes were upgraded; PPAs and other non-standard binaries were added; etc. Each step individually represented a calculated cost-benefit -- but the overall result was an accumulation of technical debt (in the form of manual configuration).
- During the week before the release, operations took two hits. First, I took a few days of vacation before release day (which reduced our admin capacity). Concurrently, the physical server (`padthai`) had a RAID failure that took down several VMs (`test-ubu1204-5`, `test-ubu1604-1`, `botdylan`, `www-demo`). bgm worked with the hosting provider to get the base `padthai` back online during the days before the release, but there was too much debt to get everything back quickly. Net result: when I returned on release day, our QA signals were offline, and we had a long way to go in restoring them. It ultimately took us several more days' work to get the release-critical parts back.
- Addressing this was a team effort -- with Mathieu working on the base system ops and `www-test`, while Seamus investigated test failures and compatibility problems between the code and the new environment. I worked on the setup scripts/jobs for the worker nodes and, eventually, the regular release workflow. And Eileen offered her sage support throughout. I can't speak for how much time went in overall -- but, for my part, this added a few 10-hour days to the process.
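To make the coupling concrete, here's a minimal sketch (in bash, since the tooling was ansible/bash) of the old style of job that was pinned to one VM. The version checks and placeholder steps are illustrative -- they are not the actual Jenkins job definitions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the old coupling: the job implicitly assumes the
# one PHP/MySQL mix baked into a specific VM (e.g. test-ubu1204-1), so it
# only runs correctly on that node.
set -e

# Fail fast if this isn't the node with the expected binaries.
php -v | grep -q 'PHP 5\.5' \
  || { echo 'Expected PHP 5.5 on this node' >&2; exit 1; }
mysql --version | grep -q 'Distrib 5\.5' \
  || { echo 'Expected MySQL 5.5 on this node' >&2; exit 1; }

# ...then run the suite against whatever this host happens to provide.
# (Placeholder for the real checkout/build/test steps.)
```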
Analysis
- Was it completely necessary to delay and to pay those costs? Consider the responses we could have taken on release day:
  - Ship while blind (without our normal QA signals). This would meet the announced schedule, but it could blow up in our faces if an obvious problem made it through.
  - Fall back to doing more release tasks by hand. From a CYA perspective, this would be safer than shipping blind, and it would have provided a more timely release date. But the labor involved would still be nontrivial, and we'd be no better off when the next release (5.5.1 or 5.6.0) came along.
  - Re-commit to automating the infrastructure. This would delay 5.5.0 longer, but it would leave us in a better position for publishing the next release and for making the test infra more robust. This is ultimately the path I went down, for a couple of reasons:
    - The x.x.0 release is not a security release or a regression patch-release. Letting the schedule slip a few days means missing a benchmark -- but it's not gonna kill anyone.
    - Once x.x.0 goes out, there's often a follow-up x.x.1, but it's hard to say when it'll happen. I really don't want to be in a position where we have a regression while our QA signals are all dark. This turned out to be a fortunate calculation -- because we had a 5.5.1 regression fix the very next day. (Cheers to Noah!) The regression release went out smoothly.
- I find myself somewhat ambivalent about the accumulation of technical debt that produced this situation. IIRC, we discussed the manual quirks in real time but didn't see an affordable alternative in the moment. So, naturally, we acquired a little technical debt that ultimately had to be repaid under the gun. I suspect it's a familiar story for anyone reading this.
Remediation
The upshot of the delay is that we were forced to pay down some technical debt -- i.e. to improve the scripted definitions of the worker nodes. The revised definitions:
- Include multiple versions of PHP and MySQL on the same host. Instead of splitting jobs among three qualitatively different VMs (`test-ubu1204-1`, `test-ubu1204-5`, `test-ubu1604-1`) to handle different binaries/versions, one VM (`test-1`) can handle different mixes of PHP/MySQL. This means that nodes are more interchangeable, and it's easier (than before) to reproduce, scale up, or scale down. (See the sketch after this list.)
- Can be installed on OS X and most Linux variants with a couple of short commands, without interfering with the main OS services/packages. (For the most part, the scripts originate as my local dev env -- where they needed to coexist with other stacks.)
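And here's a minimal sketch of the new style, assuming a hypothetical pair of helpers (`use-php`, `use-mysql`) for selecting versions at runtime -- the real commands are described in the worker-gen-3 document linked below.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the new style: the job picks its PHP/MySQL mix at
# runtime, so any worker can run any combination.
set -e

PHP_VER="${1:-7.0}"
MYSQL_VER="${2:-5.7}"

# Hypothetical: env.sh defines shell functions `use-php` and `use-mysql`
# (illustrative stand-ins for the real version-switching helpers).
source "$HOME/worker/env.sh"

use-php "$PHP_VER"       # e.g. prepend the matching bin/ directory to PATH
use-mysql "$MYSQL_VER"   # e.g. launch a throwaway mysqld for this job

# Sanity-check the selected mix before running the suite.
php -v
mysql --version
```

With per-job selection like this, the distinction between `test-ubu1204-1` and `test-ubu1604-1` stops mattering -- any node can serve any cell of the version matrix.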
See also: https://github.com/civicrm/civicrm-infra/blob/master/continuous-integration/worker-gen-3.md