test-3 is the new trouble-maker
Background and Symptoms
When I opened #989 (closed), the central complaint was that test-1 was behaving worse than its newer sibling test-3. This was based on anecdotal reading of build logs over several months - e.g. test-1 frequently had unusual timeouts/crashes. test-3 didn't -- in fact, for apples-to-apples runtime comparisons, it tended to be the fastest runner.
Now... just as we've figured out and resolved a theory about test-1, I'm seeing (anecdotally) similar misbehavior on test-3, e.g. in CiviCRM-Core-Matrix (phpunit-api3 and phpunit-crm, in various environments). A month ago, these would take 30-45m. Here they are running for 15 hours (1, 2, 3). Those jobs did make slow progress (which is why Jenkins didn't kill the jobs). But they were clearly not working.
The final theory on #989 (closed) was that we wore out the SSDs on test-1 (i.e. 450+ TBW actual vs 300 TBW rated). It does appear that test-3 has pretty advanced wear (i.e. 380+ TBW actual vs 220 TBW rated). This feels plausible:
- test-1 is older - so it aged out sooner.
- test-1 and test-3 run many of the same tasks. The tasks involve generating new test sites from git -- and generating one site costs about 350-800 MBW (megabytes-written).
- In the past few months, both test-1 and test-3 had their load ramped up because we switched to parallel task execution. (Roughly: with the old/singular test-job, one PR generated 350-800 MBW; with the newer/split test-job, there are 10x test-runs in parallel -- which means 3500-8000 MBW.)
- More recently, after test-1 became intolerable, we suspended it from the cluster for a week or two. But those jobs had to go somewhere -- so test-3 probably bore the brunt of it.
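To make the write-budget arithmetic concrete, here is a rough sketch using only the figures quoted above (350-800 MBW per site, 10 parallel test-runs per PR). The function names are made up for illustration, 1 TB is taken as 1,000,000 MB, and the estimate ignores write amplification and any other writes on the host.

```shell
#!/usr/bin/env bash
# Rough write-budget arithmetic. Helper names are hypothetical.

# mb_per_pr MB_PER_SITE PARALLEL_RUNS -> MB written per PR
# (assumes each parallel test-run generates its own site)
mb_per_pr() { echo $(( $1 * $2 )); }

# prs_per_tb MB_PER_PR -> whole number of PRs that add up to 1 TB written
# (assumes decimal units: 1 TB = 1,000,000 MB)
prs_per_tb() { echo $(( 1000000 / $1 )); }

for site_mb in 350 800; do
  per_pr=$(mb_per_pr "$site_mb" 10)
  echo "site=${site_mb}MB per_pr=${per_pr}MB prs_per_TB=$(prs_per_tb "$per_pr")"
done
```

By this estimate, with the split test-job, somewhere around 125 to 285 PRs add up to one terabyte written.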
Still, the history on test-3 shows several smaller, successful jobs from the past week. For comparison: test-3 today feels (to me) like test-1 of late 2022. It's flaking but not totally.
Interventions (General)
There are a few general ideas:
- Reduce the incremental SSD writes per test (so that we don't run through TBW so quickly)
- Move test data from SSD to ramdisk (which doesn't suffer TBW limits)
(Both are easier said than done...)
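For either idea, it helps to measure how much a single test-run actually writes. One possible sketch (not something from the existing scripts) reads the "sectors written" counter that Linux exposes in /sys/block/DEV/stat (7th field, in 512-byte units) before and after a run. The helper names and the device name `sda` are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: estimate MB written to a block device across one test-run.
# Helper names are hypothetical.

# sectors_written STAT_FILE -> cumulative sectors written (field 7 of the stat file)
sectors_written() { awk '{ print $7 }' "$1"; }

# mb_written BEFORE_SECTORS AFTER_SECTORS -> MB written (512-byte sectors, 1 MB = 1,000,000 bytes)
mb_written() { echo $(( ($2 - $1) * 512 / 1000000 )); }

# Usage (device name is an assumption; adjust to the real SSD):
#   before=$(sectors_written /sys/block/sda/stat)
#   ...run one test job...
#   after=$(sectors_written /sys/block/sda/stat)
#   echo "wrote $(mb_written "$before" "$after") MB"
```

The counter covers everything written to the device, so the reading is only meaningful on an otherwise quiet host.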
I've been working (off-and-on) to reshape the test-scripts so that we can reliably fit most of the test-runs in ramdisk. This isn't quite ready yet though.
Intervention (Specific)
Use test-3 as a guinea pig -- put /home/jenkins/**/build into RAM. As it happens, the physical host has 20-30 GB of unallocated RAM, so we might fit it. I think this hack would achieve that:
## Stop services...
## Keep the old home as a baseline/snapshot
mv /home/jenkins /home/jenkins.base
mkdir /home/jenkins
## Make a temp space (in RAM) where most work happens
mkdir /home/jenkins.temp
mount -t tmpfs -o size=25g tmpfs /home/jenkins.temp
mkdir -p /home/jenkins.temp/{up,work}
## Combine them together (upperdir and workdir must live on the same filesystem)
mount -t overlay overlay \
  -o lowerdir=/home/jenkins.base,upperdir=/home/jenkins.temp/up,workdir=/home/jenkins.temp/work \
  /home/jenkins
## Start services
This is specifically a quick hack -- to alleviate current symptoms and try out some techniques. The overall process:
- In the interim, if there are any reboots or php-mysql upgrades, that kind of maintenance will require manual intervention.
- Continue working on replacement scripts.
- In a few weeks, decommission test-3 and spin up a similar replacement (similar price; newer CPU+SSD) using newer scripts.
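Since the overlay in the quick hack will not survive a reboot, a sanity check along these lines could confirm that the mounts are in place before services start. This is a minimal sketch, not part of any existing script: `fstype_of` is a hypothetical helper that parses /proc/mounts, and the paths are the ones used in the hack above.

```shell
#!/usr/bin/env bash
# Sketch: verify that the RAM-backed overlay is actually mounted.

# fstype_of MOUNTPOINT [MOUNTS_FILE]
# Prints the filesystem type of the last mount at MOUNTPOINT
# (a later mount shadows an earlier one). Prints nothing if not mounted.
fstype_of() {
  awk -v mp="$1" '$2 == mp { t = $3 } END { if (t != "") print t }' "${2:-/proc/mounts}"
}

# Warn (on stderr) if either mount is missing or has the wrong type.
[ "$(fstype_of /home/jenkins.temp)" = "tmpfs" ] \
  || echo "WARNING: /home/jenkins.temp is not a tmpfs mount" >&2
[ "$(fstype_of /home/jenkins)" = "overlay" ] \
  || echo "WARNING: /home/jenkins is not an overlay mount" >&2
```

Running `df -h /home/jenkins.temp` alongside this shows how much of the tmpfs is in use, which is worth watching given the limited RAM.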