test-1 is a constant trouble-maker
Background
test-1 has the following:
- Jenkins connects as user jenkins to create interactive test-sites and run PHPUnit. (This has recently been reduced. civicrm-core PRs no longer do phpunit here. It may be used for interactive test-sites and for testing alternate repos civicrm-{drupal,backdrop,wordpress}.)
- Jenkins connects as user publisher to build various tarballs/zips/phars, to sync l10n for ext's, and so on.
- Gitlab connects through gitlab-runner to do some small tasks.
Major Symptoms
- mysqld frequently crashes during phpunit test-runs. (Which is why I've stopped sending most test-runs to it.) Other hosts don't do this. It's always test-1.
- A few days after reducing its load, I found that even simpler jobs like Tool-Publish-civix were still failing. So I did a more controlled experiment.
  - Confirm the system is not running any jobs.
  - Manually run a similar task:
    - ssh in and become user publisher
    - cd ~/workspace/Tool-Publish-civix
    - nix-shell --run ./build.sh
    - (This uses php-cli to run composer and box; there are no system-services involved.)
  - It always times-out on test-1. Example failure.
    - In this case, note that composer config merely updates the composer.json, yet it fails after hanging for 60sec.
  - The same operation succeeds on other systems (incl localhost, test-2, and a busy instance of bknix-XXXXX VM). It usually takes 10-20s and prints a message about using (IIRC) 70-100mb RAM.
  - I've repeated this a few times over a few days.
- (Unmeasured/unverified) When test-3 was first introduced, it was only slightly faster than test-1. Now it feels significantly faster. They run the same software+configurations. Since both machines are reserved/fixed hardware, it feels more likely that test-1 has fallen back (rather than test-3 getting faster).
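To make the manual experiment above easier to repeat and compare across hosts, the run could be wrapped in a small helper that enforces a time limit and reports how the command ended. This is only a sketch: run_timed is a hypothetical helper name, the 120-second limit is an arbitrary choice (well above the 10-20s seen on healthy hosts), and it assumes GNU coreutils timeout is available.

```shell
# Sketch: run a command under a time limit and report how it ended.
# run_timed is a hypothetical helper, not part of any existing tooling.
# Usage: run_timed <limit-seconds> <command> [args...]
run_timed() {
  _rt_limit="$1"; shift
  _rt_start=$(date +%s)
  timeout "$_rt_limit" "$@"      # GNU timeout exits with 124 if the limit is hit
  _rt_status=$?
  _rt_end=$(date +%s)
  if [ "$_rt_status" -eq 124 ]; then
    echo "TIMEOUT after ${_rt_limit}s: $*"
  else
    echo "exit=$_rt_status elapsed=$((_rt_end - _rt_start))s: $*"
  fi
  return "$_rt_status"
}

# Example (as user publisher, with Jenkins idle; the 120s limit is an assumption):
#   cd ~/workspace/Tool-Publish-civix
#   run_timed 120 nix-shell --run ./build.sh
```

Running the same line on test-1, test-2, and a local machine would give one comparable summary line per host, which is easier to track over several days than eyeballing the build output.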
Theories
- Capacity/Load Theory: test-1 is given too many tasks (relative to its hardware/capacity).
  - Possible Interventions: Give it less to do. Add other hardware to take up the slack.
  - Pro: It does ordinarily/traditionally do a lot. Reducing load has alleviated symptoms in the past.
  - Con: It flakes out, even if there's negligible load. (But this "Con" may not be decisive. I didn't go as far as stopping background services -- I only confirmed that Jenkins was not sending work.)
- System Image Theory: The software or configuration on the test-1 system-image is buggy or error-prone.
  - Possible Interventions: Rebuild test-1. Ensure software config matches other systems (which work better).
  - Pro: It's probably the oldest continuously-operating system-image (among test-servers). There are multiple layers in it (Debian/Jenkins/Ansible/Gitlab/Nix/Docker/etc), and there may be cracks between them. Qualitatively, the system has had more jobs than others.
  - Con: The software is mostly the same as other hosts -- and mostly scripted. Other systems with similar software haven't had the same problems. (But this "Con" may not be decisive -- because others may not have the same age or same interactions between subsystems.)
- Hardware/Bad-SSD/Bad-RAM Theory: The
  - Possible Interventions: Ask OVH to move us to a new box, cancel+credit the contract on the box, or swap drives. We may need to rebuild the systems (depending on the details of that).
  - Pro: The hardware is several years old, and it's been used intensively. smartctl reports "Data Units Written" as 450+ TB(W). The manufacturer rates the SSD for 300 TB(W).
  - Con: smartctl hasn't reported any specific errors. (But this "Con" may not be decisive. In a case of degrading IO performance, a super-slow IO call may eventually succeed in its low-level task, but the delay would still cause higher-level tasks to fail.)
  - Con: Other VMs on paella haven't reported similar problems. (But this "Con" may not be decisive. Other VMs have fewer services and fewer tasks, and they don't get as much logging or attention.)
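For the SSD wear figures in the hardware theory, the arithmetic can be sketched in shell. This assumes the drive is NVMe, where smartctl prints a "Data Units Written" line and one data unit is 1000 x 512 bytes (512,000 bytes). The function names and the /dev/nvme0 device path are hypothetical, chosen only for illustration.

```shell
# Sketch under stated assumptions: NVMe drive, hypothetical helper names.

# Extract the raw "Data Units Written" counter from smartctl output on stdin.
data_units_written() {
  awk -F: '/Data Units Written/ {
    sub(/\[.*/, "", $2)       # drop the bracketed human-readable size
    gsub(/[^0-9]/, "", $2)    # drop spaces and thousands separators
    print $2
  }'
}

# Convert NVMe data units (512,000 bytes each) to decimal terabytes.
units_to_tb() {
  awk -v u="$1" 'BEGIN { printf "%.1f\n", u * 512000 / 1e12 }'
}

# Share of the rated endurance already consumed, in percent.
wear_percent() {
  awk -v w="$1" -v r="$2" 'BEGIN { printf "%.0f\n", 100 * w / r }'
}

# With the figures quoted above (450 TB written, 300 TB rated):
wear_percent 450 300    # prints 150

# On the host itself (device path is an assumption):
#   raw=$(sudo smartctl -A /dev/nvme0 | data_units_written)
#   wear_percent "$(units_to_tb "$raw")" 300
```

By this measure the drive is at roughly 150% of its rated endurance, which supports the "Pro" point without proving a fault. A rated TBW figure is a warranty threshold, not a hard failure point.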