test-1 is a constant trouble-maker

Background

test-1 has the following:

Jenkins connects as user jenkins to create interactive test-sites and run PHPUnit
- (This has recently been reduced. civicrm-core PRs no longer do phpunit here. It may be used for interactive test-sites and for testing alternate repos civicrm-{drupal,backdrop,wordpress}.)
Jenkins connects as user publisher to build various tarballs/zips/phars, to sync l10n for extensions, and so on.
Gitlab connects through gitlab-runner to do some small tasks

Major Symptoms

mysqld frequently crashes during phpunit test-runs. (Which is why I've stopped sending most test-runs to it.) Other hosts don't do this. It's always test-1.
A few days after reducing its load, I found that even simpler jobs like Tool-Publish-civix were still failing. So I did a more controlled experiment.
- Confirm the system is not running any jobs.
- Manually run a similar task
  - ssh in and become user publisher
  - cd ~/workspace/Tool-Publish-civix
  - nix-shell --run ./build.sh
  - (This uses php-cli to run composer and box; there are no system-services involved)
- It always times out on test-1. Example failure.
  - In this case, note that composer config merely updates composer.json, yet it fails after hanging for 60sec.
- The same operation succeeds on other systems (incl localhost, test-2, and a busy instance of bknix-XXXXX VM). It usually takes 10-20s and prints a message about using (IIRC) 70-100mb RAM.
- I've repeated this a few times over a few days.
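To make the repeated runs comparable across hosts, the manual experiment above could be wrapped in a small timing helper. This is only a sketch: `timed_run` is a hypothetical name, and the outer timeout is my own addition (separate from whatever internal 60sec limit produced the failure above).

```python
import subprocess
import time


def timed_run(cmd, timeout_s=300, cwd=None):
    """Run cmd and return (status, elapsed_seconds).

    status is the process exit code, or the string "timeout" if the
    command did not finish within timeout_s seconds.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, cwd=cwd, timeout=timeout_s)
        status = proc.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising this.
        status = "timeout"
    return status, time.monotonic() - start
```

Assuming it is run as user publisher, a call such as `timed_run(["nix-shell", "--run", "./build.sh"], cwd="/path/to/workspace/Tool-Publish-civix")` (path is a placeholder) would give one data point per host, which is easier to compare than "it hung for a while".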
(Unmeasured/unverified) When test-3 was first introduced, it was only slightly faster than test-1. Now it feels significantly faster. They run the same software+configurations. Since both machines are reserved/fixed hardware, it feels more likely that test-1 has fallen back (rather than test-3 getting faster).

Theories

Capacity/Load Theory: test-1 is given too many tasks (relative to its hardware/capacity).
- Possible Interventions: Give it less to do. Add other hardware to take the slack.
- Pro: It does ordinarily/traditionally do a lot. Reducing load has alleviated symptoms in the past.
- Con: It flakes out, even if there's negligible load. (But this "Con" may not be decisive. I didn't go as far as stopping background services -- I only confirmed that Jenkins was not sending work.)
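Since the "Con" above rests on the load being negligible, it might help to record the actual load average right before each manual run, instead of only checking that Jenkins is quiet. A rough sketch, where the function names and the 0.1 cutoff are arbitrary assumptions of mine:

```python
import os


def load_per_cpu():
    """Return the 1-minute load average divided by the CPU count."""
    one_min, _five_min, _fifteen_min = os.getloadavg()
    return one_min / (os.cpu_count() or 1)


def looks_idle(threshold=0.1):
    """True if the normalized 1-minute load is below threshold.

    The default threshold is an arbitrary cutoff, not a measured value.
    """
    return load_per_cpu() < threshold
```

If `looks_idle()` is true and the build still hangs, that would weaken the Capacity/Load Theory a bit more than the Jenkins-only check does. It still would not account for background services that cause IO without much CPU load.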
System Image Theory: The software or configuration on test-1 system-image is buggy or error-prone.
- Possible Interventions: Rebuild test-1. Ensure software config matches other systems (which work better).
- Pro: It's probably the oldest continuously-operating system-image (among test-servers). There are multiple layers in it (Debian/Jenkins/Ansible/Gitlab/Nix/Docker/etc), and there may be cracks between them. Qualitatively, the system has had more jobs than others.
- Con: The software is mostly the same as other hosts -- and mostly scripted. Other systems with similar software haven't had the same problems. (But this "Con" may not be decisive -- because others may not have the same age or same interactions between subsystems.)
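One cheap way to test the "software is mostly the same" claim would be to diff package lists between test-1 and a healthier host. The sketch below assumes each list was captured beforehand as `name<TAB>version` lines (for the Debian layer, something like `dpkg-query -W -f='${Package}\t${Version}\n'`); the helper names are hypothetical, and it says nothing about the Nix, Docker, or Ansible layers.

```python
def parse_packages(text):
    """Parse lines of 'name<TAB>version' into a {name: version} dict."""
    packages = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, version = line.partition("\t")
        packages[name.strip()] = version.strip()
    return packages


def diff_packages(a, b):
    """Return {name: (version_in_a, version_in_b)} for every mismatch.

    A version of None means the package is missing on that host.
    """
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

An empty result would support the "Con" above; a long one would point at places where the system-image has drifted.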
Hardware/Bad-SSD/Bad-RAM Theory: The hardware underneath test-1 (e.g. SSD or RAM) is failing or degraded.
- Possible Interventions: Ask OVH to move us to a new box, cancel+credit the contract on the box, or swap drives. We may need to rebuild the systems (depending on the details of that).
- Pro: The hardware is several years old, and it's been used intensively. smartctl reports "Data Units Written" as 450+ TB(W). The manufacturer rates the SSD for 300 TB(W).
- Con: smartctl hasn't reported any specific errors. (But this "Con" may not be decisive. In a case of degrading IO performance, a super-slow IO call may eventually succeed in its low-level task - but the delay would still cause higher-level tasks to fail.)
- Con: Other VMs on paella haven't reported similar problems. (But this "Con" may not be decisive. Other VMs have fewer services and fewer tasks, and they don't get as much logging or attention.)
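Two things in the Hardware theory can be put into numbers. First, 450 TB written against a 300 TB rating is about 150% of the rated endurance. Second, the "super-slow IO call" idea could be probed by timing small synced writes on test-1 and on a healthy host. The following is a sketch under my own assumptions (hypothetical function names, arbitrary round count and write size), not a proper disk benchmark; on a VM the numbers will also include virtualization overhead.

```python
import os
import tempfile
import time


def wear_ratio(written_tb, rated_tb):
    """Fraction of the rated write endurance already consumed."""
    return written_tb / rated_tb


def fsync_latencies(directory, rounds=20, size=4096):
    """Write `size` bytes and fsync, `rounds` times.

    Returns the wall-clock seconds taken by each round. The temporary
    file is created inside `directory` and removed afterwards.
    """
    payload = os.urandom(size)
    latencies = []
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        for _ in range(rounds):
            start = time.perf_counter()
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # force the write down to the device
            latencies.append(time.perf_counter() - start)
    return latencies
```

Here `wear_ratio(450, 300)` gives 1.5. For the latency probe, the interesting figure is probably `max(...)` rather than the average: a few multi-second outliers on test-1 (and none elsewhere) would fit the picture of IO calls that eventually succeed but stall higher-level tasks.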

Edited Feb 28, 2023 by totten
