PHP-FPM hangs on test-3
Background
test-3 is a bare-metal host running Debian testing (bookworm) with bknix's install-runner.sh and install-demo.sh. These scripts create a user, dispatcher, who can create/run homerdo-style containers.
test-3 was working for a while - until it wasn't. The initial symptom was that calls to iptables -t nat failed. This command relies on a kernel module, nf_nat.ko, which was not loading. dmesg showed:
[1425428.654285] failed to validate module [nf_nat] BTF: -22
[1425428.662088] missing module BTF, cannot register kfuncs
The error indicated that a kernel data-structure ("BTF") was malformed. There was an unattended upgrade for linux-image-*.deb somewhere in the prior week (probably without a corresponding reboot). It seems that the in-memory kernel and the on-disk *.ko files got out of sync. Rebooting fixed this (booting anew with the on-disk vmlinuz that matches the on-disk *.ko files).
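A quick way to spot this kind of drift before it bites is to compare the running kernel against the installed package. The following is a minimal sketch, not something from the original setup: it assumes Debian-style `uname -v` output (which embeds the package version) and that the package is named `linux-image-$(uname -r)`. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: detect a running kernel that no longer matches the installed package.
# ASSUMPTION: Debian-style `uname -v` output, which embeds the package version,
# e.g. "#1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08)".

# kernel_in_sync BANNER PKG_VERSION
# Succeeds if the running kernel's banner mentions the installed version.
kernel_in_sync() {
  case "$1" in
    *"$2"*) return 0 ;;
    *)      return 1 ;;
  esac
}

# check_kernel: compare the live kernel with the linux-image package on disk.
# Returns 0 if in sync, 1 on mismatch, 2 if the package lookup fails.
check_kernel() {
  running=$(uname -v)
  installed=$(dpkg-query -W -f='${Version}' "linux-image-$(uname -r)") || return 2
  if kernel_in_sync "$running" "$installed"; then
    echo "in sync: $installed"
  else
    echo "MISMATCH: running '$running' but installed '$installed' (reboot needed?)"
    return 1
  fi
}
```

Calling `check_kernel` after an unattended upgrade (or from a cron job) would flag the host as needing a reboot before modules like nf_nat.ko fail to load.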
Unfortunately, rebooting broke a lot of other things (or revealed latent issues). The br0 config files (e.g. /etc/systemd/network/50-br0.network) had a typo in the IP address, so networking didn't come up. (That's fixed.) The DNS options for 50-br0.network would not propagate to /etc/resolv.conf. (That now has a work-around via /etc/resolvconf/resolv.conf.d/head.) The networking is working now.
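To confirm that the DNS work-around is actually taking effect, one could check that every nameserver listed in the head file shows up in the generated resolv.conf. This is a small sketch under that assumption; `missing_nameservers` is a hypothetical helper, not part of bknix or resolvconf.

```shell
#!/bin/sh
# missing_nameservers HEAD_FILE RESOLV_FILE
# Prints each nameserver listed in HEAD_FILE that is absent from RESOLV_FILE.
# Prints nothing when all of them propagated.
missing_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "$1" | while read -r ns; do
    # The inner awk exits 0 only if RESOLV_FILE has a matching nameserver line.
    awk -v ns="$ns" '$1 == "nameserver" && $2 == ns { found = 1 } END { exit !found }' "$2" \
      || echo "$ns"
  done
}

# Example invocation on the host (paths taken from the text above):
#   missing_nameservers /etc/resolvconf/resolv.conf.d/head /etc/resolv.conf
```

Empty output means the work-around is in place. Any printed address is one that did not make it into /etc/resolv.conf.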
There's one more issue remaining from the reboot -- PHP-FPM hanging.
Steps to reproduce via CiviCRM-Core-Matrix.job
One way to reproduce this is to run an E2E test-job.
sudo -iu dispatcher bash
env CIVIVER=master SUITES=phpunit-e2e run-bknix-job --mock max CiviCRM-Core-Matrix
The job will start, and several tests will pass. The problem appears after E2E\Core\AssetBuilderTest, while running the next test (E2E\Core\ErrorTest), which sends an HTTP request (via Guzzle). The request is routed through Apache to PHP-FPM, which fails to respond. It appears that the test-suite hangs.
After it hangs, and after you lose patience, press Ctrl-C to shut down the job.
These steps are nice because there's only one command to fire. However, it's a little awkward to open a shell and manually investigate.
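If waiting for the hang by hand gets tedious, the job can be wrapped in a deadline. This is a sketch using coreutils `timeout`. The wrapper name and the 1800-second figure in the usage comment are arbitrary choices for illustration, and it assumes the job reacts to SIGINT the same way it reacts to Ctrl-C.

```shell
#!/bin/sh
# run_with_deadline SECONDS COMMAND [ARGS...]
# Sends SIGINT (like Ctrl-C) when the deadline passes, then SIGKILL 10s later
# if the command is still alive. `timeout` exits 124 when the deadline was hit
# (or 137 if it had to escalate to SIGKILL).
run_with_deadline() {
  deadline=$1
  shift
  timeout --signal=INT --kill-after=10 "$deadline" "$@"
}

# Example (inside the dispatcher shell):
#   run_with_deadline 1800 env CIVIVER=master SUITES=phpunit-e2e \
#     run-bknix-job --mock max CiviCRM-Core-Matrix
#   echo "exit status: $?"   # 124 suggests the suite hung
```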
NOTE: run-bknix-job is intended to run inside Jenkins. The --mock option will fill in some placeholder values so that you can run jobs without Jenkins. However, --mock is limited to one job at a time. Be sure to shut down (Ctrl-C) before invoking it again. If you try to mock Jenkins multiple times concurrently, then it may produce errors about MySQL initialization.
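One way to guard against accidentally starting a second mock is an exclusive lock around the command. The sketch below uses `flock` from util-linux. The wrapper name and the lock-file path are hypothetical, and nothing in run-bknix-job itself is known to do this.

```shell
#!/bin/sh
# run_single_mock COMMAND [ARGS...]
# Holds an exclusive lock for the lifetime of COMMAND. A second invocation
# fails immediately (exit status 1) instead of starting a concurrent mock.
# MOCK_LOCK and its default path are hypothetical choices for this sketch.
run_single_mock() {
  flock --nonblock "${MOCK_LOCK:-/tmp/run-bknix-job-mock.lock}" "$@"
}

# Example:
#   run_single_mock env CIVIVER=master SUITES=phpunit-e2e \
#     run-bknix-job --mock max CiviCRM-Core-Matrix
```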
Steps to reproduce via demo service
Here's another way to reproduce the problem: make a web-site in the demo environment.
## Open a shell
ssh test-3.civicrm.org
sudo -iu dispatcher -- homerdo enter -i images/demo.img -- use-bknix max -cs
## Display info about demo "max" services
loco status
loco info
## Create an empty site with some basic examples
## NOTE: At time of writing, I've already done this. But if we restart demos, then you'll need to do it again.
civibuild create mymax --type empty
echo '<?php echo "Hello";' > build/mymax/web/hi.php
echo 'Hello' > build/mymax/web/hi.txt
This creates a site mymax.test-3.civicrm.org with two files (hi.txt and hi.php). Here are several ways to display hi.txt and hi.php. All should show "Hello".
## Access the file directly - via CLI
php build/mymax/web/hi.php ## PHP-CLI
cat build/mymax/web/hi.txt ## Basic file
## Access the main httpd for "max" sites. This is :8003.
curl 'http://mymax.test-3.civicrm.org:8003/hi.txt' ## Apache file-serving
curl 'http://mymax.test-3.civicrm.org:8003/hi.php' ## PHP-FPM
## Access a public-facing (reverse-proxy) via :80. This forwards to :8003.
curl 'http://mymax.test-3.civicrm.org/hi.txt' ## Apache file-serving
curl 'http://mymax.test-3.civicrm.org/hi.php' ## PHP-FPM
## Connect directly to PHP-FPM over TCP
SCRIPT_NAME=/hi.php SCRIPT_FILENAME=$PWD/build/mymax/web/hi.php REQUEST_METHOD=GET \
cgi-fcgi -bind -connect 127.0.0.1:9011
You will see that requests to PHP-FPM hang. But PHP-CLI and Apache file-serving work fine.
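Since the hanging requests never return on their own, it helps to probe each URL with a timeout and label the outcome. This sketch relies on curl's documented exit status 28 (operation timed out). The helper names and the 5-second default are assumptions made for illustration.

```shell
#!/bin/sh
# classify_curl_status EXIT_CODE
# Maps a curl exit status to a label. Per curl's documented exit codes,
# 28 means the operation timed out.
classify_curl_status() {
  case "$1" in
    0)  echo "OK" ;;
    28) echo "HANG" ;;
    *)  echo "FAIL($1)" ;;
  esac
}

# probe URL [TIMEOUT_SECONDS]
# Fetches URL, discards the body, and prints a one-line verdict.
probe() {
  rc=0
  curl --silent --fail --output /dev/null --max-time "${2:-5}" "$1" || rc=$?
  echo "$(classify_curl_status "$rc") $1"
}

# Example, using the URLs from the steps above:
#   for path in hi.txt hi.php; do
#     probe "http://mymax.test-3.civicrm.org:8003/$path"
#     probe "http://mymax.test-3.civicrm.org/$path"
#   done
```

With the current symptoms, the hi.txt lines would be expected to report OK and the hi.php lines HANG.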
Deep-dive: PHP-FPM in foreground or background
While debugging, I found one interesting clue: PHP-FPM on test-3 works fine if PHP-FPM is launched via loco run (foreground execution). It only suffers from hangs if launched via loco start (background execution). To play with these, here are some key commands:
- loco status: Show status of php-fpm background process.
- loco info: Show more info about how php-fpm is launched (config files, port numbers, etc.)
- loco stop php-fpm: Stop the background task
- loco start php-fpm: Start php-fpm in background mode
- loco run php-fpm: Run php-fpm in foreground mode
The effect is:
- If you use loco start php-fpm, then requests (via curl or cgi-fcgi) hang.
- If you use loco run php-fpm, then requests (via curl or cgi-fcgi) work fine.
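Because the only known variable is the launch style, one possible next step is to snapshot the php-fpm process in each mode and diff the results. The sketch below reads standard Linux /proc entries. The helper name, the output paths, and the use of `pgrep -o php-fpm` to find the master process are assumptions. If several php-fpm instances run on the host, pick the PID by hand.

```shell
#!/bin/sh
# snapshot_proc PID OUTDIR
# Saves resource limits, environment, and open file descriptors of a process
# so that two launch styles can be compared with diff.
snapshot_proc() {
  pid=$1
  out=$2
  mkdir -p "$out"
  cat "/proc/$pid/limits" > "$out/limits"
  # environ is NUL-separated; convert to one sorted variable per line.
  tr '\0' '\n' < "/proc/$pid/environ" | sort > "$out/environ"
  # Record where each descriptor points (tty, pipe, socket, /dev/null, ...).
  for fd in "/proc/$pid/fd"/*; do
    echo "${fd##*/} -> $(readlink "$fd")"
  done > "$out/fd"
}

# Example:
#   loco start php-fpm
#   snapshot_proc "$(pgrep -o php-fpm)" /tmp/fpm-start
#   loco stop php-fpm
#   # ...then in a second shell, while `loco run php-fpm` is in the foreground:
#   snapshot_proc "$(pgrep -o php-fpm)" /tmp/fpm-run
#   diff -ru /tmp/fpm-start /tmp/fpm-run
```

Socket and pipe identifiers will differ between runs, so the interesting lines in the diff are things like stdin/stdout targets, missing environment variables, or different limits.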
Comparison
The same scripts work in several environments. This includes:
System | Host Env | PHP-FPM/Loco Behavior | Comment |
---|---|---|---|
test-1 | Debian Buster (x64) | ? | Not tested |
test-3 (prior to reboot) | Debian Bookworm (x64) | OK | |
test-3 (after reboot) | Debian Bookworm (x64) | Hangs | Depends on launch-style (loco start php-fpm vs loco run php-fpm) |
test-4 | Ubuntu Jammy Server (x64) | OK | |
bknix-run-* (gcloud ephemeral hosts) | Ubuntu Jammy Server (x64) | OK | |
Local desktop | Ubuntu Jammy Desktop (x64) | OK | |
Local desktop: VM (bkvm) | Ubuntu Jammy Server (x64) | OK | |
Local laptop | MacOS 12.6 (arm64) | OK | |
Local laptop: VM (bookie) | Debian Bookworm (arm64) | OK | Different CPU |
Commentary
- To spin up a clean bookworm VM, I usually use gcloud or Digital Ocean. Had to fall back to a local VM because bookworm[testing] isn't available in those cloud media.
- In ancient times, I used Debian testing as a main desktop OS for >1 year. I appreciate that it has both periods of stability and periods of flakiness.
- I can imagine many hypotheses for the PHP-FPM problem. It could point to a bug or compatibility-issue or configuration-issue somewhere in loco or linux-image*.deb or util-linux or homerdo or sysctl or pam_limits or yadda yadda. It could also point to some "test-only monkeywrench" -- where an intermediate revision of a *.deb or script-file left some weird artifact that only afflicts hosts that previously ran a specific revision.
The way I see it, here are a couple of ways to go at this:
- Push Forward: Take some of the current clues and dig in to identify critical differences. In particular:
  - Within test-3, why does php-fpm work with loco run but not loco start? Maybe investigating this leads to a resolution.
  - Comparing test-3 to other systems (test-4, bookie, and test-3-before-last-week), why is it only test-3-now that has trouble with php-fpm? Maybe investigating this leads to a resolution.
- Tactical Retreat: Reset test-3 to a clean baseline. Rebuild with Ubuntu jammy and rerun install-runner.sh.
  - (When bookworm===stable, then spin up a VM and see if the problem is still there.)
I feel like I spent a lot of time trying to push forward. This had some upside: bookworm[testing] on test-3 did lead to a few improvements for forward compatibility, and the test-stack was working on test-3. But test-3 also had these rabbit holes (re: *.ko, DNS) that did nothing for future compatibility (and made me cranky).
From where we stand, I can't tell for sure if the PHP-FPM problem is a real bug or a test-only monkey-wrench.
Personally... I'm ready for the tactical retreat...