Infrastructure issues
https://lab.civicrm.org/groups/infra/-/issues
2024-03-13T12:59:34Z

https://lab.civicrm.org/infra/ops/-/issues/1020
Update transifex (run string-extraction on civicrm core) - 5.72
2024-03-13T12:59:34Z, bgm

Last run on 2024-02-03 #1019.
https://lab.civicrm.org/dev/translation/-/wikis/Pushing-new-strings-to-transifex
- ESR: 5.69
- RC: 5.72

Assignee: bgm

https://lab.civicrm.org/infra/ops/-/issues/1019
Update transifex (run string-extraction on civicrm core) - 5.70
2024-03-12T20:44:08Z, bgm

Last run on 2023-06-17 #996.
https://lab.civicrm.org/dev/translation/-/wikis/Pushing-new-strings-to-transifex
- ESR: 5.69
- RC: 5.70

https://lab.civicrm.org/infra/ops/-/issues/1018
Move latest.civicrm.org from OSUOSL to paella.civicrm.org (OVH)
2024-01-05T16:53:09Z, bgm

* [x] OVH: assign an IP address, add a "virtual mac" of type OVH, use the full hostname as the vMAC name
* [x] KVM server: create a ZFS volume for the VM
* ex: `zfs create -s -V 70G [pool]/[name-of-vm]` (see `zpool status; zfs list`)
* [x] Ansible: copy a relevant example from `host_vars/[vm]` for the new server, adapt values (hostname and IPs)
* [x] Ansible: add the hostname in the `hosts` file
* [x] Ansible: add the hostname as a preseed in `host_vars/kvm-foo.example.org` (the parent server)
* [x] Ansible: generate the preseed file: `ansible-playbook -l kvm-foo.example.org --tags kvm-server-preseeds ./site.yml`
* [x] KVM server: start the installation
* `ssh root@x[...].example.org`
* `/etc/preseeds/[hostname]/start.sh`
* [x] Change the preseed password
* [x] Ansible (create deploy user): `ansible-playbook -l xxxx.example.org -u myuser --become-user=root --ask-become-pass ./setup.yml`
* [x] Ansible (full installation): `ansible-playbook -l xxxx.example.org ./site.yml`
* [x] Test that the VM reboots cleanly
* [x] Migrate the services from latest.civicrm.org
* [x] latest.civicrm.org pingback service (and mysql DB)
* [x] stats.civicrm.org (static? deprecate?)
* [x] releaser files
* [x] Monitoring: update the host in Icinga and re-enable Icinga on the new VM (it was disabled to avoid conflicts)
* [x] Update the A/AAAA records for latest.civicrm.org and stats.civicrm.org
* [x] OVH: configure rDNS for the IPv4 and IPv6 addresses
* [x] Backups: double-check that backups are running (they were initially disabled, so as not to cause conflicts with the current production VM)
* [x] Verify that monitoring is green
* [ ] Shutdown the old VM?
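
The DNS step in the checklist above can be double-checked with a small script. This is a minimal sketch, not part of the actual migration tooling: it assumes the new addresses are the ones recorded in this issue, and the function names `resolve` and `records_match` are made up for illustration. The resolver is injectable so the comparison logic can be exercised without network access.

```python
import ipaddress
import socket

# New addresses for latest.civicrm.org, as recorded in this issue.
EXPECTED = {"192.95.2.135", "2607:5300:203:6713:700::"}


def resolve(host, resolver=socket.getaddrinfo):
    """Return the set of normalized IP addresses that `host` resolves to."""
    infos = resolver(host, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return {ipaddress.ip_address(info[4][0]) for info in infos}


def records_match(host, expected, resolver=socket.getaddrinfo):
    """True when the resolved addresses are exactly the expected ones."""
    wanted = {ipaddress.ip_address(addr) for addr in expected}
    return resolve(host, resolver) == wanted
```

Calling `records_match("latest.civicrm.org", EXPECTED)` from a machine with IPv4 and IPv6 resolution would then report whether the A/AAAA update has propagated to the local resolver. Addresses are normalized with `ipaddress`, so different spellings of the same IPv6 address compare equal.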
Old DNS records:
- 140.211.167.189
- 2605:bc80:3010:102:0:3:5:0
New DNS records:
- 192.95.2.135
- 2607:5300:203:6713:700::

https://lab.civicrm.org/infra/ops/-/issues/1017
Test runs do composer install before applying the PR patch, meaning changes to composer.json/lock aren't being tested
2023-12-29T22:43:01Z, DaveD

e.g. https://github.com/civicrm/civicrm-core/pull/28813
Possibly because of https://github.com/civicrm/civicrm-buildkit/commit/3081d83c648f6aa244f33380e7a3054deb515bf7#diff-652e839cd6819e54de3eafe2ac9126fa3da1d288d271a0fa681f28048111fa22R25 ?

https://lab.civicrm.org/infra/gitlab/-/issues/44
CiviCRM logo huge, has no width constraint on public/anon-user pages
2023-12-23T19:06:03Z, savionlee

When trying to access the GitLab today on browsers that were not signed in, the logo was using its native height and width causing the header to cover a lot of the page.
![image](/uploads/93af557402e1fa769e45876ace62f5d0/image.png)
I couldn't resolve it by disabling every extension, tracking prevention, and safe scripts.
To make it smaller, I used the inspector/devtools and gave it a width of 25px.
It doesn't appear to be an issue once you sign in.
Steps to replicate:
1. open a private window
2. navigate to https://lab.civicrm.org

https://lab.civicrm.org/infra/ops/-/issues/1016
Upgrade Java v17 for jenkins and nodes
2023-12-11T20:28:09Z, bgm

https://lab.civicrm.org/infra/ops/-/issues/1015
Jenkins upgrade is broken
2023-12-14T18:51:49Z, bgm

I noticed that we did not have the correct PGP key for newer Jenkins releases, so I updated that, and then Debian did an upgrade:
```
root@test:~# apt-cache policy jenkins
jenkins:
Installed: 2.375.3
Candidate: 2.375.3
Version table:
*** 2.375.3 500
500 http://pkg.jenkins.io/debian-stable binary/ Packages
100 /var/lib/dpkg/status
```
The upgrade:
```
Unpacking jenkins (2.426.1) over (2.375.3) ...
Setting up jenkins (2.426.1) ...
```
That worked fine (well, Jenkins started; I don't know about all the plugins).
Then, since the VM was running Debian 10/Buster, I upgraded it to Bullseye. Looking at the logs, I don't think it upgraded Jenkins again, but it presumably upgraded dependencies (jdk?). After rebooting, Jenkins refused to start.
Example backtrace in logs:
```
Dec 08 18:47:39 test jenkins[1695]: 2023-12-08 18:47:39.660+0000 [id=29] WARNING hudson.ExtensionFinder$Sezpoz#scout: Failed to scout io.jenkins.plugins.analysis.warnings.groovy.ParserConfiguration
Dec 08 18:47:39 test jenkins[1695]: java.lang.ClassNotFoundException: edu.hm.hafner.util.NoSuchElementException
Dec 08 18:47:39 test jenkins[1695]: at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
Dec 08 18:47:39 test jenkins[1695]: at jenkins.util.URLClassLoader2.findClass(URLClassLoader2.java:35)
Dec 08 18:47:39 test jenkins[1695]: at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:594)
Dec 08 18:47:39 test jenkins[1695]: at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527)
Dec 08 18:47:39 test jenkins[1695]: Caused: java.lang.NoClassDefFoundError: edu/hm/hafner/util/NoSuchElementException
```

https://lab.civicrm.org/infra/ops/-/issues/1014
Deprecate the l10n tar.gz
2024-02-07T19:23:27Z, bgm

If [PR#28139](https://github.com/civicrm/civicrm-core/pull/28139) is merged, we could get rid of the civicrm-l10n.tar.gz once the last ESR that relies on it is unsupported.
- [ ] Make sure that install docs and i18n guide (wiki page) have instructions about the new way to install languages since https://github.com/civicrm/civicrm-core/pull/28061 (CiviCRM 5.69)
- [ ] dev/drupal#193 Deprecate the old installer (including from the Drupal7 drush module)
- [ ] Remove it from buildkit builds
- [ ] Remove the link from the civicrm.org/download page
- [ ] Stop generating the l10n tarballs

https://lab.civicrm.org/infra/ops/-/issues/1013
mysql 8.0.29 being used for "max" but it can't even be downloaded anymore because it has a bad bug
2023-11-12T20:09:23Z, DaveD

It came up in another context that 8.0.29 is what's being used, but see https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-29.html
> This release is no longer available for download. It was removed due to a critical issue that could cause data in InnoDB tables having added columns to be interpreted incorrectly. Please upgrade to MySQL 8.0.30 instead.
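
A check along these lines could flag the withdrawn release on a test node. It is only a sketch: the set of withdrawn versions contains just the one named in the release notes quoted above, the function names are illustrative, and a version number alone cannot tell whether a distribution has backported a fix.

```python
import re

# Upstream releases known to be withdrawn; 8.0.29 is the one named in the quoted release notes.
WITHDRAWN = {(8, 0, 29)}


def parse_version(text):
    """Extract the first (major, minor, patch) triple found in a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_withdrawn(text):
    """True when the reported version is one of the withdrawn upstream releases."""
    return parse_version(text) in WITHDRAWN
```

The input could be the output of `mysqld --version` or the result of `SELECT VERSION()`; both start with the upstream version number.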
I don't know if the version in use is from a particular distribution that backports patches, so maybe it is not affected, but also the latest is 8.0.35.

https://lab.civicrm.org/infra/ops/-/issues/1011
Moving services off the OSUOSL cluster
2024-01-04T22:26:43Z, bgm

The OSUOSL cluster was set up around 2014-2015, close to 10 years ago. They have been reliable and very cost-effective, thanks to OSUOSL. However, for performance and for planning the future, we should start slowly moving some services off those machines. Notably, when we had incidents upgrading some VMs, restoring MySQL backups was extremely slow.
The following VMs are on OSUOSL:
- lab.civicrm.org
- chat.civicrm.org
- www-prod.civicrm.osuosl.org (mostly docs, and some services, such as community-messages)
- latest.civicrm.org (stats/pingbacks)
- test-2.civicrm.org (jenkins node)
Less critical:
- manage.civicrm.osuosl.org (used mostly as an Ansible ProxyJump host, but some servers might still be configured to use LDAP)
Not used / mostly shut down:
- cxnapp-2
- www-cxn-2
We currently have VMs also at Linode (www-prod-2) and OVH (paella and test-3). The `test-3` server is dedicated to running tests, but paella runs these services:
- botdylan.civicrm.org (github bot),
- test-1.civicrm.org
- www-demo (for sandbox sites)
- www-test (not used, was for hosting a test site for civicrm.org)
Paella has around 230 GB of free disk space.
Internal reference: https://chat.civicrm.org/civicrm/pl/ocbbd61xsbg9xphwrjk4eq8wsw

https://lab.civicrm.org/infra/ops/-/issues/1010
extdir: upgrade to a more recent PHP version (ideally 8.0 or later)
2023-10-25T17:37:22Z, bgm

Currently runs on PHP 7.2.
civicrm.org runs on PHP 8.0.
Repo: https://lab.civicrm.org/infrastructure/extdir.git

https://lab.civicrm.org/infra/ops/-/issues/1009
Extension directory times out when uncached
2023-11-01T14:26:49Z, JonGold

Civi's Guzzle connections time out after the number of seconds specified in the `http_timeout` setting, which is 5 seconds by default.
As the extension directory has grown, the amount of time needed to generate the results of, say, `https://civicrm.org/extdir/ver=5.68.alpha1|uf=Drupal|status=stable|ready=` now exceeds 5 seconds.
This leads to timeouts when accessing the directory - difficult to troubleshoot because the next time, the cached results are returned quickly.
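
One client-side mitigation would be a longer timeout for extension directory requests only. The sketch below shows the idea in Python rather than in Civi's PHP/Guzzle code. The 30-second value, the constant names and the functions `timeout_for` and `fetch` are assumptions made for illustration, not existing CiviCRM settings or APIs.

```python
import urllib.request
from urllib.parse import urlparse

DEFAULT_TIMEOUT = 5.0   # mirrors the default `http_timeout` described above
EXTDIR_TIMEOUT = 30.0   # hypothetical value, chosen only for illustration


def timeout_for(url):
    """Pick a longer timeout for extension directory URLs, the default otherwise."""
    parsed = urlparse(url)
    if parsed.hostname == "civicrm.org" and parsed.path.startswith("/extdir/"):
        return EXTDIR_TIMEOUT
    return DEFAULT_TIMEOUT


def fetch(url):
    """Fetch a URL using the timeout selected for it."""
    with urllib.request.urlopen(url, timeout=timeout_for(url)) as response:
        return response.read()
```

Keeping the default for every other request means an unresponsive third-party host still fails fast; only the known-slow uncached directory query gets more time.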
I thought I'd raise this as an infra issue first. If reducing the time needed to generate the results isn't possible, I'll submit a PR to raise the connection timeout specifically when loading the extension directory.

https://lab.civicrm.org/infra/stats-collection/-/issues/16
Fetching data from Stack Exchange is broken
2023-10-10T15:08:37Z, bgm

Breaks after 25 pages:
```
(
[error_id] => 403
[error_message] => page above 25 requires access token or app key
[error_name] => access_denied
)
```

https://lab.civicrm.org/infra/ops/-/issues/1008
promtail: ansible config for nginx logs on Gitlab
2023-09-29T15:57:48Z, bgm

I updated the promtail config on most webservers, but need to add some adjustment in Ansible so that promtail can ingest this file on lab.c.o : `/var/opt/gitlab/nginx/logs/gitlab_error.log`
(I had mostly used a template that was designed for standard Debian servers, where nginx logs are in `/var/log/nginx/access.log`)
and latest.civicrm.org logs are missing too

Assignee: bgm

https://lab.civicrm.org/infra/ops/-/issues/1007
Move civicrm extension extraction to Gitlab Pipeline
2023-09-24T18:54:22Z, bgm

I find the current Jenkins job to update extensions on Transifex inefficient to do on a daily basis:
- it fetches all extensions in the directory (filters by those available in-app)
- looks for new releases
- extracts and updates Transifex
Eventually it does other things, like fetch Transifex translations, commit to repo, build the mo files. Those things make sense on a daily basis.
Sometimes we want to re-run the extraction on a single extension. For example, recently we had an issue with the gdpr extension, because the maintainers are mixing `vX.Y` and `X.Y` tags.
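
The mixed `vX.Y` / `X.Y` tags can be made comparable by normalizing them before looking for the newest release. This is a minimal sketch under the assumption that tags are purely numeric apart from the optional `v` prefix. How the real extraction job compares tags is not described here, and the function names are illustrative.

```python
import re


def normalize_tag(tag):
    """Turn 'v1.2' or '1.2' into (1, 2); return None for tags that are not versions."""
    match = re.fullmatch(r"v?(\d+(?:\.\d+)*)", tag.strip())
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def latest_tag(tags):
    """Return the tag with the highest version, ignoring tags that do not parse."""
    versioned = [(normalize_tag(tag), tag) for tag in tags]
    versioned = [pair for pair in versioned if pair[0] is not None]
    if not versioned:
        return None
    return max(versioned, key=lambda pair: pair[0])[1]
```

Comparing integer tuples rather than strings also avoids ordering `2.10` before `2.9`.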
I did a test to move the "update transifex" process to a Gitlab Pipeline. Personally, I like that Gitlab uses Docker to manage the environment, so it's clearer and more self-documented how the job is set up.
The proof of concept can be seen here:
https://lab.civicrm.org/dev/translation/-/blob/master/.gitlab-ci.yml
Note: the config is split in two tasks, so that for testing we can more easily run only one or the other (extract / commit). We can merge them once it's well-tested.
What's missing:
- [x] Configure the Transifex token in the Gitlab CI/CD settings of the project
- [x] Configure a Github token so that the pipeline can commit (personal token added to my `mlutfy-civicrm` account)
- [ ] Setup a Gitlab webhook so that we can trigger the pipeline for new releases
- [ ] Call the webhook when new releases are published on civicrm.org (extdir module, `modules/custom/extdir/extdir.drush.inc`)

https://lab.civicrm.org/infra/ops/-/issues/1006
chat.civicrm.org: upgrade to Ubuntu 20.04
2023-09-20T18:34:28Z, bgm

it's currently on 18.04 :grimacing:
and blocks #1005

https://lab.civicrm.org/infra/ops/-/issues/1005
Migrate all backups from rdiff-backup to borg/borgmatic
2024-01-15T01:38:58Z, bgm

Priority:
- [x] #970 lab.civicrm.org - runs Ubuntu 22.04
- [x] www-prod.civicrm.osuosl.org - runs Ubuntu 22.04
- [x] latest.civicrm.org - runs Ubuntu 20.04
- [x] chat.civicrm.org - runs Ubuntu 20.04
- [x] spark-1.civicrm.org - needs upgrade to ~~Debian Bullseye~~ (done), then Debian Bookworm
- [x] spark-2.civicrm.org - needs upgrade to ~~Debian Bullseye~~ (done), then Debian Bookworm (spark-2 was already running borg, because we back up to an EU server)
- [x] www-prod-2.civicrm.org - needs upgrade to ~~Debian Bullseye~~ (done), ~~then Debian Bookworm~~ (done)
- [x] www-prod-2: also backup to backups-1.c.o
Followed by:
- [ ] botdylan.civicrm.org - needs upgrade to Debian Bullseye, then Debian Bookworm
- [x] test.civicrm.org - needs upgrade to ~~Debian Bullseye~~ (done), then Debian Bookworm
- [ ] www-demo.civicrm.org (rdiff currently broken) - needs upgrade to Debian Bookworm
Low priority:
- [ ] backups-1.civicrm.org
- [ ] barbecue.civicrm.org
- [ ] padthai.civicrm.org
- [ ] paella.civicrm.org
- [ ] test-1.civicrm.org
- [ ] test-2.civicrm.org
- [ ] test-3.civicrm.org
These can probably be ignored:
- [x] cxnapp-2.civicrm.org (offline)
- [x] www-cxn-2.civicrm.osuosl.org (offline)
- [x] manage.civicrm.osuosl.org (not used anymore, except as a ProxyJump)
- [x] www-test.civicrm.org (offline?)
For each server:
- Verify includes/excludes
- Setup with Ansible
- Update monitoring
We have 116 GB available on sushi, 180 GB used, so presumably we will run out of space if we do them all at once, instead of waiting a bit to purge some old rdiff backups (after .. 6 months?). Although Gitlab is one of the bigger backups, and it's already cleaned up.

https://lab.civicrm.org/infra/ops/-/issues/1004
(Test Systems) Java out-of-memory leads to zombie worker
2023-09-07T20:34:42Z, totten

(Originated on MM chat: https://chat.civicrm.org/civicrm/pl/otynrhfqeintdbwqpnjjakjdia. Note that it starts out with two different problems; one rc tarball problem is cleared up quickly. This issue is about the other problem. I'm taking the observations about it and trying to compose a full hypothesis of the problem.)
Suppose you have a job like https://test.civicrm.org/job/CiviCRM-Core-Matrix-PR/4551/BKPROF=dfl,SUITES=phpunit-crm,label=bknix-tmp/console -- the job is interrupted because one of the Java based agents (Jenkins master or Jenkins agent) runs out of memory.
```
Installing build4test_qpf3m database
ok 1806 - CRM_Dedupe_MergerTest::testBatchMergeSelectedDuplicates
ok 1807 - CRM_Dedupe_MergerTest::testBatchMergeAllDuplicates
ok 1808 - CRM_Dedupe_MergerTest::testGetCidRefs
ok 1809 - CRM_Dedupe_MergerTest::testGetMatches
ok 1810 - CRM_Dedupe_MergerTest::testGetMatchesExcludeDeleted with data set #0 (true)
ok 1811 - CRM_Dedupe_MergerTest::testGetMatchesExcludeDeleted with data set #1 (false)
ok 1812 - CRM_Dedupe_MergerTest::testGetMatchesIgnoreLocationType
ok 1813 - CRM_Dedupe_MergerTest::testGetMatchesCriteriaMatched
ok 1814 - CRM_Dedupe_MergerTest::testGetMatchesCriteriaMatchedWithLimit
ok 1815 - CRM_Dedupe_MergerTest::testGetMatchesCriteriaMatchedWithSearchLimit
ok 1816 - CRM_Dedupe_MergerTest::testGetMatchesNoCriteria
ok 1817 - CRM_Dedupe_MergerTest::testGetMatchesNoCriteriaButLimit
ok 1818 - CRM_Dedupe_MergerTest::testGetMatchesCriteriaNotMatched
FATAL: command execution failed
java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
at java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:120)
at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:102)
at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
Caused: java.io.IOException: Unexpected reader termination
at hudson.remoting.SynchronousCommandTransport$ReaderThread.lambda$new$1(SynchronousCommandTransport.java:50)
at java.base/java.lang.Thread.dispatchUncaughtException(Thread.java:1997)
Caused: java.io.IOException: Backing channel 'test-4' is disconnected.
at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:215)
at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:285)
at com.sun.proxy.$Proxy74.isAlive(Unknown Source)
at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1215)
at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1207)
at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:195)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:145)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:92)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
at hudson.model.Build$BuildExecution.build(Build.java:199)
at hudson.model.Build$BuildExecution.doRun(Build.java:164)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:526)
at hudson.model.Run.execute(Run.java:1900)
at hudson.matrix.MatrixRun.run(MatrixRun.java:153)
at hudson.model.ResourceController.execute(ResourceController.java:107)
at hudson.model.Executor.run(Executor.java:449)
FATAL: Unable to delete script file /tmp/jenkins8376836365637261492.sh
java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
at java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:120)
at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:102)
at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
Caused: java.io.IOException: Unexpected reader termination
at hudson.remoting.SynchronousCommandTransport$ReaderThread.lambda$new$1(SynchronousCommandTransport.java:50)
at java.base/java.lang.Thread.dispatchUncaughtException(Thread.java:1997)
Caused: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@4a72872d:test-4": Remote call on test-4 failed. The channel is closing down or has closed down
at hudson.remoting.Channel.call(Channel.java:993)
at hudson.FilePath.act(FilePath.java:1186)
at hudson.FilePath.act(FilePath.java:1175)
at hudson.FilePath.delete(FilePath.java:1722)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:163)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:92)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
at hudson.model.Build$BuildExecution.build(Build.java:199)
at hudson.model.Build$BuildExecution.doRun(Build.java:164)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:526)
at hudson.model.Run.execute(Run.java:1900)
at hudson.matrix.MatrixRun.run(MatrixRun.java:153)
at hudson.model.ResourceController.execute(ResourceController.java:107)
at hudson.model.Executor.run(Executor.java:449)
Build step 'Execute shell' marked build as failure
ERROR: Step ‘Publish xUnit test result report’ failed: no workspace for CiviCRM-Core-Matrix-PR/BKPROF=dfl,SUITES=phpunit-crm,label=bknix-tmp #4551
Finished: FAILURE
```
Here's what happens next:
* Jenkins kills the communication channel.
* Jenkins assumes that the worker-node kills any ongoing work for the test-job.
* Jenkins establishes a new communication channel and begins running new jobs.
* **But** the worker-node did *not* kill everything. (*I'm not clear exactly what it did do -- eg if any POSIX signals were sent; eg if worker processes are running or suspended.*) For example, `mysqld` is present in the process-table, and it retains a hold on TCP port 5601.
* When Jenkins begins another job, it finds the worker-image (`/home/dispatcher/images/bknix-dfl-2.img`) is in use. In fact, all of the images are in use. So it creates a new one (`bknix-dfl-5.img`).
* When Jenkins starts using `bknix-dfl-5.img`, it tries to launch a new mysqld on TCP port 5601. But it can't; the port is conflicted. You get problems [like this](https://test.civicrm.org/job/CiviCRM-Core-Matrix-PR/4553/BKPROF=dfl,SUITES=phpunit-api4,label=bknix-tmp/console):
```
[mysql] Start daemon: mysqld --datadir="/home/homer/_bknix/ramdisk/worker-3/mysql/data"
[mysetup] Initialize folder: /home/homer/_bknix/ramdisk/worker-3/mysetup
Waiting for MySQL (maxWait=300, interval=0.5, windDown=0.5)...
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/home/homer/_bknix/ramdisk/worker-3/mysql/run/mysql.sock' (2)
```
If you have several jobs running at the moment of `OutOfMemory`, then you may wind up doing this multiple times (e.g. 3 jobs die; 3 zombies left behind; 3 new images created; 3 tcp ports blocked).
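
The port conflict described above could be detected before `mysqld` is launched. The following is a rough sketch, not the actual bknix port-allocation code: the function names and the idea of scanning upward from a base port are assumptions made for illustration.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if binding to host:port fails, i.e. something already holds the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return True
    return False


def first_free_port(start, count=100, host="127.0.0.1"):
    """Scan upward from `start` and return the first port that can be bound."""
    for port in range(start, min(start + count, 65536)):
        if not port_in_use(port, host):
            return port
    raise RuntimeError(f"no free port in range starting at {start}")
```

A check like this is inherently racy (another process can take the port between the probe and the daemon start), so it would only reduce collisions with leftover zombies, not rule them out.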
----------
> (Follow-up) *I'm not clear what the exact status is -- if any POSIX signals sent; if worker processes are running or suspended.*
What I did observe was that these zombie processes were still around ~2 hours after the original failure. But after ~2h30m, they had gone away on their own.
If the jobs were quietly executing in a headless fashion, then they should've wrapped up in <30min. So they're probably not running -- it suggests they're somehow suspended (and then some other timeout/reaper mechanism comes in after 2hr). But this is pure speculation.
---------
Brainstorming...
* Maybe figure out which java process ran out of memory -- and why.
* Maybe figure out what - if any - signals are being emitted when this OOM happens. Find a way to kill the zombies/orphans properly.
* Maybe introduce some firmer timeouts within the jobs. (We usually rely on Jenkins to timeout jobs; but obviously that doesn't work here. Try sprinkling `timeout` calls into `CiviCRM-Core-Matrix-PR.job` or similar leverage-point.)
* Maybe change the port-allocation function.
* Maybe enable network-namespaces for any non-interactive jobs.

Assignee: totten

https://lab.civicrm.org/infra/ops/-/issues/1003
Investigate SPF "Soft fail"
2023-09-07T01:33:13Z, totten

I got one of the emails from the `civicrm.org` security announcements. In Gmail, there's an option "Show original" which has an interesting report. The message makes it to my inbox (perhaps because of the history; perhaps because of the DKIM), but it shows a failure about SPF.
@bgm I'm not sure how we're routing mail right now. Perhaps we need some DNS tweak?
> ![Screen_Shot_2023-09-06_at_6.24.14_PM](/uploads/156f371c5e9e404b73c971535be156cf/Screen_Shot_2023-09-06_at_6.24.14_PM.png)

https://lab.civicrm.org/infra/ops/-/issues/1002
Migrate/integrate download.civicrm.org with civicrm.org
2023-09-01T10:02:50Z, totten

In some side discussion about #1001, @colemanw and @bgm suggested migrating or integrating `download.civicrm.org` with `civicrm.org`. I wanted to record an issue to capture this.
How:
* Add a D9/D10 module on `civicrm.org` which either:
1. Migrates the PHP logic from the `download.civicrm.org`, or
2. Forwards HTTP sub-requests to `download.civicrm.org`.
Upshots:
* This lets you inherit the navigation, site-wide theming, and analytics.
* It's written with Symfony page-controllers and Twig, which are also supported by D9/D10.
* It doesn't have any interdependencies on Drupal content ("nodes" and "files"), so it should be fairly easy to install/maintain such a module on a local dev-site.
There are a few things to bear in mind:
* `download.civicrm.org` has a few areas of functionality: autobuild info (eg https://download.civicrm.org/latest/), redirects (eg `https://download.civicrm.org/civicrm-X.Y.Z-foo.tar.gz`), and release info (eg https://download.civicrm.org/about/). Each has a few subpages/features.
* Its basic purpose is to list/filter/cache information about the available builds (from Google Cloud Storage). It blends in some additional data from (1) release-notes in Github and (2) JSON files provided by each build.
* From the POV of a general reader on `civicrm.org`, some functionality (like "inspecting the git input used by a candidate build") is niche. But it's still useful for release-management. Migrating/integrating means you may have to reconcile more opinions about what to present.
* It's not currently designed around composable/mixable `block`s. It's just a couple HTML pages. But in Drupal, in the long-run, it probably makes sense to do more of the `block` stuff.
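
To make the redirect feature mentioned above a bit more concrete, here is a small sketch of splitting a `civicrm-X.Y.Z-foo.tar.gz` name into its version and flavor. It assumes the pattern is exactly as written in that example. The regex and the function name are illustrative and not taken from the `download.civicrm.org` code.

```python
import re

# Assumed shape: civicrm-<X.Y.Z>-<flavor>.tar.gz, following the example in this issue.
TARBALL = re.compile(
    r"^civicrm-(?P<version>\d+\.\d+\.\d+)-(?P<flavor>[A-Za-z0-9_.-]+)\.tar\.gz$"
)


def parse_tarball(name):
    """Return (version, flavor) for a matching file name, or None otherwise."""
    match = TARBALL.match(name)
    if match is None:
        return None
    return match.group("version"), match.group("flavor")
```

A page-controller doing the redirect would still need to map the parsed pair to the actual storage location, which is outside the scope of this sketch.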