ops issues
https://lab.civicrm.org/infra/ops/-/issues
2020-04-03T01:38:14Z

https://lab.civicrm.org/infra/ops/-/issues/931
PR testing should validate ts() strings
2020-04-03T01:38:14Z
bgm

PR review sometimes catches incorrect uses of `ts`, but not always. The gettext extraction scripts are pretty good at catching many invalid use-cases.
It would be nice to run string extraction on pull requests. Technically, it should be a quick check to add, similar to checking code syntax.
One issue is that the string extractor systematically throws 20-25 errors. Some are issues that are annoying to fix; others are incorrect uses of `ts` that are difficult to work around.
It would be really useful to have a way to flag tolerated or known issues, so that we can at least start applying some checks moving forward.
Worst case, it could be a list of regexes of code to ignore?
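The "list of regexes to ignore" idea could be sketched roughly as follows. This is only an illustration under assumptions: the error-line format, the entries in `KNOWN_ISSUE_PATTERNS`, the file paths, and the function name `filter_known_issues` are all hypothetical and not taken from the actual extraction scripts.

```python
import re

# Hypothetical allowlist: each regex matches an extractor error that is
# tolerated for now. The paths and messages here are made up.
KNOWN_ISSUE_PATTERNS = [
    r"^CRM/Legacy/.*: ts\(\) called with a non-literal string",
    r"^templates/CRM/Old/.*\.tpl:",
]


def filter_known_issues(error_lines, patterns=KNOWN_ISSUE_PATTERNS):
    """Return only the error lines that match none of the tolerated patterns."""
    compiled = [re.compile(p) for p in patterns]
    return [
        line
        for line in error_lines
        if not any(rx.search(line) for rx in compiled)
    ]
```

A PR check built on this would fail only when `filter_known_issues(...)` returns a non-empty list, so the known errors stay tolerated while new ones are caught.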
cc @seamuslee @totten @daved

https://lab.civicrm.org/infra/ops/-/issues/904
Proactively restart mysqld on test nodes
2019-07-10T01:43:11Z
totten

__Issue__: MySQL periodically crashes, and we have not been able to find a concrete reason in the logs. It appears to happen most on `test-1` (which also handles the most test runs).
__Proposed Intervention__: Periodically, proactively restart mysqld.
You could easily add a Jenkins job that just restarts the daemon; however, the challenge is that some mix of concurrent jobs may be actively using mysqld. You need to wait for (or create) an opportunity to restart the daemon.
[flock](https://linux.die.net/man/1/flock) seems like it might do the job, as in:
1. Pick a naming convention for a lock file (e.g. `~/bknix-dfl/var/mysql-admin-lock`)
2. At the start of every test job (`CiviCRM-Core-PR`, `CiviCRM-Core-Matrix`, etc.), wrap all the work in a call to `flock` which acquires a *shared/read lock*.
3. In some cleanup job (e.g. `CiviCRM-PR-Cleanup`), wrap the `mysqld restart` work in an *exclusive/write lock*.
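The shared/exclusive scheme in the steps above can be sketched with Python's `fcntl.flock`, which on Linux uses the same `flock(2)` mechanism as the `flock` command. This is a minimal sketch, not the actual job configuration: the helper names (`mysql_admin_lock`, `run_test_job`, `run_restart_job`) are hypothetical, and the restart command itself is passed in as a callable because the issue does not specify it.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def mysql_admin_lock(lock_path, exclusive=False, blocking=True):
    """Hold a shared (test job) or exclusive (restart job) lock on lock_path."""
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    if not blocking:
        mode |= fcntl.LOCK_NB  # raise BlockingIOError instead of waiting
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, mode)
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def run_test_job(lock_path, work):
    # Many test jobs may hold the shared lock at the same time.
    with mysql_admin_lock(lock_path):
        return work()


def run_restart_job(lock_path, restart):
    # Waits until no test job holds the shared lock; while the exclusive
    # lock is held, newly started test jobs wait in turn.
    with mysql_admin_lock(lock_path, exclusive=True):
        return restart()
```

With a lock file such as `~/bknix-dfl/var/mysql-admin-lock`, each test job would call `run_test_job` and the cleanup job would call `run_restart_job`. One design caveat: a steady stream of overlapping test jobs could keep the exclusive lock waiting for a long time.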
Alternatively, https://plugins.jenkins.io/build-blocker-plugin might do the job.

https://lab.civicrm.org/infra/ops/-/issues/855
c-i: Add in monitoring of MySQL Queries
2018-10-16T01:59:44Z
seamuslee

@bgm @totten
I think we should put in some monitoring of long queries on the 3 MySQL instances if possible, and probably alert if a query runs for more than 300s. I would find it strange for a query in any of our test jobs to run longer than 300s.

https://lab.civicrm.org/infra/ops/-/issues/843
padthai: faulty disk is causing performance issues
2019-01-03T17:08:54Z
bgm

Current status:
* [x] Replace all 3 disks in padthai.c.o
* [x] Re-install padthai OS from scratch
* [x] Configure ~~test-ubu1204-5.c.o~~ test-1.c.o
* [x] (moved to #863) Configure ~~test-ubu1604-1.c.o~~ test-2.c.o
* [x] Restore botdylan.c.o
* [x] Restore www-test.c.o
* [ ] Ensure that backups are correctly configured on padthai/test-1/test-2, and that relevant files are backed up (i.e. anything other than `/etc`).
Initial ticket: What seems to be a faulty disk is causing performance issues.
```
root@padthai:~# zpool status
  pool: zpadthai
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zpadthai    ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sda5    ONLINE       0     0     0
            sdb5    ONLINE       0     0     0
            sdc5    ONLINE       3     0     0

errors: No known data errors
```
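For spotting this kind of problem automatically, the device table in `zpool status` output can be scanned for non-zero error counters. The following is a rough sketch that assumes the column layout shown above (`NAME STATE READ WRITE CKSUM`); the function name `devices_with_errors` is hypothetical, and counters are kept as strings because `zpool` may abbreviate large values.

```python
def devices_with_errors(zpool_status_text):
    """Return {device: (read, write, cksum)} for rows with a non-zero counter."""
    flagged = {}
    in_config = False
    for line in zpool_status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # rows after this header list pool devices
            continue
        if line.strip().startswith("errors:"):
            in_config = False  # the device table ends before this summary
            continue
        if not in_config or len(fields) < 5:
            continue
        name, counters = fields[0], tuple(fields[2:5])
        if any(c != "0" for c in counters):
            flagged[name] = counters
    return flagged
```

Applied to the output above, this would flag only `sdc5`, the device with 3 read errors.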
```
# dmesg | grep sd
[54960868.851963] sd 0:0:2:0: [sdc] tag#3 CDB: Read(10) 28 00 1d 3a f4 90 00 00 f8 00
[54960868.859534] sd 0:0:2:0: [sdc] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[54960868.860477] sd 0:0:2:0: [sdc] tag#3 Sense Key : Medium Error [current]
[54960868.861381] sd 0:0:2:0: [sdc] tag#3 Add. Sense: Unrecovered read error
[54960868.862272] sd 0:0:2:0: [sdc] tag#3 CDB: Read(10) 28 00 1d 3a f4 90 00 00 f8 00
[54960868.863147] blk_update_request: critical medium error, dev sdc, sector 490402960
[54962799.895947] sd 0:0:2:0: [sdc] tag#1 CDB: Read(10) 28 00 0b 6a ec 40 00 00 18 00
[54962799.903136] sd 0:0:2:0: [sdc] tag#0 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[54962799.904018] sd 0:0:2:0: [sdc] tag#0 CDB: Read(10) 28 00 0b 6a f3 90 00 00 08 00
[54962799.904884] blk_update_request: I/O error, dev sdc, sector 191558544
[54962799.905730] sd 0:0:2:0: [sdc] tag#2 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[54962799.906568] sd 0:0:2:0: [sdc] tag#2 CDB: Read(10) 28 00 0b 6a f9 50 00 00 10 00
[54962799.907392] blk_update_request: I/O error, dev sdc, sector 191560016
[54962799.908224] sd 0:0:2:0: [sdc] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[54962799.909032] sd 0:0:2:0: [sdc] tag#1 Sense Key : Medium Error [current]
[54962799.909828] sd 0:0:2:0: [sdc] tag#1 Add. Sense: Unrecovered read error
[54962799.910607] sd 0:0:2:0: [sdc] tag#1 CDB: Read(10) 28 00 0b 6a ec 40 00 00 18 00
[54962799.911373] blk_update_request: critical medium error, dev sdc, sector 191556672
```