# Product-maintenance issues

Source: https://lab.civicrm.org/eileen/Product-maintenance/-/issues (exported 2019-04-25)

# Changing LEXIM to LEXI2M - from one month to two month cycle

Issue: https://lab.civicrm.org/eileen/Product-maintenance/-/issues/22 | Author: dtarrant | Updated: 2019-04-25

Moving this from the civi-dev list. A response to discussion there about the work required for monthly releases and mid-month security releases when necessary.
This topic, and the referenced additional effort for the core team to manage the cycle of releases with mid-month security releases, etc., gives me the opportunity to request discussion and consideration of the monthly release cycle.
I will posit that LEXIM is based on a somewhat arbitrary time cycle. It moved Civi from a much longer and variable cycle time (six months or so) to a monthly cycle, for all the reasons identified when LEXIM was implemented.
My question is whether, on balance, all (including core team) might be better served by say a 2 month cycle - "LEXI2M"?
The obvious trade-off is the delay in incorporating new features, bug fixes, etc. against the core team and other volunteer time needed to turn around each revision, including time for creating and testing new versions by all.
Is it time to revisit the timing of this cycle? Just askin'
Comments? (Should this conversation be held elsewhere?)

# Build list of people with knowledge in particular areas

Issue: https://lab.civicrm.org/eileen/Product-maintenance/-/issues/20 | Author: eileen | Updated: 2018-10-11

We should build a more widely shared list of people with skills / code interests in different areas, for the purpose of broadening the list of who to ask for input on gitlab issues / github PRs.
Deduping
- Eileen
- John Kingsnorth
- Deepak
Civi-SMS
- Andy Burns
- John Kingsnorth
Angular
- Tim
- Coleman
- Seamus
Joomla
- Nicw / vingle
- Brian Shaunessey
PCPs
- Karin G

# Regression analysis - 5.4 (fixed 5.5.2) CiviCase bug on new installs

Issue: https://lab.civicrm.org/eileen/Product-maintenance/-/issues/19 | Author: eileen | Updated: 2018-09-22

**What happened**
An enhancement was made to CiviCase that required a new option value. The option value was added in the upgrade script but was missed for new installs.
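This failure mode generalises: seed data defined in two places (the fresh-install baseline and the upgrade step) can drift apart. A minimal sketch in Python, with entirely hypothetical names and values (this is not CiviCRM's actual install code):

```python
# Two code paths that must stay in sync: fresh installs seed data from a
# baseline list, while existing sites receive the same data via upgrade steps.

BASELINE_OPTION_VALUES = ["Open Case", "Follow up"]  # fresh-install seed

UPGRADE_STEPS = {
    # version -> option values added in that upgrade script
    "5.4": ["Case Task"],  # added here but NOT added to the baseline above
}

def options_after_fresh_install():
    return list(BASELINE_OPTION_VALUES)

def options_after_upgrade(from_values):
    values = list(from_values)
    for step in UPGRADE_STEPS.values():
        values.extend(step)
    return values

# An upgraded site has the new option value...
assert "Case Task" in options_after_upgrade(BASELINE_OPTION_VALUES)
# ...but a fresh install of the same version does not: the regression above.
assert "Case Task" not in options_after_fresh_install()
```

A test asserting that a fresh install and a fully-upgraded install produce identical seed data would catch this class of bug automatically.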
**How & When was it addressed**
It was identified on Stack Exchange but, as it only affected new installs, it was not immediately understood to be an issue. It was fixed quickly once confirmed.
**How did the regression fit with our processes?**
The PR had been reviewed, but this was missed.
**Recommendations going forwards**
I think in this case we need to learn from the specific nature of the issue.

# Regression analysis - 5.5 (fixed 5.5.1) broken smart group on permissioned relationship

Issue: https://lab.civicrm.org/eileen/Product-maintenance/-/issues/18 | Author: eileen | Updated: 2018-09-22

**What happened**
An enhancement was made to CiviCRM to allow relationship permissions to be no/read only/edit instead of just yes/no. This was fixed in the search form but resulted in a breakage in pre-existing smart groups.
**How & When was it addressed**
It was found immediately after 5.5.0 was released (by Noah Miller) and fixed within 24 hours.
**How did the regression fit with our processes?**
The PR had been reviewed by multiple reviewers, but this was not something anyone thought to test.
**Recommendations going forwards**
I think in this case we need to learn from the specific nature of the issue - i.e. changes to search forms can affect smart groups - and ensure developers are aware of this instance and bear it in mind.
# Delay Analysis (5.5.0)

Issue: https://lab.civicrm.org/eileen/Product-maintenance/-/issues/17 | Author: totten | Updated: 2018-09-23

The 5.5.0 release was scheduled for publication on Sept 5, 2018, but the release was delayed until Sept 11, 2018. This issue aims to report back on the delays -- and to discuss improvements/mitigations going forward.
## Key details
* Publication is heavily dependent on automation -- in a typical release cycle, we rely on automated systems to prepare draft tarballs; to run the main test suite in multiple environments; to run combinatorial tests with those environments and various extensions; to prepare demo sites where we can try out the installation routine from a user-level perspective; to categorize pending PRs; to test PRs; etc. These systems provide important signals that we always check during publication. Automating them makes the release process faster. In the happiest case, automation means that we don't work as hard on publication -- but often it just means we have capacity left to investigate false-negatives and other late-stage issues (e.g. regressions reported during RC period).
* Most of the automated tasks were run through a distributed job manager (Jenkins), but they were coupled to particular servers. For example:
* Nightly tests of Civi ran against a minimal environment (eg PHP 5.5, MySQL 5.5) on the VM `test-ubu1204-1` (on a physical server-cluster named `ganeti`).
* Nightly tests of Civi ran against a newer environment (eg PHP 7.0, MySQL 5.7) on the VM `test-ubu1604-1` (on a physical server named `padthai`).
* PR tests and tarball builds ran on an in-between environment (e.g. PHP 5.5 and MySQL 5.7) on the VM `test-ubu1204-5` (on a physical server named `padthai`).
* Bots dealing with Git/Github administration ran on the VM `botdylan` (on a physical host `padthai`).
* Initially, the servers were scoped to be minimalist instances of the stock Ubuntu configurations. These particular VMs were generally designed to be ephemeral/replaceable, and they generate a tremendous amount of short-term data, so very little backup was provided. Some important parts of the configuration were put into automated form (ansible/bash) and some parts were handled manually. Of course, you'll be unsurprised to hear that (over time) the untracked manual parts grew. The conceptual goal of tracking stock was abandoned; OS's were upgraded; PPAs and other non-standard binaries were added; etc. Each step individually represented a calculated cost-benefit -- but the overall result was an accumulation of technical debt (in the form of manual configuration).
* During the week before the release, the operations took two hits: firstly, I took a few days vacation before release day (which reduced our admin capacity). Concurrently, the physical server (`padthai`) had a RAID failure that took down several VMs (`test-ubu1204-5`, `test-ubu1604-1`, `botdylan`, `www-demo`). bgm worked with the hosting provider to get the base `padthai` back online during the days before release, but there was too much debt to get everything back quickly. Net result: when I returned on release day, our QA signals were offline, and we had a long way to go in restoring them. It ultimately took us several more days' work to get the release-critical parts back.
* Addressing this was a team effort -- with Mathieu working on the base system ops and `www-test` while Seamus investigated test-failures and compatibility problems between the code & the new environment. I worked on the setup scripts/jobs for worker-nodes and eventually the regular release workflow. And Eileen offered her sage support throughout. I can't speak for how much time went in overall -- but, for my part, this added a few 10-hour days to the process.
## Analysis
* Was it *completely necessary* to delay and to pay those costs? Consider the responses we could take on release day:
* Ship while blind (without our normal QA signals). This would meet the announced schedule, but it could blow up in our face if an obvious problem made it through.
  * Fall back to doing more release tasks by hand. From a CYA perspective, this would be safer than shipping blind; and it would have provided a more timely release date. But the labor involved would still be nontrivial, and we'd be no better off when the next release (5.5.1 or 5.6.0) came along.
* Re-commit to automating the infrastructure. This would delay 5.5.0 longer, but it would leave us in a better position for publishing the next release and for making the test infra more robust. This is ultimately the path I went down for a couple reasons:
* The x.x.0 release is not a security-release or a regression patch-release. Letting the schedule slip a few days means missing a benchmark -- but it's not gonna kill anyone.
* Once x.x.0 goes out, there's often a follow-up x.x.1, but it's hard to say when it'll happen. I *really* don't want to be in a position where we have a regression while our QA signals are all dark. This turned out to be a fortunate calculation -- because we had a 5.5.1 regression-fix on *the very next day*. (Cheers to Noah!) The regression release went out smoothly.
* I find myself somewhat ambivalent about the accumulation of technical debt which produced this situation. IIRC, we discussed the manual quirks in real-time but didn't see an affordable alternative in the moment. So, naturally, we acquired a little technical debt that ultimately had to be re-paid under-the-gun. I suspect it's a familiar story for anyone reading this.
## Remediation
The upshot of the delay is that we were forced to pay down some technical debt -- i.e. improving the scripted definitions of the worker-nodes. The revised definitions:
* Include multiple versions of PHP and MySQL *on the same host*. Instead of splitting jobs among three qualitatively different VMs (`test-ubu1204-1`, `test-ubu1204-5`, `test-ubu1604-1`) to handle different binaries/versions, one VM (`test-1`) can handle different mixes of PHP/MySQL. This means that nodes are more interchangeable, and it's easier (than before) to reproduce/scale-up/scale-down.
* Can be installed on OS X and most Linux variants with a couple short commands without interfering with the main OS services/packages. (For the most part, the scripts originate as my local dev env -- where they needed to coexist with other stacks.)
See also: https://github.com/civicrm/civicrm-infra/blob/master/continuous-integration/worker-gen-3.md

# Regression Analysis from 5.2, Sending SMSes from the contact actions failed to send

Issue: https://lab.civicrm.org/eileen/Product-maintenance/-/issues/16 | Author: seamuslee | Updated: 2018-08-28

**What happened?**
A refactor of the SMS sending code ([PR](https://github.com/civicrm/civicrm-core/pull/10946), [CRM-21037](https://issues.civicrm.org/jira/browse/CRM-21037)) caused one-off SMSes to fail to send when the 'send an SMS' action was selected while viewing a contact record.
**How & When was it addressed**
The bug was introduced when a variable (in this case `$doNotSms`) was changed from an empty string to TRUE. The variable was used in an if statement, and previously the empty string did not resolve to TRUE. However, other uses of the variable before the if suggested it was meant to be a boolean. It was [identified](https://lab.civicrm.org/dev/core/issues/273) around the time 5.3.1 was released, even though the bug was introduced in 5.2.0. It will be addressed in either 5.5 or 5.4.1.
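The truthiness trap described above is easy to reproduce. A sketch in Python rather than CiviCRM's PHP (the function is a hypothetical stand-in for the real sending logic, but the conditional mirrors the one described):

```python
def send_sms(do_not_sms):
    """Return True if the SMS would be sent (hypothetical stand-in)."""
    if do_not_sms:    # the guarding condition from the report
        return False  # sending suppressed
    return True       # sent

# Before the refactor: the flag defaulted to an empty string, which is
# falsy, so one-off sends went through.
assert send_sms("") is True

# After the refactor the default became True ("it looked like a boolean"),
# silently flipping the condition and suppressing every one-off send.
assert send_sms(True) is False
```

The type change looked like a harmless cleanup, which is exactly why a unit test pinning the behaviour of the default value would have been valuable.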
**How did the regression fit with our processes?**
It fitted into our workflows because this was some code cleanup done in an effort to add unit tests to better guard against such regressions occurring in our system.
**Recommendations going forwards**
We need to be more careful when handling refactors. A refactor is the one time when it's easy to miss something or not fully understand all the implications of a change. In this case it was making the variable into what seemed to be a boolean that caused the issue. It should also be noted that when doing refactors, if the code becomes testable, tests should be added to cover the majority of scenarios. In this case there were two missing scenarios, including when a phone_type_id is passed in as part of the contact details, which is what happens when you do a single SMS.

# Regression analysis 5.3.1 (and 4.6.38?) (fixed 5.3.2) Wordpress merge screen urls broken

Issue: https://lab.civicrm.org/eileen/Product-maintenance/-/issues/15 | Author: eileen | Updated: 2018-08-02

**What happened?**
[A security fix](https://github.com/civicrm/civicrm-core/commit/095b12d6#diff-68d9e076998c7fbe8c7e6a89c64322e5R52) to prevent cross-site scripting attacks was overzealous and URL-encoded already-encoded URLs. This still worked on Drupal, but not on Joomla! & WordPress.
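Double-encoding of this kind is straightforward to demonstrate. A sketch using Python's standard library (the URL is a made-up example, not the actual merge-screen URL):

```python
from urllib.parse import quote, unquote

original = "/civicrm/contact/merge?reset=1&cid=10&oid=11"
encoded = quote(original, safe="")  # first (correct) encoding
double = quote(encoded, safe="")    # overzealous second pass re-encodes the '%' signs

# A single decode of the double-encoded value does NOT recover the original...
assert unquote(double) == encoded
assert unquote(double) != original
# ...it takes two decodes, which a CMS routing layer won't normally perform.
assert unquote(unquote(double)) == original
```

This also hints at why the breakage was CMS-specific: whether the link still works depends on how many decode passes each CMS's routing layer applies.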
**How & When was it addressed **
The bug was released in 5.3.1. It was [identified a week later](https://lab.civicrm.org/dev/core/issues/279) and a release with the fix was issued within 24 hours.
**How did the regression fit with our processes?**
Security fixes are notoriously risky as they affect all sorts of flows that are hard to test and as of this month they are no longer going through the rc process. It's hard to see how we can avoid errors that affect some flows without markedly more resources. A quick fix to any issues is probably the correct and sensible process.
**Recommendations going forwards**
Our goal is to move to greater & greater precautionary escaping, & preferably to opt-out escaping in Smarty (although possibly not until the 4.6 LTS has expired at the end of the year, in order to keep this work manageable). Ensuring ALL urls are handled the same at the php layer will help substantially here. We should deprecate passing $queryParams as a string to the CRM_System_Url function. We should also come up with a tpl marker like `escape|none` to indicate where we know we have done this.

Milestone: 5.3.2

# Regression analysis 5.3.0 (fixed 5.3.2) - Duplicate Contact Checking with ajax off

Issue: https://lab.civicrm.org/eileen/Product-maintenance/-/issues/14 | Author: eileen | Updated: 2018-07-26

**What happened?**
An [improvement to the automatic Duplicate Checking feature](https://github.com/civicrm/civicrm-core/pull/10341) of the “New Contact” form gave better results when entering data, but caused the “Check for Matching Contacts” button to stop working when manually clicked.
**How & When was it addressed**
The bug was released in 5.3. It was fixed in [5.3.2 and 5.4.0](https://github.com/civicrm/civicrm-core/pull/12552), by the original author, about 24 hours after being reported on [Stack Exchange](https://civicrm.stackexchange.com/questions/25834/bug-with-check-for-matching-contacts-button) by Robin Fenwick. Robin and Seamus Lee both diligently followed up to test and merge the fix. 5.3.2 was issued within 72 hours of the bug being found.
**How did the regression fit with our processes?**
The original improvement was tested by a number of people. However, they did not test all workflows. Our focus was on getting the new feature working correctly and there were a number of hurdles, notably that the default “Supervised” deduping rule is not well-suited for checking matches during data entry. This is also an old area of the codebase which does not contain any unit tests.
The additional (less used & less tested) workflow involved a setting in non-default mode. Developers are often keen to introduce settings as a way to get something into core that not everyone will want, but settings reduce the robustness of our product maintenance.
The review time taken from raising the review to getting it merged into core was very long on the original PR - this generally increases risk.
**Recommendations going forwards**
1. Identify when a PR touches code that does not have test coverage, and require extra vigilance in QA, either in the form of new tests, or in complete manual testing.
2. In this case, writing a unit test was not practical because there is no testing framework in place for the CiviCRM front-end. This deficit still needs to be addressed at some point but was out-of-scope for such a small change.
3. Try to prevent more setting coming into core as they create more complexity and often are hard to test.
4. Keep the PR queue manageable (including closing non-reviewable PRs) to reduce drawn-out PR work, and adopt a 'finishing oriented' approach - i.e. try to finish more work & start less.

Milestone: 5.3.2

# Regression analysis 5.2.0 (fixed 5.3.2) Affects front-end event registrations with personal campaign pages configured

Issue: https://lab.civicrm.org/eileen/Product-maintenance/-/issues/13 | Author: eileen | Updated: 2018-07-27

**What happened**
A [bug fix submitted by reviewers](https://github.com/civicrm/civicrm-core/pull/12056) to replace [a patch submitted that did not meet our coding expectations](https://github.com/civicrm/civicrm-core/pull/12044) introduced a fatal error when submitting event pages configured with PCPs. The original submitter had chosen not to test the final patch, and neither Monish nor I were very familiar with the code area; it appears we both tested (did r-run) the contribution page flow & missed the events flow, not picking it up as a separate flow.
**How and when was it addressed.**
The bug was released in 5.2.0. It was not detected until 5.3.1 was released 6 weeks later (by the submitter of the original patch). Fixes will be in 5.3.2 and in 5.4, within 4 days of [the bug report](https://lab.civicrm.org/dev/core/issues/272). (The patch was written by Monish, tested by Karin G and merged within 24-48 hours, but as the bug was not detected until 6 weeks after it was released we did not rush the release out.)
**How did the regression fit with our processes?**
We pursued the bug because it fit a broad definition of critical - ie. any fatal error. However, after dealing with this particular bug I altered the definition of critical, [at least as it affects the product maintenance team](https://lab.civicrm.org/eileen/Product-maintenance), to only consider long-standing fatal bugs as critical if the reporter was also engaged in following through to completion. So, our process was already somewhat changed in that had the same thing occurred now we would not have prioritised fixing the original bug without the reporter agreeing to test.
**Context**
A [patch was submitted](https://github.com/civicrm/civicrm-core/pull/12044) to fix an error hit when the data entered for the note on a PCP field exceeded 255 characters. This was presented as a critical bug although it was long-standing (at least 5 years) and we were not aware of other reports.
Monish and I provided feedback requesting the fix be changed to match our preferred approach (we have been adopting addField functionality when we have to update fields for things like this since around 4.5, and wherever possible we ask for metadata-based approaches to fixes as a general rule). The submitter indicated they were not going to make the changes and advised “If this does not meet standards feel free to close;”
As the bug was critical by our definition at the time, and I had already sunk in work to try to make the fix easier for the submitter by doing [a metadata change](https://github.com/civicrm/civicrm-core/pull/12046), I proceeded to [complete the bug fix and put up a PR](https://github.com/civicrm/civicrm-core/pull/12056). I then asked the submitter to test it. She advised she was not going to. I gave a lot of consideration at the time to closing out the fix unmerged despite the sunk time, but then Monish stepped up & tested and merged it. Neither of us was familiar with the flow and it appears we both focussed on the Contribution Page flow, not the event page flow.
**Recommendations going forwards**
1. I think the change already made (not treating long-standing bugs as critical unless the reporter is engaged in the full process) would have prevented this. (We would have closed the PR & tracked the issue in gitlab until such time as someone raised a review-ready PR.)
1. We should consider changing our terminology from ‘pull request’ to ‘review request’ to make it clearer to people that they are requesting someone spend time reviewing their ideas, code or proposals.
1. We should put together more documentation explaining the ways in which we are working to improve our code base so we have something to point people to. (This is already a lower priority item on my very long to-do list).
1. We should find out who is familiar enough with using PCPs (and willing) to test future PCP-related patches (and to validate bug reports).

Milestone: 5.3.2

# Regression analysis 5.2.0 (fixed 5.2.2)

Issue: https://lab.civicrm.org/eileen/Product-maintenance/-/issues/12 | Author: eileen | Updated: 2018-07-26

**What happened?**
A [pro-bono bug fix](https://lab.civicrm.org/dev/core/issues/60) to help Webform users add payments to events resulted in part of the back-office event registration screen not loading for Wordpress users.
**How & When was it addressed**
Patch release 5.2.2 was released within 48 hours of the bug being logged in gitlab. Pradeep gets credit for analysing the bug and submitting the fix after doing some pretty careful analysis of a confusing Stack Exchange report. Kevin tested the fix for us (very promptly).
**How did the regression fit with our processes?**
The original fix was tested by multiple people and it appeared to work. All those people were on drupal but there was nothing about the fix that implied there would be any difference for drupal vs another CMS. However, it turned out that drupal was in some way [‘swallowing’ the errors’](https://github.com/civicrm/civicrm-core/pull/12334).
The bug was fixed within 48 hours of being logged in github. It was asked [first in stack exchange](https://civicrm.stackexchange.com/questions/25394/event-fees-not-loading-when-registering-in-back-end) and Pradeep picked it up and submitted a fix. There was an earlier confusing question [logged in stackexchange](https://civicrm.stackexchange.com/questions/25376/cant-submit-credit-card-payment-after-update-to-5-2-1-joomla) which was probably the same bug.
**Context**
This bug was registered first on stack exchange - but there is considerable confusion (still) on how tightly [the stack exchange issue](https://civicrm.stackexchange.com/questions/25376/cant-submit-credit-card-payment-after-update-to-5-2-1-joomla) maps to the bug. A lot of credit is due to Pradeep for determining that there was a bug in the description there and figuring out how to replicate it. I looked briefly at that issue as it had 5.2.1 in the description but did not figure out that there was a replicable bug in there.
**Recommendations going forwards**
1. Figure out how to make our buildkit drupal environments less forgiving. The goal should be that dev environments are the least forgiving environments and production the most, and this analysis identifies an instance where that was not the case.
2. Think about how we can more quickly identify stack exchange issues. In this case it was Pradeep’s effort that got us there. (We can’t realistically stretch the existing product maintenance team to spend more time on Stack Exchange).
**Update**
Buildkit now has stricter error checking for drupal by default.

# Regression analysis 5.2.0 (fixed 5.2.1)

Issue: https://lab.civicrm.org/eileen/Product-maintenance/-/issues/11 | Author: eileen | Updated: 2018-07-26

**What happened?**
A patch to make the Contribution Detail report compliant with FULL_GROUP_BY mysql mode caused the display of one field (soft_credits) to break.
https://lab.civicrm.org/dev/core/issues/170
**How & When was it addressed**
Patch release 5.2.1 released within 48 hours.
**How did the regression fit with our processes?**
The issue was handled in accordance with our processes at the time, and it's not obvious that someone should have thought to test that field. However, shortly after that fix was made we stopped changing working queries to comply with FULL_GROUP_BY mode (due to a history of regressions) and started to bypass FULL_GROUP_BY where it was known not to work. This & last month's issue should be the tail end of the FULL_GROUP_BY issues.
This report has known testability problems and we started to address that last month & with the fix for this regression we have a path to ensure any further fixes on this (known to be problematic) report get tests.
**Context**
Unit testing on reports is relatively new and coverage is relatively poor; for some reports, issues with how they are constructed have been a blocker. This report is notable on that front and would have to be described as fairly toxic. In this context I am aiming for a fix that is well tested and solves the unit-testing issue going into master: https://github.com/civicrm/civicrm-core/pull/12281
**Recommendations going forwards**
Realistically, the steps we have already taken - i.e. getting this known-to-be-an-issue report under tests and being less aggressive on FULL_GROUP_BY fixes.

# Regression analysis - 5.1.0 (fixed 5.1.1)

Issue: https://lab.civicrm.org/eileen/Product-maintenance/-/issues/10 | Author: eileen | Updated: 2018-07-26

**What happened?**
A patch release was created for 5.1 (5.1.1) that included a fix for a regression relating to FULL_GROUP_BY mode for mysql. The fix did not include a pre-requisite patch that was already in the rc and in master, and hence it caused a more serious, wide-ranging regression.
**How & When was it addressed**
The new regression was fixed & a new patch release was issued within 24 hours.
**How did the regression fit with our processes?**
Our processes say that we should issue patch releases to address any regressions that have occurred in the current release (so that we are raising the bar each release). Technically the regression we were trying to fix was actually from the prior release (5.0) and would not meet that bar. The regression was tested by the submitter but there was confusion about the combination of patches tested, ultimately leading to the belief that the one released had been confirmed when it hadn’t (https://lab.civicrm.org/dev/core/issues/106#note_4507)
**Context**
This happened on the second release on the 5.x series. We are definitely still bedding in processes for the new patch release system and communications / timeframes & expectations are still in flux. We have a bottleneck in Tim for the actual releasing of them & it’s hard to know when he will allocate time for what could be described as a ‘regular patch release’.
We put out a patch for a fix that was not actually a 5.1 regression in part because there is a specific intent to get the ‘current 5.x version’ to being really solid & hence we have been very sensitive to regressions, which might have made us jump the gun. We had also had a user complain fairly vigorously that there was a bug fixed in 5.1 that was not released as a patch release on 5.0.
The fix itself was a FULL_GROUP_BY fix (support for recent MySQL & MariaDB releases) - an area in which we have had a lot of whack-a-mole, and we feel we have a viable approach to this technical problem now (of which this patch is part). Basically the plan is:

* We have enabled FULL_GROUP_BY mode on the CI box.
* We are now disabling FULL_GROUP_BY mode for specific problematic queries as a first line of defence.
* We are trying to fix the queries to comply with the FULL_GROUP_BY standard only in the context of writing unit tests, & are trying not to do further fixes on the query builder for them due to its complexity & brittleness.
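For context on what the mode enforces: ONLY_FULL_GROUP_BY rejects selected columns that are neither aggregated nor listed in GROUP BY (real MySQL also accepts columns functionally dependent on the grouped ones). A simplified toy checker in Python, with hypothetical column names loosely echoing the soft_credits case above:

```python
def nondeterministic_columns(select_cols, group_by_cols, aggregated):
    """Simplified ONLY_FULL_GROUP_BY check: every selected, non-aggregated
    column must appear in GROUP BY. (Ignores the functional-dependency
    analysis that real MySQL performs.)"""
    return [c for c in select_cols
            if c not in aggregated and c not in group_by_cols]

# SELECT contact_id, SUM(total_amount), soft_credits ... GROUP BY contact_id
# -> soft_credits is neither grouped nor aggregated, so the mode rejects it.
bad = nondeterministic_columns(
    select_cols=["contact_id", "total_amount", "soft_credits"],
    group_by_cols=["contact_id"],
    aggregated=["total_amount"],
)
assert bad == ["soft_credits"]

# Adding the column to GROUP BY (or wrapping it in an aggregate) complies --
# but, as the regression shows, mechanically "fixing" a working query this
# way can change what the query returns.
ok = nondeterministic_columns(
    select_cols=["contact_id", "total_amount", "soft_credits"],
    group_by_cols=["contact_id", "soft_credits"],
    aggregated=["total_amount"],
)
assert ok == []
```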
**Recommendations going forwards**
The whole patch release process is in a lot of flux, so it's expected we will have some teething issues. Developing a clearer process / set of expectations is the next step.

Milestone: 5.1.1

# Plan to get our PR queue to a 'to-do list'

Issue: https://lab.civicrm.org/eileen/Product-maintenance/-/issues/9 | Author: eileen | Updated: 2018-06-14

**Reviewing contributed work**
Reviewing contributions essentially means looking at the PR queue. At the time of writing there are 149 pull requests in the queue. This probably equates to upwards of 300 hours of work that has already been done, and upwards of 300 hours of work to get them all reviewed. Most of the work done was probably funded, but perhaps none of the hours to get it reviewed are.
This resource gap is not easily solved, so we need to think about efficiency within it. A specific issue is that, in a queue with ~150 PRs, it can take as long to find a reviewable PR as to actually review it.
**Here are some principles I propose**
1) **Recognise our reviewer resources are our least funded and scarcest resources.** We need to be respectful of them.
* Not only is review work generally unfunded, it is time consuming, and it involves taking on responsibility for a change that is not to the reviewer's benefit but which they will surely have to ensure is fixed if it causes a problem, potentially bearing the brunt of any criticism from the community.
* Reviewers will make mistakes and we need to accept that in a way that does not reduce the number of people prepared to do review.
2) **The PR queue should be a place for PRs currently active or fully reviewable.**
* PRs suck up time just by existing. There are a number of PRs in the queue which I have spent maybe 6 minutes looking at & concluded that I wouldn't review, because I would never merge them without a test & there is not one. The problem is I might have spent those 6 minutes 20 times over the lifetime of a PR.
3) **A PR does not have to be open for discussion to take place or for it to be accessible later.**
* Where a PR cannot be resolved easily we should consider the primary place to track it as being gitlab & ensure it is linked to an issue there. If the PR is closed, it & the discussion on it can still be found from gitlab.
4) **We should close PRs that are not ready for review**
* If we look at a PR & think 'I wouldn't merge this without a unit test' we should make that comment on the PR and close the PR with a request to re-open it when there is a test.
* If the PR is not truly recent & has become a discussion piece we should close it with a request that it is re-opened when the discussion has resolved.
* If a PR is stale we should try to close it & ask for it to be re-opened.
* If the quality of code is not really there we should consider closing & doing any mentorship offline
* If the approach is not right we should consider closing & requesting a new PR with a new approach
* If a PR is truly active we should hold off (new PRs often generate discussion & we should encourage that, but once they are no-longer near the top of the list ad hoc discussion becomes unlikely)
* The PR queue is nobody else’s to-do list. Sometimes people track their own work through their own open PRs. That’s not a legitimate use of the PR queue.
* WIP PRs should only stay open for a short period.
5) **We should be careful about reviewers writing tests for other people’s PRs (or taking on fixing the bug if the PR is not up to scratch)**
* It has been common for reviewers to write tests for other people's PRs. If someone submits a bug fix that we would not merge without a test, we have often written them. However, we need to recognise that we are prioritising limited resources, and consider closing with a suggestion to re-open when a test is present unless other factors apply. (There are numerous reasons why a reviewer might still choose to write a test, but they should at least feel the option is theirs to request a test without having to leave the PR in the queue.)
6) **We should identify & prioritise PRs that have unit tests**
7) **We should identify & prioritise PRs that fix bugs**
8) **We should identify & prioritise PRs where ‘most’ of the work has been done**
* In general by the time a PR has been submitted somewhere between 10% & 90% of the work on that bug has been done. Try to identify & focus on those closer to 90% & close those closer to 10%.
9) **We should identify & prioritise PRs by people who act as reviewers themselves**
* This is out of respect for the work that reviewers do, and in recognition of the fact that a lot of their PRs are in response to that work rather than their own paid work anyway.
10) **We should identify & prioritise PRs that have had endorsement from community members**
In particular this applies when the endorsement says something like 'I've tested this & it works', as the endorser has already done part of the job of reviewing the PR.
11) **We should identify & prioritise PRs by people who have a track record of engaging in review & treating reviewers with respect**
* It is not infrequent for a reviewer to find that the PR submitter feels that, by submitting the PR, they have done their bit and does not wish to put any more time into the PR after that point. This places an implicit value on their time and none on the reviewer's time. However, the majority of submitters DO appreciate the time reviewers put into doing review. It is valid for reviewers to focus their efforts based on return on their effort. (Note I think any discussion based on this should focus on the positive: 'reviewing Matt's PR would be a good idea since he is always really responsive'. I'm sure most reviewers have a mental list of people they have been stung by, but discussing those is not helpful.) In the bigger picture we need to educate people that submitting a patch for review is not job done but job started.
12) **We should try to improve our review soft skills**
We often see PRs that have had a lot of discussion, sometimes on trivial details, without big-picture review. In some cases reviewers have had people change code over fairly stylistic issues without raising bigger suggestions like 'can we share this code with the code in class x?'. In other cases the problem is that the actual motivation for the change is not agreed. We need to get better at describing the big picture of a PR in some instances, and ask ourselves 'what would it take to get this merged?'
13) **We should provide a way for people to pay for PR attention**
This has been proposed as 2-hour pre-purchased blocks of core team time to review a PR. It does not involve a commitment to merge: it may take the form of discussion or a substitute fix, and may result in a quote for more time if needed. But it does mean that the CT member will spend 2 hours within a week on that issue or issues. (We should set up an easy mechanism for this if people do wish to take it up.)
14) **We should try to give timely feedback on what might be blocking a PR**
This is really a PR triage process. Often we are reluctant to comment on PRs because we know that by doing so we may get drawn in. This ties into the next point.
15) **We should meet weekly to triage the PR queue**
Often it’s really hard to do the above things in isolation. I propose we have a weekly meeting (or possibly 2 to accommodate timezones) to do PR queue triage.
**Figure out how this space could work!** (eileen, 2017-08-24) — https://lab.civicrm.org/eileen/Product-maintenance/-/issues/1

My thinking here is that we have about half a dozen people consistently and actively engaged in product maintenance and another dozen or so irregularly engaged at a fairly productive level. We also have some specific offers of assistance that we have not taken up.
I feel that the people who are doing this work are collaborating very well but not necessarily prioritising and co-ordinating well. For example, the top 3 goals in my opinion are:
1. To get critical issues down to x
2. To get pull requests down to y
3. To figure out a constructive, blame-free way of diagnosing and learning from regressions (and to draw a line in the sand when the process kicks in and only worry about fixing regressions prior to that point). Note that this goes beyond tech & involves ensuring we are triaging incoming issues. Also we should know that we are always fixing regressions for the next possible release. I feel this happens pretty well once they are identified.
I deliberately left the numbers blank because I feel those should be set collaboratively, but I would like to see a team goal of 'let's get critical issues down to x by this date'.