Performance change approved - remove mode & median slow queries
@colemanw below....Update - simply removing per decision by
The calculations done for the summary statistics on the contribution search is currently the biggest point of slowness we are facing. The following stats are generated:
'count' (number completed) 'amount' (total amount of completed) 'avg' (average amount of completed) 'mode' (most common value of completed) 'median' (median value of complete) 'cancel_amount' (total of cancelled) 'cancel_count' (number cancelled) 'cancel_avg' (average amount of cancelled)
These are then grouped by currency.
Of these the count and total amount are highly useful whereas the usefulness of mode & median are more niche.
On the other hand it is possible to fix up MOST of these queries to perform well. Mode and median are pretty much impossible to make performant on a large result set, and are actually the main reason our users have to do carefully constrained queries that don't return more than around 50k of results.
Let's assume we do a query that returns 50,000 results and the criteria is the payment_instrument_id then ideally that field will be used as the index on the query and we get the main query returned pretty quickly.
However, in order to do a median query it is necessary to order the results by total_amount. We can't use both indexes, and we can't add combined indexes for every combination of total_amount & possible criteria so we are one way or another going to be either
- using the total_amount index to sort & doing an unindexed filter - on our whole DB....
- using the index to filter & doing an unindexed sort on 50k rows
- using a merged index (these take time to compile)
Some queries are possible to rewrite - I recently got the annual query down from 6 seconds to .02 seconds but the median query just isn't every going to scale well.
Which leaves us with 'how can we help sites that with users are not happy twiddling their thumbs while the median is calculated'.
There are a few options IMHO
- just remove median & mean - no-one (by which I mean me) cares about them anyway
- add a setting that allows a site to specify which contribution stats they want calculated
- only calculate median & mean if the total number of rows is < 1000
- add a hook to permit the queries that run to be altered
Of these both 1 & 3 are imposing change on people who might not want change.
Adding a setting seems like a hack. In general adding settings to tweak core behaviour for a different use case/ preferences is almost always a hack - although the use case 'I have a big database' is perhaps a bit more generic than the 'I'd like this page/search bar / widget to behave differently & the least hassle on me is to add a setting'
I do think, however, that 4 is probably the least hacky / most sensible and it will 'gracefully retire itself' when we finally get a better search screen that doesn't use the query object.
I think it would look like
hookAlterQuerySummary($entity, $context, &$callbacks);
(only Contribution is relevant at the moment but passing entity seems to make sense)
We would have to break out the existing queries to their own fns & then we'd get
$callbacks = [ 'CRM_Contact_BAO_Query::getBasicStats', 'CRM_Contact_BAO_Query::getMedian', 'CRM_Contact_BAO_Query::getMean', 'CRM_Contact_BAO_Query::getCancelStats' ]
There would then be a call like
CRM_Contact_BAO_Query::getBasicStats($rowStats, $whereClause, $fromClause)
And $rowStats would be altered by the function adding values & labels so the tpl could iterate through them
Longer term - I would argue the slow stats should probably be ADDED rather than REMOVED by extension as I think shipping something that makes hard-to-justify performance trade-offs is a big call