Skip to content

GitLab

  • Projects
  • Groups
  • Snippets
  • Help
    • Loading...
  • Help
    • Help
    • Support
    • Community forum
    • Submit feedback
    • Contribute to GitLab
  • Sign in
C
Core
  • Project overview
    • Project overview
    • Details
    • Activity
    • Releases
  • Repository
    • Repository
    • Files
    • Commits
    • Branches
    • Tags
    • Contributors
    • Graph
    • Compare
  • Issues 933
    • Issues 933
    • List
    • Boards
    • Labels
    • Service Desk
    • Milestones
  • Operations
    • Operations
    • Incidents
  • Analytics
    • Analytics
    • Repository
    • Value Stream
  • Wiki
    • Wiki
  • Snippets
    • Snippets
  • Members
    • Members
  • Collapse sidebar
  • Activity
  • Graph
  • Create a new issue
  • Commits
  • Issue Boards
  • Development
  • Core
  • Issues
  • #720

Closed
Open
Opened Feb 13, 2019 by eileen@eileen🎱Owner

Performance change approved - remove mode & median slow queries

Update - simply removing per decision by @colemanw below....

Original

The calculations done for the summary statistics on the contribution search is currently the biggest point of slowness we are facing. The following stats are generated:

'count' (number completed) 'amount' (total amount of completed) 'avg' (average amount of completed) 'mode' (most common value of completed) 'median' (median value of complete) 'cancel_amount' (total of cancelled) 'cancel_count' (number cancelled) 'cancel_avg' (average amount of cancelled)

These are then grouped by currency.

Of these the count and total amount are highly useful whereas the usefulness of mode & median are more niche.

On the other hand it is possible to fix up MOST of these queries to perform well. Mode and median are pretty much impossible to make performant on a large result set, and are actually the main reason our users have to do carefully constrained queries that don't return more than around 50k of results.

Let's assume we do a query that returns 50,000 results and the criteria is the payment_instrument_id then ideally that field will be used as the index on the query and we get the main query returned pretty quickly.

However, in order to do a median query it is necessary to order the results by total_amount. We can't use both indexes, and we can't add combined indexes for every combination of total_amount & possible criteria so we are one way or another going to be either

  • using the total_amount index to sort & doing an unindexed filter - on our whole DB....
  • using the index to filter & doing an unindexed sort on 50k rows
  • using a merged index (these take time to compile)

Some queries are possible to rewrite - I recently got the annual query down from 6 seconds to .02 seconds but the median query just isn't every going to scale well.

Which leaves us with 'how can we help sites that with users are not happy twiddling their thumbs while the median is calculated'.

There are a few options IMHO

  1. just remove median & mean - no-one (by which I mean me) cares about them anyway
  2. add a setting that allows a site to specify which contribution stats they want calculated
  3. only calculate median & mean if the total number of rows is < 1000
  4. add a hook to permit the queries that run to be altered

Of these both 1 & 3 are imposing change on people who might not want change.

Adding a setting seems like a hack. In general adding settings to tweak core behaviour for a different use case/ preferences is almost always a hack - although the use case 'I have a big database' is perhaps a bit more generic than the 'I'd like this page/search bar / widget to behave differently & the least hassle on me is to add a setting'

I do think, however, that 4 is probably the least hacky / most sensible and it will 'gracefully retire itself' when we finally get a better search screen that doesn't use the query object.

I think it would look like

hookAlterQuerySummary($entity, $context, &$callbacks);

(only Contribution is relevant at the moment but passing entity seems to make sense)

We would have to break out the existing queries to their own fns & then we'd get

$callbacks = [ 'CRM_Contact_BAO_Query::getBasicStats', 'CRM_Contact_BAO_Query::getMedian', 'CRM_Contact_BAO_Query::getMean', 'CRM_Contact_BAO_Query::getCancelStats' ]

There would then be a call like

CRM_Contact_BAO_Query::getBasicStats($rowStats, $whereClause, $fromClause)

And $rowStats would be altered by the function adding values & labels so the tpl could iterate through them

Longer term - I would argue the slow stats should probably be ADDED rather than REMOVED by extension as I think shipping something that makes hard-to-justify performance trade-offs is a big call

Edited Feb 18, 2019 by eileen
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Assignee
Assign to
5.12.0
Milestone
5.12.0 (Past due)
Assign milestone
Time tracking
None
Due date
None
Reference: dev/core#720