Dedupe performance & hooks
I recently did some work on improving the dedupe performance where 2 required fields are in the same table and was able to reduce the time taken for the query from an unknown number of hours to 170 seconds (less with an extra index) & there is scope for further improvement.
The PR is here https://github.com/civicrm/civicrm-core/pull/30591
However, my reservation about the PR as is is what it does is calculates the queries the old way, checks to see if any hooks have changed the somewhat obtuse array and if not it discards it and calculates the queries the more performant way. It does this because a hook got stuck in the middle of some janky code without thinking about the best way to do the hook (in fact it's one of those hooks that gets passed context
and the parameters and expectations are completely different if the context changes).
Note that there are 2 different duplicate flows which I'm thinking of as getDuplicates()
and findDuplicates()
- although the existing code / hooks don't reflect that terminology
What I am thinking to do is
a) use 2 new hooks
-
getDuplicates
- this hook would phase outfindDuplicates()
- which has a rough signature. -
findExistingDuplicates()
- this would be the hook when we are looking for existing pairs of duplicates in the database. I kinda wanted to call itfindDuplicatePairs()
but I feel like there is demand from a UI POV to be able to find tuples etc of duplicates and I don't think our hook name / design should preclude some later clever adaption there. I also wanted to call itfindDuplicates()
but there is a hook like that
b) move our existing implementation into an extension (legacy dedupe finder) which will continue firing the to-be-deprecated hooks for people. Implement the new faster one as a hook in core, but it would not run unless the extension is disabled - this means that unless people disable the extension they would not get the new behaviour - which feels kinda safer. We could do this by giving the legacy code a lower weight & I'm kinda thinking a stopPropagation
parameter to deter the non-legacy code if the legacy code runs. The new code would not run the legacy hooks - disabling the extension disables the calls to them.
There are only 2 current ingresses to the dedupe finder
dupesByParams()
- which would be for getDuplicates()
and
dupes()
- which maps to findDuplicates()
They have different expectations of what they get back (one gets an array of pairs of possible duplicates & the other gets an array of ids that are possible duplicates of the provided criteria) so I think it makes sense to tackle these 2 sides of the equation fairly separately (also good because my current focus is the findDuplicates()
whereas I think getDuplicates()
requires more scoping.
findExistingDuplicates()
Currently there is only 1 path into this code - the CRM_Core_BAO_PrevNextCache::refillCache()
calls CRM_Dedupe_Finder::dupesInGroup()
which calls CRM_Dedupe_Finder::dupes()
or it resolves v3 $criteria
into contact IDs and calls CRM_Dedupe_Finder::dupes()
directly.
- input =
$ruleGroupID
,$contactIDs
and$checkPermissions
- output = array of pairs of ids + weight - ie
[[1,2,10], [1,5,10]
It's worth taking a moment to understand what the contactIDs
mean in this context. They are primarily used to restrict the scope of the query - so in a database of (ahem) some millions of contact an array of contact IDs might hold 5000 contact IDs and an array of duplicate will be returned of pairs which match one of the contact IDs (ie if you passed in just contact ID [25] you would get returned pairs like [[25, 30, 10], [25,40,10]]
(but if 30 & 40 match each other they would not be included).
It's currently possible to pass in apiv3 style criteria via the url and via the command line or via the UI you can select a group which has a similar effect. The $limit
is also part of this - ie find pairs for the first 5000 contacts starting from contact ID 1,000,000 - the queries wind up looking like
SELECT id1, id2, weight FROM (SELECT t1.id id1, t2.id id2, 18 weight FROM civicrm_contact t1 INNER JOIN civicrm_contact t2 ON (t1.first_name IS NOT NULL AND t2.first_name IS NOT NULL AND t1.first_name = t2.first_name AND t1.first_name <> '' AND t2.first_name <> '') AND (t1.last_name IS NOT NULL AND t2.last_name IS NOT NULL AND t1.last_name = t2.last_name AND t1.last_name <> '' AND t2.last_name <> '') WHERE t1.contact_type = 'Individual' AND t2.contact_type = 'Individual' AND t1.id < t2.id AND t1.id IN (3,4,5,6,7,8,9)
UNION SELECT t1.id id1, t2.id id2,18 weight FROM civicrm_contact t1 INNER JOIN civicrm_contact t2 ON (t1.first_name IS NOT NULL AND t2.first_name IS NOT NULL AND t1.first_name = t2.first_name AND t1.first_name <> '' AND t2.first_name <> '') AND (t1.last_name IS NOT NULL AND t2.last_name IS NOT NULL AND t1.last_name = t2.last_name AND t1.last_name <> '' AND t2.last_name <> '') WHERE t1.contact_type = 'Individual' AND t2.contact_type = 'Individual' AND t1.id < t2.id AND t2.id IN (3,4,5,6,7,8,9)) subunion
I have some reservations about passing contactIDs
(just because it could be big array) & wondered about passing out a query object that could be used to generate the list of IDs or simply the name of a temp table the IDs have been inserted into (ready for creating queries against). Although, notably in the example above any temp table would not be double-joinable so perhaps contactIDs is the right parameter - or just $ids maybe.
I'm not sure we have much precedent for $stopPropagation
but I feel it's probably needed. Also checkPermissions
is required because it's not just the passed in contact, but the one they match to that needs checking.
public static function findExistingDuplicates(&$duplicates, array $contactIDs, int $ruleGroupID, bool $checkPermissions, bool $stopPropagation, bool $isOnlyIncludeMatchesWithinContactIDs) {
getDuplicates()
There is an existing hook in the getDuplicates()
flow - called findDuplicates()
it doesn't necessarily need to change - the legacy extension & eventual core new handler could just implement that hook. Or perhaps the performance improvement is mostly about the other flow & we can fix the other while ignoring this & just split the 2 paths. However a couple of things it would be good to change about the existing hook in a replacement
Note if we do replace it it would be a case of implementing the new hook & having it call the old hook in the legacydedupefinder extension (enabled on existing & new installs & hidden to start with)
- I think the
$dedupeParams
is a bit crazy - it has a list of parameters in apiv3 style in a flat array with other values like check_permission. I think theparameters
should be their own variable & should be in apiv4 style. Where I struggle is that we don't want things likegender_id:name
to get passed through and we maybe want arrays converted to strings. - I think we should ditch
rule
and instead pass$rule_group_id
as an array - ie the import code at least partially supports (probably not in the UI at this point) a cascading list of rules to try. - I think excluded_ids not be part of dedupeParams - it basically holds the logged in user & perhaps the form contact. I suspect the hook does not actually need this it just gets it because it always has - I'd be inclined to leave it out on the first iteration of a replacement hook & add it in if there is demand (since the old hook can still be used for some time - ie via the legacydupefinder extension.
- Do we want
$entity
as a parameter - That context is nasty - it can hold
event_id
& nothing else & is called from one place - I would probably go with not making it part of the new hook unless requested & scope it at that point as I suspect it is not used & dates right back. - The equivalent of
stopPropagation
in the hook is to sethandled
as a key in the array - ug - I wonder if the returned ids should have
weight
attached do denote better matches
Dedupe implementations
I checked universe & for the hooks I want to deprecate out into the legacy finder I found one use consumer (number 3 below). I am aware that the other hook dupeQuery()
has at least one user from discussions on chat - but their specific usage is in the supported_fields
context (and could probably as of 5.78 be done using the schema hook) - so I think there is just the one consumer to check out during any change
I'm aware of multiple dedupe implementations out there so notes on the ones I checked
-
Deduper - this is obviously the one we use for WMF - it does nothing on the dedupe finder side but provides an alternate dedupe UI and a bunch of conflict resolvers. It also has a concept described in Xdedupe as a 'picker' - ie various ways of chosing which contact has 'better' information. The functionality is implemented during merge resolution - using the truly awful hook in play there - but that is out of scope. The WMF practice is to run a cron all the time attempting to dedupe contacts with the same email, provided any conflicts can be resolved by the resolvers. Other deduping is done via the deduper UI using CiviCRM dedupe rules
-
XDedupe - this does not interact with the main CiviCRM deduper hooks at all. It has an alternate UI for finding duplicates and alternate sql for finding them. The resolvers and pickers it implements only kick in via it's own merge code & mechanisms. It does not interact with any core hooks. Note that the finders it implements handle limiting in a fundamentally different way - on our staging I tried finding contacts in a particular smart group based on first name + last name matching. Using the core civi rules (with the patch that started this conversation to get it to run performantly) 24,973 duplicate pairs were found. Using the Xdupe code 47 'tuples were found' - all but 3 being pairs. The fundamental difference is the core code finds matches in the whole database for the contacts in the group - so with a loose first-name + last-name match on 7k contacts 24k is reasonable. The Xdedupe search only found contacs where BOTH are in the group. I think there are use cases for both (for us we are more interested in the first - ie find me all the contacts anywhere in our database I might need to merge into this group I'm about to email, before I email them). Perhaps we should design our hooks to have at least the potential to later add a parameter for the type of limiting the contactIDs denotes because it would be easy enough to implement either in most cases
-
Dedupe customisations - customises event forms, profiles, contribution pages by allowing profile selection. This is the only implementer I found of the annoying
findDuplicates()
hook. My heart goes out to @sunil who wrote it as he had to search the back trace to figure out which form called the hook - I think it should be possible to make changes without affect the implementation but it also feels like the functionality would reasonably be in core - although that would probably require schema changes & hence be a bit much. -
Automatic deduplication job - this was written 11 years ago so might have fulfilled a different role then but it seems to overlap pretty substantially with the core batch dedupe job. It has some resolvers although less than the first 2 extensions but adds an interesting new 'picker'
group_member_is_canonical=123
-
Dedupe Monitor This extension configures a background job to scan for duplicates against the various duplicate rules. When it finds duplicates it puts them into a 'batch' - which appears to be a Group in CiviCRM that relates to that rule and can then be manually reviewed. The extension says it saves the results on this 'batches' - so far I think what it means is that it puts the contacts into groups which should be quick to load the dedupe rule against (as those for whom it does not apply are excluded) rather than that it stores actual matches. There is a dashlet to expose the existence of these groups.