Dedupe SUBSTR scalability problems
When a field-length restriction is used in dedupe rules, the rules tend not to complete even on medium-sized sites.
- On a site with 70k contacts, a dedupe rule was defined that included first name (length 1), last name, email, phone, and postal code.
- Checking for dedupes times out after 20 minutes. On other runs it also took down MySQL with an out-of-memory error and caused php-fpm restarts.
- Changing the rule to first name (no length restriction), last name, email, phone, and postal code yields 18.5k potential duplicate pairs in 3 minutes.
- FWIW, using Eileen's deduper extension didn't help with a batch limit of 1000, but did improve things with a search limit of 1000.
Observed behaviour: timeouts, MySQL restarts, and php-fpm restarts when using length restrictions on fields.
Expected behaviour: no timeouts or process restarts on medium-sized databases, even when using length restrictions on fields.
The problem seems to stem from the join condition applying a substring function, which MySQL is well known to be unable to optimize with an index: ON (SUBSTR(t1.first_name, 1, 1) = SUBSTR(t2.first_name, 1, 1)). Instead of this approach, one could populate a temporary table that includes SUBSTR(t.first_name, 1, 1) as an indexed column, so that the join can use the index. While the temp tables we use in reports do have the potential to cause their own issues, I believe those issues would be much smaller than the current situation here.
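As a sketch of the proposed approach (table and column names are illustrative, not the actual dedupe query CiviCRM generates):

```sql
-- Sketch only. Note one MySQL caveat: a TEMPORARY table cannot be referred
-- to twice in the same query, so a self-join needs either a plain scratch
-- table (as here, dropped afterwards) or two temp copies.
CREATE TABLE tmp_dedupe_keys (
  contact_id   INT UNSIGNED NOT NULL,
  first_name_1 VARCHAR(1),     -- SUBSTR(first_name, 1, 1), precomputed
  last_name    VARCHAR(64),
  email        VARCHAR(254),
  INDEX idx_keys (first_name_1, last_name, email)
)
SELECT c.id AS contact_id,
       SUBSTR(c.first_name, 1, 1) AS first_name_1,
       c.last_name,
       e.email
FROM civicrm_contact c
LEFT JOIN civicrm_email e ON e.contact_id = c.id AND e.is_primary = 1;

-- The pair-finding join now compares plain indexed columns instead of
-- evaluating SUBSTR() for every candidate row pair:
SELECT t1.contact_id, t2.contact_id
FROM tmp_dedupe_keys t1
JOIN tmp_dedupe_keys t2
  ON  t1.first_name_1 = t2.first_name_1
  AND t1.last_name    = t2.last_name
  AND t1.email        = t2.email
  AND t1.contact_id   < t2.contact_id;

DROP TABLE tmp_dedupe_keys;
```

The `contact_id < contact_id` condition avoids self-matches and emits each pair only once.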
We don't currently have funding to work on this. Nonetheless, I thought it would be useful to report the problem and ask whether there is approval in concept for using temp tables with indexed fields for dedupe search criteria based on substrings.
I've labelled this as both improvement and bug since it seems to straddle the divide.