Revisiting deadlocks
Some time ago we added handling to the DataObject class to catch and retry deadlocks. This seems to work and be helpful in some cases but in others it doesn't work and is possibly harmful. I want to re-open the discussion on how & where we catch them & roll them back.
Why do we have deadlocks Deadlocks are a 'natural' part of mysql scalability. Mysql tries to manage contention under load to the extent that can be done at the DB layer, allowing tables & rows to be locked & queuing transactions that need to use those resources to complete after that. However, in some cases mysql cannot resolve it and it returns a deadlock error. The expectation is that the application layer will handle & retry.
For example the query in this PR was causing deadlocks - https://github.com/civicrm/civicrm-core/pull/16080 - here is how I think the flow worked - note the queries are not slow but this is under high volume.
- contact create is called by 3 different processes in pretty quick succession
- The query to update the employer name field is called by all 3
- the first query 'gets the lock' - the other 2 get shared locks while they wait
- the first query finishes. Neither of the other 2 can get the exclusive lock as they both have shared locks
- one is reverted & a deadlock is thrown
- the deadlock is caught in packages/DB/Dataobject
- the Dataobject code retries the query and succeeds. The transaction succeeds.
When don't we do a good job with deadlocks The above scenario is good because once the deadlock was resolved we were able to retry and it worked. However, it turns out that mysql often rolls back more than just the query it was doing when it decided there was a deadlock. Per https://dev.mysql.com/doc/refman/8.0/en/innodb-deadlock-detection.html " InnoDB automatically detects transaction deadlocks and rolls back a transaction or transactions to break the deadlock."
We are seeing a consistent pattern where under high volume we see deadlocks on INSERT INTO civicrm_email
- via contact.create api (called in turn from a job api by drush). The INSERT email fails but on retry it hits a constraint error - because unknown to the php layer the INSERT INTO civicrm_contact
statement was also rolled back.
In this case the DB state is hopefully still consistent as the failure should have triggered more rollbacks but it's at least theoretically possible to have lost data in the rollback & then have the query succeed when retried - so the database is in an inconsistent state.
My high level conclusion is that we should be catching deadlocks & retrying higher up the stack - but I'm still trying to figure out where.