Does CiviCRM work well with database replication and clustering?
There have been a few discussions on Mattermost in recent months in which people have described problems running Civi in replicated / cluster environments. In particular, Civi's use of temporary tables has caused issues. E.g.
10 Oct 2017 eileen:
We have been having some replication lag issues & one metric we are investigating is the slaves (only) seem to show a large number of open temp tables on one metric. The odd thing about the replication lag is it seems to get worse the longer since a server restart (presumably just the mysql part needs to be restarted).
10 Oct 2017 jackrabbithanna:
Ok we're dealing with Percona DB
Database Error Code: Statement violates GTID consistency: CREATE TEMPORARY TABLE and DROP TEMPORARY TABLE can only be executed outside transactional context. These statements are also not allowed in a function or trigger because functions and triggers are also considered to be multi-statement transactions., 1787
followed by a php error: Fatal error: Uncaught CRM_Core_Exception: [0: Transaction integrity error: Expected to find active frame thrown
Per https://dev.mysql.com/doc/refman/5.7/en/replication-gtids-restrictions.html the multi-statements transactions are not allowed as they could potentially run using the same global transaction ID (GTID)
this error happens when trying to create a Drupal user, with CiviCRM profile on the form
Further comments from Eileen:
OK - I'll have a read of that - the CiviCRM usage of temp tables is pretty common & in particular ties in with the way mailing recipients are calculated & rolled back. However, I'm also reading concerns that replicated DBs don't like temp tables (although it's not blocked in the version we have)
(we use mixed mode replication)
I think it's kind of a big picture question isn't it? I had thought to move all temp table creations to a single function (so if we want to do something different we can do it in there) - the thing is a lot of them are used to calculate group_contact_cache & those will cause rollback to fail
"A recommended practice when using statement-based or mixed-format replication is to designate a prefix for exclusive use in naming temporary tables that you do not want replicated, then employ a --replicate-wild-ignore-table option to match that prefix. For example, you might give all such tables names beginning with norep (such as norepmytable, norepyourtable, and so on), then use --replicate-wild-ignore-table=norep% to prevent them from being replicated." - we could at least do that for reports etc
if you are using statement based replication or mixed replication then any temp tables used to build real tables have to be replicated
or else the query will fail on the slave
it's an improvement basically 'make civicrm support replication'
it does work with replication but I think that's more because various people have just configured it to do so
but I don't think there has ever been any official support for that or effort to understand the design implications
13 Oct eileen:
... our replication issue DOES seem to be related to the temp tables created for civimail at this stage. ... What we are seeing is the DROP statement is written to the binlog BEFORE the create statement is - somewhat in line with https://www.xaprb.com/blog/2007/01/20/how-to-make-mysql-replication-reliable/
(not really clear from there but it says "As I mentioned, each transaction goes into the binlog in the order in which it commits, not in the order in which it is started, so transactions may execute in a different order on the replica. ")
the binlog is parsed on the slaves & they temp tables mount up
not 100% sure if this will solve all our woes - but not letting civicrm use temp tables to create recipient lists seems to be important
We ran a bunch of Civi sites with MySQL replication several years ago and had a lot of problems with it: replication would fall over periodically and then everything would grind to a near standstill as it was re-sync'd.
As a starting point, it would be useful to gather some data on how many people use replication / clustering with Civi, how frequently problems are encountered, what the most common problems are and whether they have been solved.