Or: How do you reach the conclusion your MySQL master/intermediate-master is dead and must be recovered?
This is an attempt at making a holistic diagnosis of our replication topologies. The aim is to cover obvious and not-so-obvious crash scenarios, and to be able to act accordingly and heal the topology.
At Booking.com we deal with a very large number of MySQL servers. We have many topologies, and many servers in each topology; see past numbers to get a feel for the scale. At these numbers, failures happen frequently. Typically we see normal slaves failing, but occasionally -- and far more frequently than we would like to be paged for -- an intermediate master or a master crashes. Our current (and ever-in-transition) setup also includes SANs, DNS records, and VIPs, any of which can fail and bring down our topologies.
Having tackled issues of monitoring, disaster analysis and recovery processes, I feel safe in making the following claims:
- The fact your monitoring tool cannot access your database does not mean your database has failed.
- The fact your monitoring tool can access your database does not mean your database is available.
- The fact your database master is unwell does not mean you should fail over.
- The fact your database master is alive and well does not mean you should not fail over.
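The four claims above boil down to one idea: the monitor's own view of the master must be cross-checked against what the rest of the topology observes. Here is a toy decision function illustrating that idea -- my own sketch, not Booking.com's actual tooling; the name `should_fail_over` and its inputs are hypothetical. The replicas' view could come from, say, the `Slave_IO_Running` field of `SHOW SLAVE STATUS`.

```python
def should_fail_over(monitor_sees_master, replica_io_threads_running):
    """Decide whether the master is truly dead and needs recovery.

    monitor_sees_master: bool -- can the monitoring tool connect to the master?
    replica_io_threads_running: list of bool -- per replica, is its
        replication IO thread still connected to the master?
    """
    if not replica_io_threads_running:
        # No replicas to corroborate with; trust the monitor's view alone.
        return not monitor_sees_master

    replicas_see_master = any(replica_io_threads_running)

    if monitor_sees_master:
        # The master answers the monitor. Even if some replicas are
        # broken, that is a replication problem, not a dead master.
        return False

    # The monitor cannot reach the master: fail over only if the
    # replicas agree it is gone -- otherwise the problem may be a
    # network partition on the monitor's side.
    return not replicas_see_master
```

For example, a master unreachable by the monitor but still feeding its replicas (`should_fail_over(False, [True, True])`) is a monitoring problem, not a failover trigger; a master unreachable by both monitor and replicas is.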
Bummer. Let's review a simplified topology with a few failure scenarios. Some of these scenarios will be familiar to you; others may be caused by setups you're not using. I would love to say I've seen it all, but the more I see, the more I learn how strange things can become.