What's so complicated about a master failover?

June 29, 2017

The more I work on orchestrator, the more user input I receive and the more production experience I gain, the more insights I get into MySQL master recoveries. I'd like to share the complexities in correctly running general-purpose master failovers; from picking the right candidates to finalizing the promotion.

The TL;DR is: we're often unaware of just how many ways things can go at the time of failover, and of the impact of every single decision we make. Different environments have different requirements, and different users wish to have different policies. Understanding the scenarios can help you make the right choice.

The scenarios and considerations below are ones I picked while browsing through the orchestrator code and through Issues and questions. There are more. There are always more scenarios.

I discuss "normal replication" scenarios below; some of these will apply to synchronous replication setups (Galera, XtraDB Cluster, InnoDB Cluster) where using cross DC, where using intermediate masters, where working in an evolving environment.

orchestrator-wise, please refer to "MySQL High Availability tools" followup, the missing piece: orchestrator, an earlier post. Some notions from that post are reiterated here.

Who to promote?

This is largely covered by the missing piece post (skip this section if you've read it); consider the following (a minimal candidate-filtering sketch follows the list):

  • You run with a mixed-versions setup. Your master is 5.6, most of your replicas are 5.6, but you've upgraded a couple of replicas to 5.7. You must not promote those 5.7 servers, since you cannot replicate 5.7->5.6. You may lose these servers upon failover.
  • But perhaps by now you've upgraded most of your replicas to 5.7, in which case you prefer to promote a 5.7 server in the event the alternative is losing your 5.7 fleet.
  • You run with both STATEMENT based and ROW based replication. You must not promote a ROW based replica, because a STATEMENT based server cannot replicate from it. You may lose ROW servers during the failover.
  • But perhaps by now you've upgraded most of your replicas to ROW, in which case you prefer to promote a ROW server in the event the alternative is losing your ROW fleet.
  • Some servers can't be promoted because they don't use binary logging or log_slave_updates. They could be lost in action.
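
To make the above concrete, here is a minimal Python sketch that weighs candidates by how many replicas each promotion would sacrifice. This is not orchestrator's actual logic; the server attributes are assumed to have been collected elsewhere (e.g. from @@version, @@binlog_format, @@log_bin and @@log_slave_updates on each replica), and the hostnames are made up.

    # Weigh promotion candidates by how many replicas each choice would lose,
    # following the constraints above (not orchestrator's actual algorithm).

    def version_tuple(v):
        """'5.7.18-log' -> (5, 7)"""
        return tuple(int(x) for x in v.split("-")[0].split(".")[:2])

    def losses_if_promoted(candidate, replicas):
        """Count replicas that could not replicate from this candidate."""
        lost = 0
        for other in replicas:
            if other is candidate:
                continue
            if version_tuple(other["version"]) < version_tuple(candidate["version"]):
                lost += 1  # e.g. a 5.6 replica cannot replicate from a 5.7 master
            elif candidate["binlog_format"] == "ROW" and other["binlog_format"] == "STATEMENT":
                lost += 1  # a STATEMENT based server cannot replicate from a ROW master
        return lost

    def pick_candidate(replicas):
        # Servers without binary logging / log_slave_updates cannot be promoted at all.
        promotable = [r for r in replicas if r["log_bin"] and r["log_slave_updates"]]
        # Prefer the candidate whose promotion loses the fewest replicas.
        return min(promotable, key=lambda r: losses_if_promoted(r, replicas), default=None)

    replicas = [  # hypothetical fleet
        {"host": "db2", "version": "5.6.35", "binlog_format": "STATEMENT",
         "log_bin": True, "log_slave_updates": True},
        {"host": "db3", "version": "5.7.18", "binlog_format": "ROW",
         "log_bin": True, "log_slave_updates": True},
        {"host": "db4", "version": "5.6.35", "binlog_format": "STATEMENT",
         "log_bin": False, "log_slave_updates": False},
    ]
    print(pick_candidate(replicas)["host"])  # db2: promoting db3 would lose two replicas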

It is noteworthy that MHA solves the above by syncing relay logs across the replicas. I made an attempt at doing the same for orchestrator, but was unsatisfied with the results and am wary of hidden assumptions. I do not expect to continue working on that.

Intermediate masters

Recovery of intermediate masters, while simpler, brings many more questions to the table.

  • An intermediate master crashes. What is the correct handling? Should you move all of its orphaned replicas under another server? Or does this group of replicas form some pact that must stick together? Perhaps you insist on promoting one of them on top of its siblings.
  • On failure, you are likely to prefer a promoted server from the same DC; this will have the least impact on your application.
    • But are you willing to lose a server or two to make that so? Or do you prefer switching to a different DC and not losing any server in the process?
    • A specific, large orchestrator user actually wants to fail over to a different DC. Not only that: the customer then prefers flipping all other cluster masters to the other DC (a full-DC failover).
  • An intermediate master had replication filters (scenario: you were working to extract and refactor a subtree onto its own cluster, but the IM crashed before you did so)
    • What do you do? Did you have all subsequent replicas running with the same filters? (A quick consistency-check sketch follows this list.)
    • If not, do you have the playbook to do so at time of failure?
  • An intermediate master was writable. It crashed. What do you do? Who is a legitimate replacement? How can you even reconnect the subtree with the main tree?
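
Regarding the replication filters question above, here is a rough audit sketch (not orchestrator code): it checks whether every replica in a subtree runs with the same filters, so that promoting any of them would preserve the filtered dataset. The filter values are assumed to have been read from each replica's SHOW SLAVE STATUS output; the hostnames are made up.

    # Check that all replicas of an intermediate master share the same replication filters.
    FILTER_FIELDS = ("Replicate_Do_DB", "Replicate_Ignore_DB",
                     "Replicate_Do_Table", "Replicate_Wild_Do_Table")

    def filters_consistent(replica_statuses):
        """replica_statuses: {hostname: {SHOW SLAVE STATUS field: value}}"""
        signatures = {
            host: tuple(status.get(f, "") for f in FILTER_FIELDS)
            for host, status in replica_statuses.items()
        }
        return len(set(signatures.values())) <= 1, signatures

    ok, signatures = filters_consistent({
        "db10": {"Replicate_Do_DB": "shard42"},
        "db11": {"Replicate_Do_DB": "shard42"},
        "db12": {},  # no filters configured: inconsistent with its siblings
    })
    print(ok)  # False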

Candidates

  • Do you prefer some servers over others? Some servers have stronger hardware; you'd like them to be promoted if possible.
    • orchestrator can juggle that to some extent.
  • Are there servers you never want to promote? Servers used by developers; used for backups with open logical volumes; weaker hardware; a DC you don't want to fail over to; ...
    • But then again, maybe it's fine for those servers to act as intermediate masters? That is, they must not be promoted as masters, but are fine to participate in an intermediate master failover?
  • How do you even define the promotion types for those servers?
    • We strongly prefer to do this live. Service discovery dictates the type of "promotion rule"; a recurring cronjob keeps updating orchestrator with the server's choice of promotion rule (sketched after this list).
    • We strongly discourage configuration based rules, unless for servers which are obviously-never-promote.
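
A hedged sketch of that cronjob approach: derive the promotion rule from local facts or service discovery, then advertise it to orchestrator. The orchestrator address and the two helper checks are assumptions; the rule names (prefer / neutral / prefer_not / must_not) follow orchestrator's promotion rules, but verify the exact register-candidate endpoint against your version's documentation.

    # Hypothetical cronjob: decide this server's promotion rule and report it to orchestrator.
    import socket
    import urllib.request

    ORCHESTRATOR_API = "http://orchestrator.example.com:3000/api"  # hypothetical address

    def is_backup_server():
        # Hypothetical check, e.g. a role file dropped by configuration management.
        return False

    def has_strong_hardware():
        # Hypothetical check, e.g. CPU count or RAM thresholds.
        return True

    def decide_promotion_rule():
        if is_backup_server():
            return "must_not"     # never promote backup servers
        if has_strong_hardware():
            return "prefer"       # promote these if possible
        return "neutral"

    def register_promotion_rule(host, port=3306):
        rule = decide_promotion_rule()
        # The register-candidate API path is an assumption; check your orchestrator docs.
        url = "{}/register-candidate/{}/{}/{}".format(ORCHESTRATOR_API, host, port, rule)
        with urllib.request.urlopen(url) as response:
            return response.read()

    if __name__ == "__main__":
        register_promotion_rule(socket.getfqdn())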

What to respond to?

What is a scenario that kicks off a failover?

Refer to the missing piece post for the holistic approach orchestrator takes to make a reliable detection (a toy diagnosis sketch follows the list below). But regardless, do you want to take action when:

  • The master is completely dead and everyone sees that and agrees? (resounding yes)
  • The master is dead to the application but replication seems to be working? (master is at deadlock, but replication threads seem to be happy)
  • The master is half-dead to the application? (no new connections; old connections, including replication connections, keep on running!)
  • A DC is network partitioned: the master is alive, together with some replicas, in that DC; but the majority of the replicas are in other DCs, unable to see the master?
    • Is this a question of majority? Of DC preference? Is there at all an answer?
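
A toy Python sketch of the holistic idea (a simplification, not orchestrator's actual algorithm): combine our own view of the master with what its replicas report, rather than relying on a single ping.

    # Simplified diagnosis: can we see the master, and can its replicas?
    def diagnose(master_reachable, replica_io_running):
        """replica_io_running: one boolean per replica, taken from its IO thread status."""
        if master_reachable:
            return "master-alive"
        if replica_io_running and not any(replica_io_running):
            # We can't see the master and none of its replicas can either: dead master.
            return "dead-master"
        if any(replica_io_running):
            # Some replicas still stream from the master; we may be the partitioned side.
            return "unreachable-master: possible network partition"
        return "no-replicas: insufficient data"

    print(diagnose(False, [False, False, False]))  # dead-master
    print(diagnose(False, [True, False, True]))    # unreachable-master: possible network partition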

Upon promotion

Most people expect orchestrator to RESET SLAVE ALL and SET read_only=0 upon promotion. This is possible, but the default is not to do so. Why?

  • What if your promoted server still has unapplied relay logs? This can happen in the event that all replicas were lagging at the time of master failure. Do you prefer (a small decision sketch follows this list):
    • To promote, RESET SLAVE ALL and lose all those relay logs? You gain availability at the expense of losing data.
    • To wait till SQL_THREAD has consumed the logs? You keep your data at the expense of availability.
    • To abort? You let a human handle this; this is likely to take more time.
  • What do you do with delayed/slow replicas? It could take a while to connect them back to the promoted master. Do you prefer:
    • Wait for them to connect, delaying the promotion?
    • Advertise the new master, then asynchronously work to connect them? You may have improved availability, but at reduced capacity, which is an availability issue in itself.
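
Here is a sketch of the first decision above, assuming the candidate's SHOW SLAVE STATUS output has already been fetched into a dict (the field names are the standard ones); the policy names are made up for illustration.

    # Before finalizing promotion: are the candidate's relay logs fully applied?
    def relay_logs_drained(slave_status):
        return (slave_status["Relay_Master_Log_File"] == slave_status["Master_Log_File"]
                and slave_status["Exec_Master_Log_Pos"] >= slave_status["Read_Master_Log_Pos"])

    def finalize_promotion(slave_status, policy="wait"):
        if relay_logs_drained(slave_status) or policy == "lose-data":
            # Availability first: discard any unapplied relay logs and open for writes.
            return "RESET SLAVE ALL; SET GLOBAL read_only=0;"
        if policy == "wait":
            # Data first: let the SQL thread drain the relay logs, then re-check.
            return "START SLAVE SQL_THREAD; -- wait and re-check before finalizing"
        return "abort: hand this over to a human"

    status = {"Relay_Master_Log_File": "mysql-bin.000042", "Master_Log_File": "mysql-bin.000042",
              "Exec_Master_Log_Pos": 1000, "Read_Master_Log_Pos": 5400}
    print(finalize_promotion(status))  # the candidate is lagging: wait for the SQL thread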

Flapping

You wish to avoid flapping. A scenario could be that you're placing such load on your master that it crashes; the next server promoted as master will take on the same load and will similarly crash. You do not wish to exhaust your fleet.

What makes a reasonable anti-flapping rule? Some options (one of which is sketched after the list):

  • Upon any failure in a cluster, block any other failover on that cluster for X minutes
    • What happens if a major power-down issue requires two or three failovers on the same cluster?
  • Only block further master failovers, but allow intermediate master failovers as much as needed
    • There could be intermediate master exhaustion, as well
  • Only allow one of each (master and intermediate master), then block for X minutes?
  • Allow a burst of failovers for a duration of N seconds, then block for X minutes?
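
As a toy illustration, the "one of each, then block" policy above could look roughly like this (orchestrator's implementation differs; the 30 minute window is an arbitrary choice):

    # Allow one recovery per (cluster, failure type), then block that pair for X minutes.
    import time

    BLOCK_SECONDS = 30 * 60
    _last_recovery = {}  # (cluster, failure_type) -> timestamp of last recovery

    def may_recover(cluster, failure_type, now=None):
        now = time.time() if now is None else now
        last = _last_recovery.get((cluster, failure_type))
        if last is not None and now - last < BLOCK_SECONDS:
            return False  # still inside the anti-flapping window for this failure type
        _last_recovery[(cluster, failure_type)] = now
        return True

    print(may_recover("cluster-A", "master"))               # True
    print(may_recover("cluster-A", "master"))               # False: blocked
    print(may_recover("cluster-A", "intermediate-master"))  # True: separate budget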

Detection spam

You don't always take action on failure detection. Maybe you're blocked via anti-flapping on an earlier failure, or you have configured orchestrator not to fail over automatically.

Detection is the basis for failover, but is independent of it. You should have detection in place even if you're not failing over.

You wish to run detection continuously.

  • So if failover does not take place, detection will keep noticing the same problem again and again.
    • You get spammed by alerts.
  • Only detect once?
    • Been there. When you really need that detection to alert, you find out it alerted once, 6 hours ago, and you ignored it because it was not actionable at the time.
  • Only detect a change in diagnosis?
    • Been there. Diagnosis itself can flap back and forth. You get noise.
  • Block detection for X minutes?
    • What is a good tradeoff between noise and visibility?
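
One possible compromise, sketched below: keep detecting continuously, but only re-alert when the diagnosis changes or when a suppression window has passed. The 15 minute window is an arbitrary choice, not a recommendation.

    # Re-alert only on a changed diagnosis, or after the suppression window expires.
    import time

    SUPPRESS_SECONDS = 15 * 60
    _last_alert = {}  # cluster -> (diagnosis, timestamp of last alert)

    def should_alert(cluster, diagnosis, now=None):
        now = time.time() if now is None else now
        previous = _last_alert.get(cluster)
        if previous is not None:
            previous_diagnosis, previous_ts = previous
            if diagnosis == previous_diagnosis and now - previous_ts < SUPPRESS_SECONDS:
                return False  # same diagnosis, still inside the suppression window
        _last_alert[cluster] = (diagnosis, now)
        return True

    print(should_alert("cluster-A", "dead-master", now=0))     # True: first alert
    print(should_alert("cluster-A", "dead-master", now=60))    # False: suppressed
    print(should_alert("cluster-A", "dead-master", now=1000))  # True: window expired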

Back to life

An old subtree comes back to life. Scenario: a DC power failure.

This subtree claims "I'm cluster X". What happens?

  • Your infrastructure needs to have memory and understanding that there's already a running cluster called X.
  • Any failures within that subtree must not be interpreted as a "cluster X failure", or else you kick a cluster X failover when there is, in fact, a running cluster X. See this PR and related links. At this time orchestrator handles this scenario correctly.
  • When do you consider a previously-dead server to be alive and well?
    • I do mean automatically and reliably so. See the same PR for orchestrator's take.

User control

Failover tooling must always let a human decide something is broken. It must always allow for an urgent failover, even if nothing seems to be wrong to the system.

Isn't this just so broken? Isn't synchronous replication the answer?

The world is broken, and distributed systems are hard.

Synchronous replication is an answer, and solves (I think) many of the above issues while creating its own, but I'm not an expert on that.

However, it is noteworthy that when some people think about synchronous replication, they forget about cross-DC replication and cross-DC failovers, about upgrades and experiments. The moment you put intermediate masters into play, you're almost back to square one, with many of the above questions again applicable to your use case.

 