What's so complicated about a master failover?

June 29, 2017

The more I work on orchestrator, the more user input I get and the more production experience I gain, the more insights I get into MySQL master recoveries. I'd like to share the complexities of correctly running general-purpose master failovers, from picking the right candidates to finalizing the promotion.

The TL;DR is: we're often unaware of just how things can turn at the time of failover, and the impact of every single decision we make. Different environments have different requirements, and different users wish to have different policies. Understanding the scenarios can help you make the right choice.

The scenarios and considerations below are ones I picked while browsing through the orchestrator code and through Issues and questions. There are more. There are always more scenarios.

I discuss "normal replication" scenarios below; some of these also apply to synchronous replication setups (Galera, XtraDB Cluster, InnoDB Cluster) when replicating across DCs, when using intermediate masters, or when working in an evolving environment.

orchestrator-wise, please refer to an earlier post: "MySQL High Availability tools" followup, the missing piece: orchestrator. Some notions from that post are reiterated here.

Who to promote?

This is largely covered by the missing piece post (skip this section if you've read it). Consider the following:

  • You run with a mixed-version setup. Your master is 5.6, most of your replicas are 5.6 but you've upgraded a couple of replicas to 5.7. You must not promote those 5.7 servers since you cannot replicate 5.7->5.6. You may lose these servers upon failover.
  • But perhaps by now you've upgraded most of your replicas to 5.7, in which case you prefer to promote a 5.7 server in the event the alternative is losing your 5.7 fleet.
  • You run with both STATEMENT based replication and ROW based. You must not promote a ROW based replica because a STATEMENT based server cannot replicate from it. You may lose ROW servers during the failover.
  • But perhaps by now you've upgraded most of your replicas to ROW, in which case you prefer to promote a ROW server in the event the alternative is losing your ROW fleet.
  • Some servers can't be promoted because they don't use binary logging or log_slave_updates. They could be lost in action.
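
The compatibility constraints above can be sketched as a simple filter. This is a minimal illustration and not orchestrator's actual logic: the `Replica` class, its fields, and the functions `can_replicate_from`, `losses_if_promoted` and `pick_candidate` are hypothetical names invented for this example. It only models version, binlog format and binary logging settings, and it ignores replication positions entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    # Hypothetical model of a replica; field names are assumptions for this sketch.
    name: str
    version: tuple          # e.g. (5, 6) or (5, 7)
    binlog_format: str      # "STATEMENT" or "ROW"
    log_bin: bool = True
    log_slave_updates: bool = True


def can_be_promoted(r):
    # A server without binary logging or log_slave_updates cannot feed other replicas.
    return r.log_bin and r.log_slave_updates


def can_replicate_from(replica, master):
    # Assumptions taken from the list above: a replica cannot replicate from a
    # newer version, and a STATEMENT based server cannot replicate from a ROW one.
    if replica.version < master.version:
        return False
    if replica.binlog_format == "STATEMENT" and master.binlog_format == "ROW":
        return False
    return True


def losses_if_promoted(candidate, replicas):
    # Names of the replicas that could not be repointed under the candidate.
    return [r.name for r in replicas
            if r is not candidate and not can_replicate_from(r, candidate)]


def pick_candidate(replicas):
    # Choose the promotable server that loses the fewest siblings.
    promotable = [r for r in replicas if can_be_promoted(r)]
    if not promotable:
        return None
    return min(promotable, key=lambda r: len(losses_if_promoted(r, replicas)))


fleet = [
    Replica("r1", (5, 6), "STATEMENT"),
    Replica("r2", (5, 6), "STATEMENT"),
    Replica("r3", (5, 7), "STATEMENT"),
    Replica("r4", (5, 6), "STATEMENT", log_slave_updates=False),
]
best = pick_candidate(fleet)
print(best.name, losses_if_promoted(best, fleet))
```

In this made-up fleet, promoting the 5.7 server `r3` would orphan every 5.6 replica, and `r4` is never considered because it does not log replicated updates. A real decision would also weigh how far each replica has progressed.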

It is noteworthy that MHA solves the above by syncing relay logs across the replicas. I made an attempt at doing the same for orchestrator but was unsatisfied with the results and am wary of hidden assumptions. I do not expect to continue working on that.

Intermediate masters

Recovery of intermediate masters, while simpler, also brings many more questions to the table.

  • An intermediate master crashes. What is the correct handling? Should you move all of its orphaned replicas under another server? Or does this group of replicas form some pact that must stick together? Perhaps you insist on promoting one of them on top of its siblings.
  • On failure, you are likely to prefer promoting a server from the same DC; this will have the least impact on your application.
    • But are you willing to lose a server or two to make that so? Or do you prefer switching to a different DC and not lose any server in the process?
    • A specific, large orchestrator user actually wants to failover to a different DC. Not only that, the customer then prefers flipping all other cluster masters to the other DC (a full-DC failover)
  • An intermediate master had replication filters (scenario: you were working to extract and refactor a subtree onto its own cluster, but the IM crashed before you did so)
    • What do you do? Did you have all subsequent replicas run with same filters?
    • If not, do you have the playbook to do so at time of failure?
  • An intermediate master was writable. It crashed. What do you do? Who is a legitimate replacement? How can you even reconnect the subtree with the main tree?

Candidates

  • Do you prefer some servers over others? Some servers have stronger hardware; you'd like them to be promoted if possible
    • orchestrator can juggle with that to some extent.
  • Are there servers you never want to promote? Servers used by developers; used for backups with open logical volumes; weaker hardware; a DC you don't want to failover to; ...
    • But then again, maybe it's fine if those servers act as intermediate masters? So that they must not be promoted as masters, but are good to participate in an intermediate master failover?
  • How do you even define the promotion types for those servers?
    • We strongly prefer to do this live. Service discovery dictates the type of "promotion rule"; a recurring cronjob keeps updating orchestrator with the server's choice of promotion rule.
    • We strongly discourage configuration based rules, unless for servers which are obviously-never-promote.
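
One way to picture "live" promotion rules is a registry that each server keeps refreshing, and that forgets reports which have gone stale. The sketch below is an assumption-laden illustration, not orchestrator's API: `PromotionRule`, `RuleRegistry`, the rule names and the expiry (`ttl_seconds`) are all hypothetical.

```python
import time
from enum import IntEnum


class PromotionRule(IntEnum):
    # Hypothetical rule names for this sketch; lower value means more preferred.
    PREFER = 0
    NEUTRAL = 1
    PREFER_NOT = 2
    MUST_NOT = 3


class RuleRegistry:
    """Keeps the last rule each server reported, and forgets stale reports."""

    def __init__(self, ttl_seconds=3600):
        self.ttl_seconds = ttl_seconds
        self._rules = {}  # host -> (rule, reported_at)

    def register(self, host, rule, now=None):
        # Meant to be called periodically (e.g. from cron) on behalf of each server.
        self._rules[host] = (rule, time.time() if now is None else now)

    def rule_for(self, host, now=None):
        now = time.time() if now is None else now
        entry = self._rules.get(host)
        if entry is None or now - entry[1] > self.ttl_seconds:
            return PromotionRule.NEUTRAL  # no fresh report: no opinion
        return entry[0]

    def ordered_candidates(self, hosts, now=None):
        # Drop servers that must never be promoted, then sort by preference.
        # sorted() is stable, so ties keep their input order.
        allowed = [h for h in hosts
                   if self.rule_for(h, now) != PromotionRule.MUST_NOT]
        return sorted(allowed, key=lambda h: self.rule_for(h, now))


registry = RuleRegistry(ttl_seconds=3600)
registry.register("db-big", PromotionRule.PREFER, now=0)
registry.register("db-backup", PromotionRule.MUST_NOT, now=0)
registry.register("db-dev", PromotionRule.PREFER_NOT, now=0)
hosts = ["db-dev", "db-backup", "db-plain", "db-big"]
print(registry.ordered_candidates(hosts, now=60))
```

Note the trade-off in this design: once a report expires, the server silently falls back to `NEUTRAL`. That is fine for preferences, but it is a reason to pin obviously-never-promote servers in static configuration rather than rely on a cronjob that may stop running.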

What to respond to?

What is a scenario that kicks a failover?

Refer to the missing piece post for the holistic approach orchestrator takes to make a reliable detection. But regardless, do you want to take action when:

  • The master is completely dead and everyone sees that and agrees? (resounding yes)
  • The master is dead to the application but replication seems to be working? (master is at deadlock, but replication threads seem to be happy)
  • The master is half-dead to the application? (no new connections; old connections, including replication connections, keep on running!)
  • A DC is network partitioned, the master is alive with some replicas in that DC; but the majority of the replicas are in other DCs, unable to see the master?
    • Is this a question of majority? Of DC preference? Is there at all an answer?

Upon promotion

Most people expect orchestrator to RESET SLAVE ALL and SET read_only=0 upon promotion. This is possible, but the default is not to do so. Why?

  • What if your promoted server still has unapplied relay logs? This can happen in the event all replicas were lagging at the time of master failure. Do you prefer:
    • To promote, RESET SLAVE ALL and lose all those relay logs? You gain availability at the expense of losing data.
    • To wait till SQL_THREAD has consumed the logs? You keep your data at the expense of availability.
    • To abort? You let a human handle this; this is likely to take more time.
  • What do you do with delayed/slow replicas? It could take a while to connect them back to the promoted master. Do you prefer:
    • Waiting for them to connect; delay promotion
    • Advertise new master, then asynchronously work to connect them: you may have improved availability, but at reduced capacity, which is an availability issue in itself.
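
The choice about unapplied relay logs can be made explicit as a policy. The following is a minimal sketch under stated assumptions: `RelayLogPolicy` and `plan_promotion` are hypothetical names, and the function only returns a list of step descriptions; it does not talk to MySQL.

```python
from enum import Enum


class RelayLogPolicy(Enum):
    # Hypothetical policy names for this sketch.
    PROMOTE_NOW = "promote_now"    # favor availability, accept losing data
    WAIT_FOR_SQL_THREAD = "wait"   # favor data, accept a longer outage
    ABORT = "abort"                # hand over to a human


def plan_promotion(has_unapplied_relay_logs, policy):
    """Return the ordered list of steps to take on the promoted server."""
    finalize = ["RESET SLAVE ALL", "SET GLOBAL read_only = 0"]
    if not has_unapplied_relay_logs:
        return finalize
    if policy is RelayLogPolicy.PROMOTE_NOW:
        # Unapplied relay log events are discarded by RESET SLAVE ALL.
        return finalize
    if policy is RelayLogPolicy.WAIT_FOR_SQL_THREAD:
        return ["wait until the SQL thread has applied all relay logs"] + finalize
    return ["abort and alert a human"]


print(plan_promotion(True, RelayLogPolicy.WAIT_FOR_SQL_THREAD))
```

Writing the policy down this way forces the trade-off to be decided ahead of time rather than at the moment of failure, which is the point of the questions above.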

Flapping

You wish to avoid flapping. A scenario could be that you're placing such load on your master that it crashes; the next server to promote as master will have the same load, will similarly crash. You do not wish to exhaust your fleet.

What makes a reasonable anti-flapping rule? Options:

  • Upon any failure in a cluster, block any other failover on that cluster for X minutes
    • What happens if a major power down issue requires two or three failovers on the same cluster?
  • Only block further master failovers, but allow intermediate master failovers as much as needed
    • There could be intermediate master exhaustion, as well
  • Only allow one of each (master and intermediate master), then block for X minutes?
  • Allow a burst of failovers for a duration of N seconds, then block for X minutes?
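
The first and last options can be sketched with one small class. This is an illustration only; `AntiFlapping`, `block_seconds` and `burst_seconds` are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic. With `burst_seconds=0` it behaves like "block for X minutes after any failover"; with a positive value it allows a burst first.

```python
class AntiFlapping:
    """Allow failovers during a burst window, then block for a while."""

    def __init__(self, block_seconds, burst_seconds=0):
        self.block_seconds = block_seconds
        self.burst_seconds = burst_seconds
        self._window_start = {}  # cluster -> time of first failover in the window

    def may_failover(self, cluster, now):
        start = self._window_start.get(cluster)
        if start is None:
            return True  # no recent failover on this cluster
        elapsed = now - start
        if elapsed < self.burst_seconds:
            return True  # still inside the burst window
        # Blocked until the burst window plus the block period have passed.
        return elapsed >= self.burst_seconds + self.block_seconds

    def record_failover(self, cluster, now):
        start = self._window_start.get(cluster)
        window = self.burst_seconds + self.block_seconds
        if start is None or now - start >= window:
            self._window_start[cluster] = now  # open a new window


rule = AntiFlapping(block_seconds=600, burst_seconds=30)
rule.record_failover("cluster-x", now=0)
print(rule.may_failover("cluster-x", now=10))   # inside the burst window
print(rule.may_failover("cluster-x", now=120))  # inside the block period
```

The state is kept per cluster, so a failover on one cluster never blocks another. Distinguishing master from intermediate master failovers, as in the other options, would need a richer key than the cluster name.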

Detection spam

You don't always take action on failure detection. Maybe you're blocked via anti-flapping on an earlier failure, or you have chosen not to fail over automatically.

Detection is the basis for failover, but is independent of it. You should have detection in place even if not failing over.

You wish to run detection continuously.

  • So if failover does not take place, detection will keep noticing the same problem again and again.
    • You get spammed by alerts
  • Only detect once?
    • Been there. When you really need that detection to alert, you find out it alerted once 6 hours ago and you ignored it because it was not actionable at the time.
  • Only detect a change in diagnosis?
    • Been there. Diagnosis itself can flap back and forth. You get noise.
  • Block detection for X minutes?
    • What is a good tradeoff between noise and visibility?
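
As one possible middle ground, the sketch below keeps detection running continuously but alerts at most once per quiet period for each (host, diagnosis) pair. `DetectionThrottle` and `quiet_seconds` are hypothetical names, and the diagnosis strings are illustrative labels; this does not claim to be how orchestrator handles it.

```python
class DetectionThrottle:
    """Alert at most once per quiet period for each (host, diagnosis) pair."""

    def __init__(self, quiet_seconds):
        self.quiet_seconds = quiet_seconds
        self._last_alert = {}  # (host, diagnosis) -> time of last alert

    def should_alert(self, host, diagnosis, now):
        key = (host, diagnosis)
        last = self._last_alert.get(key)
        if last is not None and now - last < self.quiet_seconds:
            return False  # same problem, alerted recently: stay quiet
        self._last_alert[key] = now
        return True


throttle = DetectionThrottle(quiet_seconds=300)
print(throttle.should_alert("master-1", "DeadMaster", now=0))
print(throttle.should_alert("master-1", "DeadMaster", now=60))
print(throttle.should_alert("master-1", "UnreachableMaster", now=70))
```

Because the key includes the diagnosis, a diagnosis that flaps back and forth within the quiet period produces one alert per distinct diagnosis rather than one per flip, while a problem that persists is still re-announced after every quiet period. Choosing `quiet_seconds` remains the trade-off between noise and visibility.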

Back to life

An old subtree comes back to life. Scenario: DC power failure.

This subtree claims "I'm cluster X". What happens?

  • Your infrastructure needs to have memory and understanding that there's already a running cluster called X.
  • Any failures within that subtree must not be interpreted as a "cluster X failure", or else you kick a cluster X failover when there is, in fact, a running cluster X. See this PR and related links. At this time orchestrator handles this scenario correctly.
  • When do you consider a previously-dead server to be alive and well?
    • I do mean, automatically and reliably so? See same PR for orchestrator's take.

User control

Failover tooling must always let a human decide something is broken. It must always allow for an urgent failover, even if nothing seems to be wrong to the system.

Isn't this just so broken? Isn't synchronous replication the answer?

The world is broken, and distributed systems are hard.

Synchronous replication is an answer, and solves many (I think) of the above issues while creating issues of its own, but I'm not an expert on that.

However, it is noteworthy that when some people think about synchronous replication they forget about cross-DC replication and cross-DC failovers, about upgrades and experiments. The moment you put intermediate masters in play, you're almost back to square one, with many of the above questions again applicable to your use case.

  • Rick James

    What about "split brain"?

  • Shlomi Noach

    Hi Rick,
    I hinted at split brain with "A DC is network partitioned..." ; An orchestrator PR to handle split brain is https://github.com/github/orchestrator/pull/183

  • Dane Miller

    I'm skeptical that automation makes sense for some of the mixed replication setups you describe. Mix of 5.6/5.7, mix of statement/rbr. Does it make sense to make failover decisions based on majority/quorum concepts for mysql async replication? I tend to think not, but I suspect I haven't seen nearly the number of failovers as you.

    My preference, and how we operate at my site, is to require homogenous mysql clusters by default. Any mysql version or replication discrepancies trigger alerts. That doesn't mean you can't run with a mix of hosts/versions/settings, it just means we must explicitly pick a homogenous set of hosts that will participate in automated failover. The rest will be orphaned after a failover, and require manual cleanup.

    Just curious, have you found benefits to automating failover for this genre of complex mixed configuration mysql replication topologies? Could you share some examples? Have you experienced any costs associated with building this complexity into orchestrator?

    As always, thanks for all your work and posts on orchestrator, ghost, and mysql. I'm a big fan!
    Dane

  • Shlomi Noach

    @Dane,

    Over the course of the past year, we have migrated our MySQL clusters from statement based replication to row based replication, and from 5.6 to 5.7.
    These migrations were gradual. The migration from 5.6 to 5.7 took months of experiments, measurements and testing.
    We slowly added more and more servers, and when we were convinced this would work out we did a marathon of upgrades for the remaining servers.
    orchestrator allows us to concentrate on what we want to do without having to worry about having each upgrade hard coded in some config file to indicate whether the server should or should not participate in a failover path.
    We had a couple cases where we downgraded servers from 5.7 back to 5.6 because we were getting inconsistent metrics. To require each such upgrade/downgrade to be manually followed by a config change would have been quite the hassle.

    The thing is, we're really not a big shop for MySQL. At a little over 100 servers we're a busy, but not huge shop.
    Previously I worked at Booking.com with X,000 MySQL servers. The mere thought of having anything manual at the single server level was excruciating. We had to have as much automation as possible "or else".

    5.6-5.7, sbr-rbr, special roles (delayed, backup, dev)... those are all real world scenarios.

    Your preference is to require homogenous mysql clusters by default. "By default" means a time where "nothing special happens" - and that's exactly where I see the difference. We almost always have something special happening.

    The costs associated with building this complexity into orchestrator are: development time ; code complexity ; behavior complexity.

    The failover logic is not "beautiful code". It is filled with conditions and use cases that make sense in reality. Some scenarios still challenge the orchestrator logic and we need to adapt.

    I can share a public statement by Booking.com that they experience a failover every day, and orchestrator does the right thing every day (and sometimes it does not do the best thing, and I've seen some crazy scenarios there).
    The assumptions made before orchestrator was in the game were limiting to the deployments, to the hardware, to outage time.

    With orchestrator, and this isn't a sales pitch, you are more free to choose your preferred topology/technology/setup and just assume orchestrator will deal with it.

    Anyway those are my thoughts for the day. Reading the above it really sounds like marketing. Not everything is great & wonderful & pretty, but I think the dynamic approach (as opposed to configuration based approach) is a huge step and a relief to engineers.

    Thank you

 