"MySQL High Availability tools" followup, the missing piece: orchestrator

April 6, 2017

I read with interest MySQL High Availability tools - Comparing MHA, MRM and ClusterControl by SeveralNines. I thought there was a missing piece in the comparison: orchestrator, and that as a result the comparison was missing scope and context.

I'd like to add my thoughts on topics addressed in the post. I'm by no means an expert on MHA, MRM or ClusterControl, and will mostly focus on how orchestrator tackles high availability issues raised in the post.

What this is

This is to add insights on the complexity of failovers. Over the past three years, every time I thought I'd seen it all, I got hit by yet another crazy scenario. Doing the right thing automatically is difficult.

In this post, I'm not trying to convince you to use orchestrator (though I'd be happy if you did). To be very clear, I'm not claiming it is better than any other tool. As always, each tool has pros and cons.

This post does not claim other tools are not good. Nor that orchestrator has all the answers. At the end of the day, pick the solution that works best for you. I'm happy to use a solution that reliably solves 99% of the cases as opposed to an unreliable solution that claims to solve 99.99% of the cases.

Quick background

orchestrator is actively maintained by GitHub. It manages automated failovers at GitHub. It manages automated failovers at Booking.com, one of the largest MySQL setups on this planet. It manages automated failovers as part of Vitess. These are some names I'm free to disclose, and browsing the issues shows a few more users running failovers in production. Otherwise, it is used for topology management and visualization in a large number of companies such as Square, Etsy, Sendgrid, Godaddy and more.

Let's now follow one-by-one the observations on the SeveralNines post.

Flapping

orchestrator supports an anti-flapping mechanism. Once an automated failover kicks off, no additional automated failover will run on the same cluster for the duration of a pre-configured RecoveryPeriodBlockSeconds. We, for example, set that time to 1 hour.

However, orchestrator also supports acknowledgements. A human (or a bot, for that matter, accessing the API or just running the command line) can acknowledge a failover. Once a failover is acknowledged, the block is removed. The next incident requiring a failover is free to proceed.

Moreover, a human is always allowed to forcibly invoke a failover (e.g. via orchestrator -c graceful-master-takeover or orchestrator -c force-master-takeover). In such a case, any blocking is ignored and orchestrator immediately kicks off the failover sequence.
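To make the three rules above concrete (time-based block, acknowledgement, human override), here is a minimal Python sketch. It is a toy model and not orchestrator's code: the FailoverGate class and its method names are made up for illustration, and only block_seconds corresponds to something real (RecoveryPeriodBlockSeconds).

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FailoverGate:
    """Toy model of per-cluster anti-flapping; names are hypothetical."""
    block_seconds: int  # plays the role of RecoveryPeriodBlockSeconds
    last_failover: Dict[str, float] = field(default_factory=dict)

    def may_failover(self, cluster: str, now: float, forced: bool = False) -> bool:
        # A human-forced takeover ignores any blocking.
        if forced:
            return True
        started = self.last_failover.get(cluster)
        # Automated failover is blocked while a recent, unacknowledged one exists.
        return started is None or now - started >= self.block_seconds

    def record_failover(self, cluster: str, now: float) -> None:
        self.last_failover[cluster] = now

    def acknowledge(self, cluster: str) -> None:
        # Acknowledging a failover removes the block for that cluster.
        self.last_failover.pop(cluster, None)


gate = FailoverGate(block_seconds=3600)  # one hour, as in the text
gate.record_failover("main", now=0)
print(gate.may_failover("main", now=600))               # False: still blocked
print(gate.may_failover("main", now=600, forced=True))  # True: human override
gate.acknowledge("main")
print(gate.may_failover("main", now=600))               # True: block removed
```

Note that the block is per cluster: a failover on one cluster does not hold back another cluster.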

Lost transactions

orchestrator does not pull binlog data from the failed master and only works with the data available on the replicas. There is a potential for data loss.

Note: There was actually some work on an MHA-like syncing of relay logs, and in fact most of it is available right now in orchestrator, syncing relay logs via remote SSH and without agents. See https://github.com/github/orchestrator/issues/45 for some pointers. We looked into this for a couple of months but saw some dangers and issues, such as non-atomic relay log entries in RBR, among others. We chose to put this on hold, and I advise against using this functionality in orchestrator.

Lost transactions, notes on semi-sync

Vitess have contributed a semi-sync failover mechanism (discussion). So you are using semi-sync? That's wonderful. You've failed over and have greatly reduced the amount of lost data. What happens with your new setup? Will your recovery re-apply semi-sync? Vitess's contribution does just that and makes sure a new replica takes the semi-sync role.

Lost transactions, notes on "most up to date replica"

There is this fallacy. I have witnessed it proven to be a fallacy many times in production, and I'm sure it has also happened elsewhere in the universe, where I haven't witnessed it.

The fallacy says: "when master fails, we will promote the most up-to-date replica". The most up-to-date may well be the wrong choice. You may wish to skip and lose the most up-to-date replica. The assumption is a fallacy, because many times our production environment is not sterile.

Consider:

  • You're running 5.6 and wish to experiment/upgrade to 5.7. You will likely upgrade a replica or two, measure their replication capacity, their query latency, etc. You now have a non-sterile topology. When the master fails, you must not promote your 5.7 replica because you will not be able to (or wouldn't want to take the chance to) replicate from that 5.7 onto the rest of your fleet, which is 5.6.
    • We have been looking into a 5.7 upgrade for a few months now. Most careful places I know likewise take months to run such an upgrade. These are months where your topology is not sterile.

So you'd rather lose the most up-to-date replica than lose the other 10 replicas you have.

  • Likewise, you run with STATEMENT based replication. You wish to experiment with ROW based replication. Your topology is not sterile. You must not promote your ROW replica, because the rest of the replicas (assuming log_slave_updates) would not be able to replicate.

orchestrator understands all the above and makes the right call. It promotes the replica that ensures the survival of your fleet, and prefers losing some data over losing the fleet.

  • You have a special replica with replication filters. More often than desired, this happens. Perhaps you're splitting out a new sub-cluster, functionally partitioning away some tables. Or you're running some exporter to Hadoop and only filter specific events. Or... You must not promote that replica.

orchestrator would not promote that replica.

Let's look at something crazy now:

  • Getting back to the 5.7 upgrade, you have now upgraded 8 out of your 10 replicas to 5.7. You now do want to promote the 5.7 replica. It's a matter of how many replicas you'd lose if you promoted _this one_ as opposed to _that one_.

orchestrator makes that calculation. Hey, the same applies for STATEMENT and ROW based replicas. Or maybe you have both 5.7 and RBR experiments (tsk tsk tsk, but life is challenging) at the same time. orchestrator will still pick the replica whose promotion will keep the majority of your fleet intact.
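Here is one toy reading of "keep the majority of your fleet intact", as a Python sketch. It is my simplification for illustration and not orchestrator's actual algorithm: the Replica type, the position counter and the pick_candidate function are all hypothetical. The rule it applies is: never promote a replica with replication filters, prefer the most common (version, binlog format) flavor in the fleet, and within that flavor take the most up-to-date replica.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Replica:
    name: str
    version: Tuple[int, int]   # major version, e.g. (5, 6) or (5, 7)
    binlog_format: str         # "STATEMENT" or "ROW"
    position: int              # hypothetical "how far has it applied" counter
    has_filters: bool = False


def pick_candidate(replicas: List[Replica]) -> Optional[Replica]:
    # Replicas with replication filters are never promoted.
    eligible = [r for r in replicas if not r.has_filters]
    # Walk the (version, binlog_format) flavors from most to least common.
    flavors = Counter((r.version, r.binlog_format) for r in replicas)
    for flavor, _count in flavors.most_common():
        group = [r for r in eligible if (r.version, r.binlog_format) == flavor]
        if group:
            # Within the chosen flavor, lose as little data as possible.
            return max(group, key=lambda r: r.position)
    return None


def fleet(n_57: int, n_56: int) -> List[Replica]:
    # The 5.7 replicas are deliberately the most up to date.
    new = [Replica(f"r57-{i}", (5, 7), "STATEMENT", position=100) for i in range(n_57)]
    old = [Replica(f"r56-{i}", (5, 6), "STATEMENT", position=90) for i in range(n_56)]
    return new + old


print(pick_candidate(fleet(2, 8)).version)  # (5, 6): skip the most up-to-date replica
print(pick_candidate(fleet(8, 2)).version)  # (5, 7): now the 5.7 replica wins
```

The same input shape covers the STATEMENT/ROW case, since the binlog format is part of the flavor.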

  • The most up-to-date replica is on a different data center.

orchestrator can promote it, but prefers not to. But then, it also supports a two-step promotion. If possible, it would first promote that most up-to-date replica from the other DC, then let local DC replicas catch up, then reverse replication and place a local replica on top, making the master on local DC again.

I cannot stress enough how useful and important this is.

This conveniently leads us to...

Roles

Fallacy: using configuration to white-list or black-list servers.

You'd expect to overcome the above 5.7 or ROW etc. issues by carefully marking your servers as blacklisted.

We've been there. This works on a setup with 10 servers. I claim this doesn't scale.

Again, I wish to be clear: if this works for you, great! Ignore the rest of this section. I suggest that as you grow, this becomes an increasingly difficult problem. At the scale of 100 servers, it's a serious pain. Examples:

  • You set global binlog_format='ROW'. Do you then immediately follow up to reconfigure your HA service to blacklist your server?
  • You provision a new box. Do you need to go and reconfigure your HA service, adding the box as white-listed?
  • You need to run maintenance on a server. Do you rush to reconfigure HA service?

You can invest in automating the reconfiguration of servers; I've been there as well, to some extent. This is a difficult task on its own.

  • You'd use service discovery to assign tags to hosts
  • And then puppet would distribute configuration.
    • How fast would it do that?
  • I'm not aware of any of the mentioned solutions being reconfigurable by consul, and am happy to learn otherwise. You need to do the job yourself.

Also consider how flexible your tools are: suppose you reconfigure; how easy is it for your tools to reload and pick up the new config?

  • orchestrator does its best, and reloads most configuration live, but even orchestrator has some "read-only" variables.

How easy is it to restart your HA services? Can you restart them in a rolling fashion, such that at any given point you have HA up and running?

  • This is well supported by orchestrator

orchestrator recognizes those crazy 5.7-ROW-replication-filters topologies by magic. Well, not magic. It just observes the state of the topologies. It learns the state of the topologies, so that at time of crash it has all the info. Its logic is based on that info. It knows and understands replication rules. It computes a good promotion path, taking data centers and configurations into consideration.

Initially, orchestrator started with blacklists in configuration. They're still supported. But today it's more about labels.

  • One of your replicas serves as a backup server. It runs extra tasks, or is configured differently (no log_slave_updates? Different buffer pool settings?) and is not good to serve as a master. You should not promote that replica.

Instead of saying "this server is a backup server and should never be promoted" in configuration (and that's possible to do!), orchestrator lets you dynamically announce that this server should not be promoted. Maybe 10 minutes ago this wasn't a backup server, but now it is. You can advertise that fact to orchestrator. We do that via:

orchestrator -c register-candidate -i {::fqdn} --promotion-rule=${promotion_rule}

where ${promotion_rule} is candidate, neutral or must_not. We run this from cron, every couple of minutes. How we choose the right rule comes from our service discovery. At worst, orchestrator's view of role changes is a couple of minutes old (and urgent role changes get propagated immediately).

Also, each host can self-declare its role, so that orchestrator discovers the role on discovery, as per DetectPromotionRuleQuery. Vitess is known to use this approach, and they contributed the code.
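As an illustration of the cron job described above, here is a minimal Python sketch that maps a role to a promotion rule and builds the command line. The role names and the ROLE_TO_RULE mapping are made up (your service discovery will have its own vocabulary), and falling back to neutral for unknown roles is an assumption of this sketch; only the register-candidate command and the three rule values come from the text.

```python
import shlex
import socket
from typing import List

# Hypothetical mapping from a service-discovery role to a promotion rule.
# The role names are invented; the three rule values come from the text.
ROLE_TO_RULE = {
    "primary-candidate": "candidate",
    "replica": "neutral",
    "backup": "must_not",
}


def register_candidate_command(fqdn: str, role: str) -> List[str]:
    # Unknown roles fall back to "neutral" (an assumption of this sketch).
    rule = ROLE_TO_RULE.get(role, "neutral")
    return [
        "orchestrator", "-c", "register-candidate",
        "-i", fqdn,
        f"--promotion-rule={rule}",
    ]


cmd = register_candidate_command(socket.getfqdn(), "backup")
print(shlex.join(cmd))
# A cron job could then execute it with subprocess.run(cmd, check=True).
```

Building the argument list separately from running it keeps the mapping easy to test without touching a live orchestrator.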

Network partitioning

First, about general failure detection.

orchestrator uses a holistic approach to detecting failures, and instead of justifying that it is a really good approach, I will testify that it is. orchestrator's accuracy in recognizing a failover scenario is very high. Booking.com has X,000 MySQL servers in production. At that scale, servers and networks break enough that there's always something breaking. Quoting (with permission) Simon J. Mudd from Booking.com:

We have at least one failure a day which it handles automatically. that’s really neat. people don’t care about dead masters any more.

orchestrator looks not only at the master, but at its replicas. If orchestrator can't see the master, but the replicas are happily running and replicating, then it's just orchestrator who can't see the master. But if orchestrator doesn't see the master, and all the replicas are broken, then there's a failover scenario.
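The two cases above boil down to a small decision function. This is a minimal Python sketch under my own simplifications, not orchestrator's detection code: the diagnose function and its return strings are hypothetical, and treating a mixed picture (some replicas fine, some broken) as inconclusive is an assumption of the sketch.

```python
from typing import Iterable


def diagnose(master_reachable: bool, replicas_replicating: Iterable[bool]) -> str:
    states = list(replicas_replicating)
    if master_reachable:
        return "master is fine"
    if states and not any(states):
        # The observer and every replica agree: nobody can reach the master.
        return "dead master: failover"
    if states and all(states):
        # Replicas still replicate, so only the observer lost sight of the master.
        return "observer cannot see master: no failover"
    # Mixed or missing evidence: do not act automatically (sketch assumption).
    return "inconclusive: no failover"


print(diagnose(False, [True, True, True]))     # observer cannot see master: no failover
print(diagnose(False, [False, False, False]))  # dead master: failover
```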

With ClusterControl, there seems to be danger of false positives: "...therefore it can take an action if there are network issues between the master and the ClusterControl host."

As for fencing:

MHA uses multiple nodes for detecting the failure. Especially if these nodes run on different DCs, this gives MHA a better view when fencing is involved.

Last September we gathered at an improvised BoF session on Percona Live Amsterdam. We were looking at observing a failure scenario where fencing is involved.

We mostly converged onto having multiple observers, such as 3 observers on 3 different data centers, a quorum of which would decide if there's a failure scenario or not.
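The quorum idea itself is easy to sketch in Python; the function name and the shape of the votes are hypothetical, and this is not something orchestrator implements as of this writing. The hard part is how each observer arrives at its vote, which this sketch does not address.

```python
from typing import Dict


def quorum_says_failure(votes: Dict[str, bool]) -> bool:
    # votes maps an observer (e.g. one per data center) to "I see a failure".
    # A strict majority of observers must agree before a failover is considered.
    return sum(votes.values()) > len(votes) // 2


print(quorum_says_failure({"dc1": True, "dc2": True, "dc3": False}))   # True
print(quorum_says_failure({"dc1": True, "dc2": False, "dc3": False}))  # False
```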

The problem is difficult. If the master is seen by 2 nodes, but all of its replicas are broken, does that make a failure scenario or not? What if a couple replicas are happy but ten others are not?

orchestrator has it on the roadmap to run a quorum based failover decision. My estimation is that it will do the right thing most times and that it would be very difficult to push towards 99.99%.

Integration

orchestrator provides an HTTP API as well as a command line interface. We use those at GitHub, for example, to integrate orchestrator into our chatops. We command orchestrator via chat and get information via chat.

We and others also have automated, scheduled jobs that use the orchestrator command line to rearrange topologies.

There is no direct integration between orchestrator and other tools. Recently there have been requests to integrate orchestrator with ProxySQL. I see multiple use cases for that. Plug: in the upcoming Percona Live conference in Santa Clara, René Cannaò and I will co-present a BoF on potential ProxySQL-orchestrator integration. It will be an open discussion. Please come and share your thoughts! (The talk hasn't been scheduled yet; I will update this post with the link once it's scheduled.)

Addressing the issue of read_only as an indicator to master/replica roles, please see the discussion on https://github.com/sysown/proxysql/issues/789. Hint: this isn't trivial and many times not reliable. It can work in some cases.

Conclusion

Is it time already? I have much to add, but let's stay focused on the SeveralNines blog. Addressing the comparison chart at the "Conclusion" section:

Replication support

orchestrator supports failovers on:

  • Oracle GTID
  • MariaDB GTID
  • Pseudo-GTID
  • Semi-sync
  • Binlog servers

GTID support is ongoing (recent example). Traditionally orchestrator is very focused on Pseudo-GTID.

I should add: Pseudo-GTID is a thing. It runs, it runs well. Pseudo-GTID provides almost all the Oracle GTID advantages without actually using GTID, and without GTID limitations. The obvious disclaimer is that Pseudo-GTID has its own limitations, too. But I will leave the Pseudo-GTID preaching for another time, and just note that GitHub and Booking.com both run automated failovers based on Pseudo-GTID.

Flapping

Time based blocking per cluster, acknowledgements supported; human overrides permitted.

Lost transactions

No checking for transactions on master

Network Partitioning

Excellent false positive detection. Seems like other tools like MRM and ClusterControl are following up by adopting the orchestrator approach, and I'm happy for that.

Roles

State-based automatic detection and decision making; also dynamic roles via advertising; also support for self-declaring roles; also support for configuration based black lists.

Integration

HTTP API, command line interfaces. No direct integration with other products. Looking into ProxySQL with no promises held.

Further

If you use non-GTID replication, MHA is the only option for you

That is incorrect. orchestrator is an excellent choice in my biased opinion. GitHub and Booking.com both run orchestrator automated failovers without using GTIDs.

Only ClusterControl ... is flexible enough to handle both types of GTID under one tool

That is incorrect. orchestrator supports both types of GTID. I'm further working towards better and better support of Oracle GTID.

...this could be very useful if you have a mixed environment while you still would like to use one single tool to ensure high availability of your replication setup

Just to give you an impression: orchestrator works on topologies mixed with Oracle and MariaDB servers, normal replication and binlog servers, STATEMENT and ROW based replication, three major versions in the same cluster (as I recall I've seen that; I don't run this today), and mixed combinations of the above. Truly.

Finally

Use whatever tool works best for you.

Oh, and I forgot!

Please consider attending my talk, Practical Orchestrator, where I will share practical advice and a walkthrough on orchestrator setup.

 