Thoughts on MaxScale automated failover (and Orchestrator)

Having attended a talk (as part of the MariaDB Developer Meeting in Amsterdam) about recent developments of MaxScale in executing automated failovers, here are some (late) observations of mine.

I will begin by noting that the project is stated to be pre-production, and so of course none of the below are complaints, but rather food for thought, points for action and otherwise recommendations.

Some functionality of the MaxScale failover is also implemented by orchestrator, which I author. Orchestrator was built in production environments by and for operational people. In this respect it has gained many insights and had to cope with many real-world cases, special cases & Murphy’s law cases. This post compares logic, feature set and capabilities of the two where relevant. To some extent the below will read as “hey, I’ve already implemented this; shame to re-implement the same”, and indeed I think that way; but it wouldn’t be the first time code of mine was simply re-implemented by someone else, and I’ve done the same myself.

I’m describing the solution the way I understood it from the talk. If I’m wrong on any account I’m happy to be corrected via comments below. Edit: please see comment by Dipti Joshi

General overview

The idea is that MaxScale operates as a proxy to your topology. You do not connect to your master directly, but rather through MaxScale. Thus, MaxScale acts as a proxy to your master.

The next phase is that MaxScale would also auto-detect master failure, fix the topology for you and promote a new master, all while keeping your application unaware of the complexity and without the app having to change setup/DNS/whatever. Of course some write downtime is implied.

Now for some breakdown.

Detection

The detection of a dead master, the check by which a failover is initiated, is based on MaxScale not being able to query the master. This calls for some points for consideration:

Typically, I would see “I can’t connect to the master therefore failover” as too hysterical, and the basis for a lot of false positives.
However, since in the discussed configuration MaxScale is the only access point to the master, the fact MaxScale cannot connect to the master means the master is inaccessible de-facto.
In light of the above, the decision makes sense, but I still hold that it would produce false positives.
I’m unsure (I think not; can anyone comment?) if MaxScale would make multiple attempts over time and only reach the conclusion after X successive failures. This would reduce the false positives.
I have a growing dislike of the Nagios-style “check 4 successive times, then alert/failover” behavior. Orchestrator takes a different approach: it recognizes a master’s death by not being able to connect to the master while still being able to connect to the 1st tier slaves, check their status and observe that they are unable to connect to the master as well. See “What makes a MySQL server failure/recovery case?”. This approach still calls for further refinement (what if the master is temporarily deadlocked? Is this a failover or not?).
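To make the difference concrete, here is a minimal sketch of the "ask the replicas too" idea. It is not orchestrator's actual code: the `ReplicaView` shape, the field names and the rule that at least one replica must be reachable are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaView:
    """What the monitor learned from one first-tier replica (hypothetical shape)."""
    reachable: bool          # could the monitor itself connect to this replica?
    io_thread_running: bool  # does the replica report a working connection to the master?


def master_is_dead(master_reachable: bool, replicas: List[ReplicaView]) -> bool:
    """Declare the master dead only when the monitor and its witnesses agree."""
    if master_reachable:
        return False
    witnesses = [r for r in replicas if r.reachable]
    if not witnesses:
        # Assumption: with no reachable replica there is no independent evidence;
        # the problem may well be on the monitor's side of the network.
        return False
    # Every replica we can talk to must also have lost the master.
    return all(not r.io_thread_running for r in witnesses)
```

With this rule a monitor that loses its own route to the master, while the replicas keep replicating happily, does not trigger a failover.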

Assumed topology

MaxScale assumes the topology is all MariaDB, and all slaves are using (MariaDB) GTID replication. Well, MaxScale does not actually assume that. It is assumed by the MariaDB Replication Manager, which MaxScale invokes. But I’m getting ahead of myself here.

Topology detection

MaxScale does not recognize the master by configuration but rather by state. It observes the servers it should observe, and concludes which is the master.

I’m using a similar approach in orchestrator. I maintain that this approach works well and opens the Chakras for complex recovery options.
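As a rough illustration of recognizing the master by state rather than by configuration, the sketch below infers it from who replicates from whom. The input mapping and the function name are hypothetical; neither MaxScale nor orchestrator is claimed to work exactly this way.

```python
from typing import Dict, List, Optional


def find_masters(replicates_from: Dict[str, Optional[str]]) -> List[str]:
    """Return servers that replicate from nobody but have at least one replica.

    `replicates_from` maps each observed host to its upstream host, or None
    when the host is not replicating from anything (assumed input shape).
    """
    upstreams = {src for src in replicates_from.values() if src is not None}
    return sorted(
        host
        for host, src in replicates_from.items()
        if src is None and host in upstreams
    )


topology = {"db1": None, "db2": "db1", "db3": "db1", "db4": "db3", "db5": None}
masters = find_masters(topology)
```

Here `db5` replicates from nobody but has no replicas either, so it is treated as a standalone server rather than a master. More than one result would hint at a split or misconfigured topology.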

Upon failure detection

When MaxScale detects failure, it invokes external scripts to fix the problem. There are some similar and different particulars here as compared to orchestrator, and I will explain what’s wrong with the MaxScale approach:

Although MaxScale observes the topology and understands who is the master and who isn’t, the executed scripts do not. They need to re-discover everything by themselves.
This implies the scripts start without memory of “what was last observed”. This is one of the greatest strengths of orchestrator: it knows what the state was just before the failure, and, having the bigger picture, can make informed decisions.
- As a nasty example, what do you do when some of the first tier slaves also happen to be inaccessible at that time? What if one of those happens to further have slaves of its own?
The MariaDB Replication Manager script (to be referenced as repmgr) assumes all instances to be MariaDB with GTID.
- It is also implied that all my slaves are configured with binary logs & log-slave-updates
- That’s way too restrictive.
  - Orchestrator handles all following topologies: Oracle MySQL with/out GTID, MariaDB with/out GTID, MariaDB hybrid GTID & non-GTID replication, Pseudo-GTID (MySQL and/or MariaDB), hybrid normal & binlog servers topologies, slaves with/out log-slave-updates, hybrid Oracle & MariaDB & Binlog Servers & Pseudo-GTID.
repmgr is unaware of data centers & physical environments. You want failover to be as local to your datacenters as possible. Avoid too many cross-DC replication streams.

Failover invocation

MaxScale invokes the failover scripts asynchronously. This is a major flaw imho, as the decoupling between the monitoring and acting processes leads to further problems, discussed below.

After failover

MaxScale continuously scans the topology and observes that some other server has been promoted. This behavior is similar to orchestrator’s. But the following differences are noteworthy:

Because of both the decoupling as well as the asynchronous invocation by MaxScale, it doesn’t really have any idea if and how the promotion resolved.
I don’t know that there’s any anti-flapping mechanism, nor that there could be. If MaxScale doesn’t care what happened to the failover script, it cannot be expected to keep up with flapping scenarios.
Nor is there a minimal suspend period between any two failure recoveries, that I know of. MaxScale can actually have an easier life than orchestrator in this regard as it is (I suspect) strictly associated with a single topology. It is not as if a single MaxScale handles multiple topologies. So it should be very easy to keep track of failures.
Or, if there is a minimal period and I’m just uninformed — what makes sure it is not smaller than the time it takes for the failover?

Further on failover

I wish to point out that one component of the system analyses a failure scenario, and another one fixes it. I suggest this is an undesired design. The “fixer” must have its own ability to diagnose problems as it makes progress (or else it is naive and would fail in many production cases). And the “analyzer” part must have some wisdom of its own so as to suggest a course of action, or understand the consequences of the recovery done by the “fixer”.

Use of shell scripts

Generally speaking, the use of shell scripts as external hooks is evil:

Shell scripts tend to be poorly audited
With poor clarity as for what went wrong
Killing them has operational difficulty (detect the shell script, find possible children, detached children)
The approach of “if you want something else, just write a shell script for it” is nice for some things, but as the problem turns complex, you end up writing big parts of the solution in shell. This decouples your code to an unwanted degree.

At this time, orchestrator also uses external hooks. However:

Fixing the topology happens within orchestrator, not by external scripts. There is an elaborate, auditable, visible decision making process.
- Decision making includes data center considerations, different configuration of servers involved, servers hinted as candidates, servers configured to be ignored, servers known to be downtimed.
This leaves the external scripts with the task of managing DNS changes or what have you.
- Today, at Booking.com, we have a special operational tool (called the dba tool) which does that, manages rosters, issues puppet etc. This tool is itself well audited. Granted, there is still decoupling, but information does not just get lost.
- Sometime in the future I suspect I will extend orchestrator-agent to participate in failovers, which means the entire flow is within orchestrator’s scope.
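The decision making considerations listed above (data centers, hinted candidates, ignored and downtimed servers) could be combined along these lines. This is a deliberately simplified sketch and not orchestrator's promotion algorithm; the `Replica` fields and the ordering of preferences are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    host: str
    data_center: str
    is_candidate: bool = False  # hinted by operators as a preferred promotion target
    is_ignored: bool = False    # configured never to be promoted
    is_downtimed: bool = False  # known to be under maintenance


def pick_promotion_target(failed_master_dc: str, replicas: List[Replica]) -> Optional[Replica]:
    """Pick a replica to promote, or None when nothing is eligible."""
    eligible = [r for r in replicas if not r.is_ignored and not r.is_downtimed]
    if not eligible:
        return None
    # Assumed preference order: same data center as the failed master first,
    # then hinted candidates, then host name to keep the choice deterministic.
    return min(
        eligible,
        key=lambda r: (r.data_center != failed_master_dc, not r.is_candidate, r.host),
    )
```

A real implementation would also have to weigh replication positions, since the most convenient server is not necessarily the most up to date one.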

High availability

All the above is only available via a single MaxScale server. What happens if it dies?

There is a MaxScale/pacemaker setup I’m aware of. If one MaxScale dies, pacemaker takes charge and starts another on another box.

But this means real downtime
There are no multiple-MaxScale servers to load-balance on
The MaxScale started by pacemaker is newly born, and does not have the big picture of the topology. It needs to go through a “peaceful time” to understand what’s going on.

More High Availability

At such time as MaxScale is able to load-balance and run on multiple nodes, MariaDB will have to further tackle:

Leader election
Avoiding concurrent initiation of failovers
- Either via group communication
- Or via shared datastore
Picking up a failed/crashed MaxScale server’s work
- Or rolling it back
- Or just cleaning it up
And generally share all those little pieces of information, such as “Hey, now this server is the master” (are all MaxScales in complete agreement on the topology?) or “I have failed over this topology, we should avoid failing it over again for the next 10 minutes” and more.

The above are supported by orchestrator. It provides leader election, automated leader promotion, fair recognition of various failure scenarios, picking up a failed recovery from a failed orchestrator. Data is shared by a backend MySQL datastore, and before you shout SPOF, make it Galera/NDB.
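For a feel of what leader election over a shared datastore can look like, here is a small lease-based sketch. It uses Python's built-in `sqlite3` purely as a stand-in for the shared backend; the table layout, the `try_become_leader` name and the lease semantics are assumptions and do not describe orchestrator's actual schema or logic.

```python
import sqlite3


def try_become_leader(conn: sqlite3.Connection, node: str, now: float, lease_seconds: float) -> bool:
    """Claim or renew a single leadership lease; return True if `node` holds it."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS leader ("
            "id INTEGER PRIMARY KEY CHECK (id = 1), node TEXT, expires REAL)"
        )
        # The first node to arrive creates the single lease row.
        conn.execute(
            "INSERT OR IGNORE INTO leader (id, node, expires) VALUES (1, ?, ?)",
            (node, now + lease_seconds),
        )
        # Renew our own lease, or take over one that has expired.
        conn.execute(
            "UPDATE leader SET node = ?, expires = ? "
            "WHERE id = 1 AND (node = ? OR expires <= ?)",
            (node, now + lease_seconds, node, now),
        )
        row = conn.execute("SELECT node FROM leader WHERE id = 1").fetchone()
    return row[0] == node
```

Only the node holding the lease would initiate failovers. A node that crashes simply stops renewing, and another one takes over once the lease expires.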

Further little things that can ruin your day

How about having a delayed replica?

Here’s an operational use case we had to tackle.

You have a slave configured to lag by 24 hours. You know the drill: hackers / accidental DROP TABLE…
How much time will an automated tool spend on reconnecting this slave to the topology?
- This could take long minutes
- Will your recovery hang till this is resolved?
Since orchestrator heals the topology in-house, it knows how to push certain operations till after specific other operations took place. For example, orchestrator wants to heal the entire topology, but pushes the delayed replicas aside, under the assumption that it will be able to fix them later (fair assumption, because they are known to be behind our promoted master); it will proceed to fix everything else, execute external hooks (change DNS etc.) and only then come back to the delayed replica. All the while, the process is audited.
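The ordering described above can be sketched as follows. The function names, the `repoint` and `run_hooks` callbacks and the audit strings are hypothetical, and the sketch only captures the idea of postponing delayed replicas, not orchestrator's real recovery flow.

```python
from typing import Callable, Dict, List, Tuple


def plan_recovery(replica_delay_seconds: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Split replicas into those to fix right away and those to postpone."""
    immediate = sorted(h for h, d in replica_delay_seconds.items() if d == 0)
    postponed = sorted(h for h, d in replica_delay_seconds.items() if d > 0)
    return immediate, postponed


def recover(
    replica_delay_seconds: Dict[str, int],
    repoint: Callable[[str], None],
    run_hooks: Callable[[], None],
) -> List[str]:
    """Heal the topology, running external hooks before touching delayed replicas."""
    audit: List[str] = []
    immediate, postponed = plan_recovery(replica_delay_seconds)
    for host in immediate:
        repoint(host)
        audit.append(f"repointed {host}")
    run_hooks()  # e.g. DNS changes; the application can resume writes from here on
    audit.append("ran external hooks")
    for host in postponed:
        repoint(host)  # may take long minutes, but no longer blocks the recovery
        audit.append(f"repointed delayed {host}")
    return audit
```

The point is that the slow step is moved off the critical path: write downtime ends when the hooks run, not when the 24 hour delayed replica is finally reconnected.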

Flapping ruins your day

Not only do you want some stall period between two failovers, you also want your team to respond to a failover and acknowledge it. Or clear up the stall period having verified the source of the problem. Or force the next failover even if it comes sooner than the stall period termination.
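A minimal sketch of such a gate, with a stall period, an acknowledgement and a force option, might look like this. The class and method names are made up for illustration, and treating an acknowledgement as clearing the stall period is an assumption.

```python
from typing import Optional


class FailoverGate:
    """Decides whether a new failover may start (illustrative only)."""

    def __init__(self, block_seconds: float) -> None:
        self.block_seconds = block_seconds
        self.last_failover_at: Optional[float] = None
        self.acknowledged = True

    def record_failover(self, now: float) -> None:
        """Remember a failover; it stays unacknowledged until a human reacts."""
        self.last_failover_at = now
        self.acknowledged = False

    def acknowledge(self) -> None:
        """The team has verified the source of the problem; clear the stall period."""
        self.acknowledged = True

    def may_failover(self, now: float, force: bool = False) -> bool:
        if force or self.last_failover_at is None or self.acknowledged:
            return True
        return now - self.last_failover_at >= self.block_seconds


gate = FailoverGate(block_seconds=600)
gate.record_failover(now=0)
blocked = gate.may_failover(now=100)               # within the stall period
forced = gate.may_failover(now=100, force=True)    # operator override
```

Because the gate needs to know that a failover happened and when, it only works if the component making the decision is also informed of the outcome of the recovery.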

Binlog formats

It is still not uncommon to have Statement Based Replication running. And then it is also not uncommon to have one or two slaves translating to Row Based Replication because of:

Some app that has to read ROW based format
Experimenting with RBR for purposes of upgrade

You just can’t promote such an RBR slave on top of SBR slaves; it wouldn’t work. Orchestrator is aware of such rules. I still need to integrate this particular consideration into the promotion algorithm.

Versions

Likewise, not all your slaves are of the same version. You should not promote a newer version slave on top of an older version slave. Again, orchestrator will not allow setting up such a topology, and again, I still need to integrate this consideration into the promotion algorithm.
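Both the binlog format rule and the version rule boil down to a "can this server replicate from that one?" check. The sketch below is a simplified assumption of how such a check might look, not orchestrator's code; in particular, placing MIXED between STATEMENT and ROW is my own simplification.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Server:
    host: str
    version: Tuple[int, int]  # (major, minor), e.g. (5, 6)
    binlog_format: str        # "STATEMENT", "MIXED" or "ROW"


# Assumed ordering: a server can only replicate from a master whose format ranks no higher.
_FORMAT_RANK = {"STATEMENT": 0, "MIXED": 1, "ROW": 2}


def can_replicate_from(replica: Server, master: Server) -> bool:
    """Return True if `replica` may safely be placed below `master`."""
    if master.version > replica.version:
        return False  # never put an older version below a newer one
    return _FORMAT_RANK[master.binlog_format] <= _FORMAT_RANK[replica.binlog_format]


sbr = Server("db-sbr", (5, 6), "STATEMENT")
rbr = Server("db-rbr", (5, 6), "ROW")
newer = Server("db-new", (5, 7), "STATEMENT")
```

A promotion algorithm could then reject any candidate for which `can_replicate_from(sibling, candidate)` is false for some sibling, which is exactly the integration work mentioned above.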

In summary

There is a long way for MaxScale failover to go. When you consider the simplest, all-MariaDB-GTID-equal-slaves small topology case, things are kept simple and probably sustainable. But issues like complex topologies, flapping, special slaves, different configuration, high availability, leadership, acknowledgements, and more, call for a more advanced solution.