Failover – code.openark.org http://shlomi-noach.github.io/blog/ Blog by Shlomi Noach Thu, 03 Aug 2017 13:32:59 +0000 en-US hourly 1 https://wordpress.org/?v=5.3.3 32412571 orchestrator/raft: Pre-Release 3.0 https://shlomi-noach.github.io/blog/mysql/orchestratorraft-pre-release-3-0 https://shlomi-noach.github.io/blog/mysql/orchestratorraft-pre-release-3-0#comments Thu, 03 Aug 2017 08:41:11 +0000 https://shlomi-noach.github.io/blog/?p=7740 orchestrator 3.0 Pre-Release is now available. Most notable are Raft consensus, SQLite backend support, orchestrator-client no-binary-required client script.

TL;DR

You may now set up high availability for orchestrator via raft consensus, without need to set up high availability for orchestrator‘s backend MySQL servers (such as Galera/InnoDB Cluster). In fact, you can run a orchestrator/raft setup using embedded SQLite backend DB. Read on.

orchestrator still supports the existing shared backend DB paradigm; nothing dramatic changes if you upgrade to 3.0 and do not configure raft.

orchestrator/raft

Raft is a consensus protocol, supporting leader election and consensus across a distributed system.  In an orchestrator/raft setup orchestrator nodes talk to each other via raft protocol, form consensus and elect a leader. Each orchestrator node has its own dedicated backend database. The backend databases do not speak to each other; only the orchestrator nodes speak to each other.

No MySQL replication setup needed; the backend DBs act as standalone servers. In fact, the backend server doesn’t have to be MySQL, and SQLite is supported. orchestrator now ships with SQLite embedded, no external dependency needed.

In a orchestrator/raft setup, all orchestrator nodes talk to each other. One and only one is elected as leader. To become a leader a node must be part of a quorum. On a 3 node setup, it takes 2 nodes to form a quorum. On a 5 node setup, it takes 3 nodes to form a quorum.

Only the leader will run failovers. This much is similar to the existing shared-backend DB setup. However in a orchestrator/raft setup each node is independent, and each orchestrator node runs discoveries. This means a MySQL server in your topology will be routinely visited and probed by not one orchestrator node, but by all 3 (or 5, or what have you) nodes in your raft cluster.

Any communication to orchestrator must take place through the leader. One may not tamper directly with the backend DBs anymore, since the leader is the one authoritative entity to replicate and announce changes to its peer nodes. See orchestrator-client section following.

For details, please refer to the documentation:

The orchetrator/raft setup comes to solve several issues, the most obvious is high availability for the orchestrator service: in a 3 node setup any single orchestrator node can go down and orchestrator will reliably continue probing, detecting failures and recovering from failures.

Another issue solve by orchestrator/raft is network isolation, in particularly cross-DC, also refered to as fencing. Some visualization will help describe the issue.

Consider this 3 data-center replication setup. The master, along with a few replicas, resides on DC1. Two additional DCs have intermediate masters, aka local-masters, that relay replication to local replicas.

We place 3 orchestrator nodes in a raft setup, each in a different DC. Note that traffic between orchestrator nodes is very low, and cross DC latencies still conveniently support the raft communication. Also note that backend DB writes have nothing to do with cross-DC traffic and are unaffected by latencies.

Consider what happens if DC1 gets network isolated: no traffic in or out DC1

Each orchestrator nodes operates independently, and each will see a different state. DC1’s orchestrator will see all servers in DC2, DC3 as dead, but figure the master itself is fine, along with its local DC1 replicas:

However both orchestrator nodes in DC2 and DC3 will see a different picture: they will see all DC1’s servers as dead, with local masters in DC2 and DC3 having broken replication:

Who gets to choose?

In the orchestrator/raft setup, only the leader runs failovers. The leader must be part of a quorum. Hence the leader will be an orchestrator node in either DC2 or DC3. DC1’s orchestrator will know it is isolated, that it isn’t part of the quorum, hence will step down from leadership (that’s the premise of the raft consensus protocol), hence will not run recoveries.

There will be no split brain in this scenario. The orchestrator leader, be it in DC2 or DC3, will act to recover and promote a master from within DC2 or DC3. A possible outcome would be:

What if you only have 2 data centers?

In such case it is advisable to put two orchestrator nodes, one in each of your DCs, and a third orchestrator node as a mediator, in a 3rd DC, or in a different availability zone. A cloud offering should do well:

The orchestrator/raft setup plays nice and allows one to nominate a preferred leader.

SQLite

Suggested and requested by many, is to remove orchestrator‘s own dependency on a MySQL backend. orchestrator now supports a SQLite backend.

SQLite is a transactional, relational, embedded database, and as of 3.0 it is embedded within orchestrator, no external dependency required.

SQLite doesn’t replicate, doesn’t support client/server protocol. As such, it cannot work as a shared database backend. SQLite is only available on:

  • A single node setup: good for local dev installations, testing server, CI servers (indeed, SQLite now runs in orchestrator‘s CI)
  • orchestrator/raft setup, where, as noted above, backend DBs do not communicate with each other in the first place and are each dedicated to their own orchestrator node.

It should be pointed out that SQLite is a great transactional database, however MySQL is more performant. Load on backend DB is directly (and mostly linearly) affected by the number of probed servers. If you have 50 servers in your topologies or 500 servers, that matters. The probing frequency of course also matters for the write frequency on your backend DB. I would suggest if you have thousands of backend servers, to stick with MySQL. If dozens, SQLite should be good to go. In between is a gray zone, and at any case run your own experiments.

At this time SQLite is configured to commit to file; there is a different setup where SQLite places data in-memory, which makes it faster to execute. Occasional dumps required for durability. orchestrator may support this mode in the future.

orchestrator-client

You install orchestrator as a service on a few boxes; but then how do you access it from other hosts?

  • Either curl the orchestrator API
  • Or, as most do, install orchestrator-cli package, which includes the orchestrator binary, everywhere.

The latter implies:

  • Having the orchestrator binary installed everywhere, hence updated everywhere.
  • Having the /etc/orchestrator.conf.jsondeployed everywhere, along with credentials.

The orchestrator/raft setup does not support running orchestrator in command-line mode. Reason: in this mode orchestrator talks directly to the shared backend DB. There is no shared backend DB in the orchestrator/raft setup, and all communication must go through the leader service. This is a change of paradigm.

So, back to curling the HTTP API. Enter orchestrator-client which mimics the command line interface, while running curl | jq requests against the HTTP API. orchestrator-client, however, is just a shell script.

orchestrator-client will work well on either orchestrator/raft or on your existing non-raft setups. If you like, you may replace your remote orchestrator installations and your /etc/orchestrator.conf.json deployments with this script. You will need to provide the script with a hint: the $ORCHESTRATOR_API environment variable should be set to point to the orchestrator HTTP API.

Here’s the fun part:

  • You will either have a proxy on top of your orchestrator service cluster, and you would export ORCHESTRATOR_API=http://my.orchestrator.service/api
  • Or you will provide orchestrator-client with all orchestrator node identities, as in export ORCHESTRATOR_API="https://orchestrator.host1:3000/api https://orchestrator.host2:3000/api https://orchestrator.host3:3000/api" .
    orchestrator-client will figure the identity of the leader and will forward requests to the leader. At least scripting-wise, you will not require a proxy.

Status

orchestrator 3.0 is a Pre-Release. We are running a mostly-passive orchestrator/raft setup in production. It is mostly-passive in that it is not in charge of failovers yet. Otherwise it probes and analyzes our topologies, as well as runs failure detection. We will continue to improve operational aspects of the orchestrator/raft setup (see this issue).

 

]]>
https://shlomi-noach.github.io/blog/mysql/orchestratorraft-pre-release-3-0/feed 4 7740
State of automated recovery via Pseudo-GTID & Orchestrator @ Booking.com https://shlomi-noach.github.io/blog/mysql/state-of-automated-recovery-via-pseudo-gtid-orchestrator-booking-com https://shlomi-noach.github.io/blog/mysql/state-of-automated-recovery-via-pseudo-gtid-orchestrator-booking-com#respond Fri, 20 Nov 2015 09:41:13 +0000 https://shlomi-noach.github.io/blog/?p=7453 This post sums up some of my work on MySQL resilience and high availability at Booking.com by presenting the current state of automated master and intermediate master recoveries via Pseudo-GTID & Orchestrator.

Booking.com uses many different MySQL topologies, of varying vendors, configurations and workloads: Oracle MySQL, MariaDB, statement based replication, row based replication, hybrid, OLTP, OLAP, GTID (few), no GTID (most), Binlog Servers, filters, hybrid of all the above.

Topologies size varies from a single server to many-many-many. Our typical topology has a master in one datacenter, a bunch of slaves in same DC, a slave in another DC acting as an intermediate master to further bunch of slaves in the other DC. Something like this, give or take:

booking-topology-sample

However as we are building our third data center (with MySQL deployments mostly completed) the graph turns more complex.

Two high availability questions are:

  • What happens when an intermediate master dies? What of all its slaves?
  • What happens when the master dies? What of the entire topology?

This is not a technical drill down into the solution, but rather on overview of the state. For more, please refer to recent presentations in September and April.

At this time we have:

  • Pseudo-GTID deployed on all chains
  • Pseudo-GTID based automated failover for intermediate masters on all chains
  • Pseudo-GTID based automated failover for masters on roughly 30% of the chains.
    • The rest of 70% of chains are set for manual failover using Pseudo-GTID.

Pseudo-GTID is in particular used for:

  • Salvaging slaves of a dead intermediate master
  • Correctly grouping and connecting slaves of a dead master
  • Routine refactoring of topologies. This includes:
    • Manual repointing of slaves for various operations (e.g. offloading slaves from a busy box)
    • Automated refactoring (for example, used by our automated upgrading script, which consults with orchestrator, upgrades, shuffles slaves around, updates intermediate master, suffles back…)
  • (In the works), failing over binlog reader apps that audit our binary logs.

Furthermore, Booking.com is also working on Binlog Servers:

  • These take production traffic and offload masters and intermediate masters
  • Often co-serve slaves using round-robin VIP, such that failure of one Binlog Server makes for simple slave replication self-recovery.
  • Are interleaved alongside standard replication
    • At this time we have no “pure” Binlog Server topology in production; we always have normal intermediate masters and slaves
  • This hybrid state makes for greater complexity:
    • Binlog Servers are not designed to participate in a game of changing masters/intermediate master, unless successors come from their own sub-topology, which is not the case today.
      • For example, a Binlog Server that replicates directly from the master, cannot be repointed to just any new master.
      • But can still hold valuable binary log entries that other slaves may not.
    • Are not actual MySQL servers, therefore of course cannot be promoted as masters

Orchestrator & Pseudo-GTID makes this hybrid topology still resilient:

  • Orchestrator understands the limitations on the hybrid topology and can salvage slaves of 1st tier Binlog Servers via Pseudo-GTID
  • In the case where the Binlog Servers were the most up to date slaves of a failed master, orchestrator knows to first move potential candidates under the Binlog Server and then extract them out again.
  • At this time Binlog Servers are still unstable. Pseudo-GTID allows us to comfortably test them on a large setup with reduced fear of losing slaves.

Otherwise orchestrator already understands pure Binlog Server topologies and can do master promotion. When pure binlog servers topologies will be in production orchestrator will be there to watch over.

Summary

To date, Pseudo-GTID has high scores in automated failovers of our topologies; orchestrator’s holistic approach makes for reliable diagnostics; together they reduce our dependency on specific servers & hardware, physical location, latency implied by SAN devices.

]]>
https://shlomi-noach.github.io/blog/mysql/state-of-automated-recovery-via-pseudo-gtid-orchestrator-booking-com/feed 0 7453
Orchestrator & Pseudo-GTID for binlog reader failover https://shlomi-noach.github.io/blog/mysql/orchestrator-pseudo-gtid-for-binlog-reader-failover https://shlomi-noach.github.io/blog/mysql/orchestrator-pseudo-gtid-for-binlog-reader-failover#respond Thu, 19 Nov 2015 08:52:16 +0000 https://shlomi-noach.github.io/blog/?p=7446 One of our internal apps at Booking.com audits changes to our tables on various clusters. We used to use tungsten replicator, but have since migrated onto our own solution.

We have a binlog reader (uses open-replicator) running on a slave. It expects Row Based Replication, hence our slave runs with log-slave-updates, binlog-format=’ROW’, to translate from the master’s Statement Based Replication. The binlog reader reads what it needs to read, audits what it needs to audit, and we’re happy.

However what happens if that slave dies?

In such case we need to be able to point our binlog reader to another slave, and it needs to be able to pick up auditing from the same point.

This sounds an awful lot like slave repointing in case of master/intermediate master failure, and indeed the solutions are similar. However our binlog reader is not a real MySQL server and does not understands replication. It does not really replicate, it just parses binary logs.

We’re also not using GTID. But we are using Pseudo-GTID. As it turns out, the failover solution is already built in by orchestrator, and this is how it goes:

Normal execution

Our binlog app reads entries from the binary log. Some are of interest for auditing purposes, some are not. An occasional Pseudo-GTID entry is found, and is being stored to ZooKeeper tagged as  “last seen and processed Pseudo-GTID”.

Upon slave failure

We recognize the death of a slave; we have other slaves in the pool; we pick another. Now we need to find the coordinates from which to carry on.

We read our “last seen and processed Pseudo-GTID”. Say it reads:

drop view if exists `meta`.`_pseudo_gtid_hint__asc:56373F17:00000000012B1C8B:50EC77A1`

. We now issue:

$ orchestrator -c find-binlog-entry -i new.slave.fqdn.com --pattern='drop view if exists `meta`.`_pseudo_gtid_hint__asc:56373F17:00000000012B1C8B:50EC77A1`'

The output of such command are the binlog coordinates of that same entry as found in the new slave’s binlogs:

binlog.000148:43664433

Pseudo-GTID entries are only injected once every few seconds (5 in our case). Either:

  • We are OK to reprocess up to 5 seconds worth of data (and indeed we are, our mechanism is such that this merely overwrites our previous audit, no corruption happens)
  • Or our binlog reader also keeps track of the number of events since the last processed Pseudo-GTID entry, skipping the same amount of events after failing over.

Planned failover

In case we plan to repoint our binlog reader to another slave, we can further use orchestrator’s power in making an exact correlation between the binlog positions of two slaves. This has always been within its power, but only recently exposed as it own command. We can, at any stage:

$ sudo orchestrator -c correlate-binlog-pos -i current.instance.fqdn.com --binlog=binlog.002011:72656109 -d some.other.instance.fqdn.com

The output is the binlog coordinates in some.other.instance.fqdn.com that exactly correlate with binlog.002011:72656109 in current.instance.fqdn.com

The case of failure of the binlog reader itself is also handled, but is not the subject of this blog post.

]]>
https://shlomi-noach.github.io/blog/mysql/orchestrator-pseudo-gtid-for-binlog-reader-failover/feed 0 7446
Thoughts on MaxScale automated failover (and Orchestrator) https://shlomi-noach.github.io/blog/mysql/thoughts-on-maxscale-automated-failover-and-orchestrator https://shlomi-noach.github.io/blog/mysql/thoughts-on-maxscale-automated-failover-and-orchestrator#comments Wed, 18 Nov 2015 09:17:48 +0000 https://shlomi-noach.github.io/blog/?p=7439 Having attended a talk (as part of the MariaDB Developer Meeting in Amsterdam) about recent developments of MaxScale in executing automated failovers, here are some (late) observations of mine.

I will begin by noting that the project is stated to be pre-production, and so of course none of the below are complaints, but rather food for thought, points for action and otherwise recommendations.

Some functionality of the MaxScale failover is also implemented by orchestrator, which I author. Orchestrator was built in production environments by and for operational people. In this respect it has gained many insights and had to cope with many real-world cases, special cases & Murphy’s law cases. This post compares logic, feature set and capabilities of the two where relevant. To some extent the below will read as “hey, I’ve already implemented this; shame to re-implement the same”, and indeed I think that way; but it wouldn’t be the first time a code of mine would just be re-implemented by someone else and I’ve done the same, myself.

I’m describing the solution the way I understood it from the talk. If I’m wrong on any account I’m happy to be corrected via comments below. Edit: please see comment by Dipti Joshi

General overview

The idea is that MaxScale operates as a proxy to your topology. You do not connect to your master directly, but rather through MaxScale. Thus, MaxScale acts as a proxy to your master.

The next phase is that MaxScale would also auto-detect master failure, fix the topology for you, promote a new master, and will have your application unaware of all the complexity and without the app having to change setup/DNS/whatever. Of course some write downtime is implied.

Now for some breakdown.

Detection

The detection of a dead master, the check by which a failover is initiated, is based on MaxScale not being able to query the master. This calls for some points for consideration:

  • Typically, I would see “I can’t connect to the master therefore failover” as too hysterical, and the basis for a lot of false positives.
  • However, since in the discussed configuration MaxScale is the only access point to the master, the fact MaxScale cannot connect to the master means the master is inaccessible de-facto.
  • In light of the above, the decision makes sense – but I still hold that it would make false positives.
  • I’m unsure (I think not; can anyone comment?) if MaxScale would make multiple attempts over time and only reach the conclusion after X successive failures. This would reduce the false positives.
  • I’m having a growing dislike to a “check for 4 successive times then alert/failover” Nagios-style behavior. Orchestrator takes a different approach where it recognizes a master’s death by not being able to connect to the master as well as being able to connect to 1st tier slaves, check their status and observe that they’re unable to connect to the master as well. See What makes a MySQL server failure/recovery case?. This approach still calls for further refinement (what if the master is temporarily deadlocked? Is this a failover or not?).

Assumed topology

MaxScale assumes the topology is all MariaDB, and all slaves are using (MariaDB) GTID replication. Well, MaxScale does not actually assumes that. It is assumed so by the MariaDB Replication Manager which MaxScale invokes. But I’m getting ahead of myself here.

Topology detection

MaxScale does not recognize the master by configuration but rather by state. It observes the servers it should observe, and concludes which is the master.

I’m using similar approach in orchestrator. I maintain that this approach works well and opens the Chakras for complex recovery options.

Upon failure detection

When MaxScale detects failure, it invokes external scripts to fix the problem. There are some similar and different particulars here as compared to orchestrator, and I will explain what’s wrong with the MaxScale approach:

  • Although MaxScale observes the topology and understands who is the master and who isn’t, the executed scripts do not. They need to re-discover everything by themselves.
  • This implies the scripts start without memory of “what was last observed”. This is one of the greatest strengths of orchestrator: it knows what the state was just before the failure, and, having the bigger picture, can make informed decisions.
    • As a nasty example, what do you do when some the first tier slaves also happen to be inaccessible at that time? What if one of those happens to further have slaves of its own?
  • The MariaDB Replication Manager script (to be referenced as repmgr) assumes all instances to be MariaDB with GTID.
    • It is also implied that all my slaves are configured with binary logs & log-slave-updates
    • That’s way too restrictive.
      • Orchestrator handles all following topologies: Oracle MySQL with/out GTID, MariaDB with/out GTID, MariaDB hybrid GTID & non-GTID replication, Pseudo-GTID (MySQL and/or MariaDB), hybrid normal & binlog servers topologies, slaves with/out log-slave-updates, hybrid Oracle & MariaDB & Binlog Servers & Pseudo-GTID.
  • repmgr is unaware of data centers & physical environments. You want failover to be as local to your datacenters as possible. Avoid too many cross-DC replication streams.

Failover invocation

MaxScale invokes the failover scripts asynchronously. This is a major flaw imho, as the decoupling between the monitoring and acting processes leads to further problems, see further.

After failover

MaxScale continuously scans the topology and observes that some other server has been promoted. This behavior is similar to orchestrator’s. But the following differences are noteworthy:

  • Because of both the decoupling as well as the asynchronous invocation by MaxScale, it doesn’t really have any idea if and how the promotion resolved.
  • I don’t know that there’s any anti-flapping mechanism, nor that there could be. If MaxScale doesn’t care what happened to the failover script, it shouldn’t be able to keep up with flapping scenarios.
  • Nor is there a minimal suspend period between any two failure recoveries, that I know of. MaxScale can actually have easier life than orchestrator in this regard as it is (I suspect) strictly associated with a topology. Not like there’s a single MaxScale handling multiple topologies. So it should be very easy to keep track of failures.
  • Or, if there is a minimal period and I’m just uninformed — what makes sure it is not smaller than the time it takes for the failover?

Further on failover

I wish to point out that one component of the system analyses a failure scenario, and another one fixes it. I suggest this is an undesired design. The “fixer” must have its own ability to diagnose problems as it makes progress (or else it is naive and would fail in many production cases). And the “analyzer” part must have some wisdom of its own so as to suggest course of action; or understand the consequences of the recovery done by the “fixer”.

Use of shell scripts

Generally speaking, the use of shell scripts as external hooks is evil:

  • Shell scripts tend to be poorly audited
  • With poor clarity as for what went wrong
  • Killing them has operational difficulty (detect the shell script, find possible children, detached children)
  • The approach of “if you want something else, just write a shell script for it” is nice for some things, but as the problem turns complex, you turn out to just write big parts of the solution in shell. This decouples your code to unwanted degree.

At this time, orchestrator also uses external hooks. However:

  • Fixing the topology happens within orchestrator, not by external scripts. There is an elaborate, auditable, visible decision making.
    • Decision making includes data center considerations, different configuration of servers involved, servers hinted as candidates, servers configured to be ignored, servers known to be downtimed.
  • Leaving the external scripts with the task of managing DNS changes or what have you.
    • Today, at Booking.com, we have a special operational tool (called the dba tool) which does that, manages rosters, issues puppet etc. This tool is itself well audited. Granted, there is still decoupling, but information does not just get lost.
    • Sometime in the future I suspect I will extend orchestrator-agent to participate in failovers, which means the entire flow is within orchestrator’s scope.

High availability

All the above is only available via a single MaxScale server. What happens if it dies?

There is a MaxScale/pacemaker setup I’m aware of. If one MaxScale dies, pacemaker takes charge and starts another on another box.

  • But this means real downtime
  • There are no multiple-MaxScale servers to load-balance on
  • The MaxScale started by pacemaker is newly born, and does not have the big picture of the topology. It needs to go through a “peaceful time” to understand what’s going on.

More High Availability

At a time where MaxScale will be able to load-balance and run on multiple nodes, MariaDB will have to further tackle:

  • Leader election
  • Avoiding concurrent initiation of failovers
    • Either via group communication
    • Or via shared datastore
  • Taking off from a failed/crashed MaxScale server’s work
    • Or rolling it back
    • Or just cleaning it up
  • And generally share all those little pieces of information, such as “Hey, now this server is the master” (are all MaxScales in complete agreement on the topology?) or “I have failed over this topology, we should avoid failing it over again for the next 10 minutes” and more.

The above are supported by orchestrator. It provides leader election, automated leader promotion, fair recognition of various failure scenarios, picking up a failed recovery from a failed orchestrator. Data is shared by a backend MySQL datastore, and before you shout SPOF, make it Galera/NDB.

Further little things that can ruin your day

How about having a delayed replica?

Here’s an operational use case we had to tackle.

  • You have a slave configured to lag by 24 hours. You know the drill: hackers / accidental DROP TABLE
  • How much time will an automated tool spend on reconnecting this slave to the topology?
    • This could take long minutes
    • Will your recovery hang till this is resolved?
  • Since orchestrator heals the topology in-house, it knows how to push certain operations till after specific other operations took place. For example, orchestrator wants to heal the entire topology, but pushes the delayed replicas aside, under the assumption that it will be able to fix them later (fair assumption, because they are known to be behind our promoted master); it will proceed to fix everything else, execute external hooks (change DNS etc.) and only then come back to the delayed replica. All the while, the process is audited.

Flapping ruins your day

  • Not only do you want some stall period between two failovers, you also want your team to respond to a failover and acknowledge it. Or clear up the stall period having verified the source of the problem. Or force the next failover even if it comes sooner than the stall period termination.

Binlog formats

It is still not uncommon to have Statement Based Replication running. And then it is also not uncommon to have one or two slaves translating to Row Based Replication because of:

  • Some app that has to read ROW based format
  • Experimenting with RBR for purposes of upgrade

You just can’t promote such a RBR slave on top of SBR slaves; it wouldn’t work. Orchestrator is aware of such rules. I still need to integrate this particular consideration into the promotion algorithm.

Versions

Likewise, not all your slaves are of same version. You should not promote a newer version slave on top of an older version slave. Again, orchestrator will not allow putting such a topology, and again, I still need to integrate this consideration into the promotion algorithm.

In summary

There is a long way for MaxScale failover to go. When you consider the simplest, all-MariaDB-GTID-equal-slaves small topology case, things are kept simple and probably sustainable. But issues like complex topologies, flapping, special slaves, different configuration, high availability, leadership, acknowledgements, and more, call for a more advanced solution.

]]>
https://shlomi-noach.github.io/blog/mysql/thoughts-on-maxscale-automated-failover-and-orchestrator/feed 6 7439
What makes a MySQL server failure/recovery case? https://shlomi-noach.github.io/blog/mysql/what-makes-a-mysql-server-failurerecovery-case https://shlomi-noach.github.io/blog/mysql/what-makes-a-mysql-server-failurerecovery-case#comments Sat, 25 Jul 2015 07:00:03 +0000 https://shlomi-noach.github.io/blog/?p=7274 Or: How do you reach the conclusion your MySQL master/intermediate-master is dead and must be recovered?

This is an attempt at making a holistic diagnosis of our replication topologies. The aim is to cover obvious and not-so-obvious crash scenarios, and to be able to act accordingly and heal the topology.

At Booking.com we are dealing with very large amounts of MySQL servers. We have many topologies, and many servers in each topology. See past numbers to get a feel for it. At these numbers failures happen frequently. Typically we would see normal slaves failing, but occasionally — and far more frequently than we would like to be paged for — an intermediate master or a master would crash. But our current (and ever in transition) setup also include SANs, DNS records, VIPs, any of which can fail and bring down our topologies.

Tackling issues of monitoring, disaster analysis and recovery processes, I feel safe to claim the following statements:

  • The fact your monitoring tool cannot access your database does not mean your database has failed.
  • The fact your monitoring tool can access your database does not mean your database is available.
  • The fact your database master is unwell does not mean you should fail over.
  • The fact your database master is alive and well does not mean you should not fail over.

Bummer. Let’s review a simplified topology with a few failure scenarios. Some of these scenarios you will find familiar. Some others may be caused by setups you’re not using. I would love to say I’ve seen it all but the more I see the more I know how strange things can become.

We will consider the simplified case of a master with three replicas: we have M as master, A, B, C as slaves.

mysql-topologies-failures

 

A common monitoring scheme is to monitor each machine’s IP, availability of MySQL port (3306) and responsiveness to some simple query (e.g. “SELECT 1”). Some of these checks may run local to the machine, others remote.

Now consider your monitoring tool fails to connect to your master.

mysql-topologies-failures (1)

I’ve marked the slaves with question marks as the common monitoring schema does not associate the master’s monitoring result to the slaves’.  Can you safely conclude your master is dead? Are your feeling comfortable with initiating a failover process? How about:

  • Temporary network partitioning; it just so happens that your monitoring tool cannot access the master, though everyone else can.
  • DNS/VIP/name cache/name resolving issue. Sometimes similar to the above; does you monitoring tool host think the master’s IP is what it really is? Has something just changed? Some cache expired? Some cache is stale?
  • MySQL connection rejection. This could be due to a serious “Too many connections” problem on the master, or due to accidental network noise.

Now consider the following case: a first tier slave is failing to connect to the master:

mysql-topologies-failures (2)

The slave’s IO thread is broken; do we have a problem here? Is the slave failing to connect because the master is dead, or because the slave itself suffers from a network partitioning glitch?

A holistic diagnosis

In the holistic approach we couple the master’s monitoring with that of its direct slaves. Before I continue to describe some logic, the previous statement is something we must reflect upon.

We should associate the master’s state with that of its direct slaves. Hence we must know which are its direct slaves. We might have slaves D, E, F, G replicating from B, C. They are not in our story. But slaves come and go. Get provisioned and de-provisioned. They get repointed elsewhere. Our monitoring needs to be aware of the state of our replication topology.

My preferred tool for the job is orchestrator, since I author it. It is not a standard monitoring tool and does not serve metrics; but it observes your topologies and records them. And notes changes. And acts as a higher level failure detection mechanism which incorporates the logic described below.

We continue our discussion under the assumption we are able to reliably claim we know our replication topology. Let’s revisit our scenarios from above and then add some.

We will further only require MySQL client protocol connection to our database servers.

Dead master

A “real” dead master is perhaps the clearest failure. MySQL has crashed (signal 11); or the kernel panicked; or the disks failed; or power went off. The server is really not serving. This is observed as:

mysql-topologies-failures (3)

In the holistic approach, we observe that:

  • We cannot reach the master (our MySQL client connection fails).
  • But we are able to connect to the slaves A, B, C
  • And A, B, C are all telling us they cannot connect to the master

We have now cross referenced the death of the master with its three slaves. Funny thing is the MySQL server on the master may still be up and running. Perhaps the master is suffering from some weird network partitioning problem (when I say “weird”, I mean we have it; discussed further below). And perhaps some application is actually still able to talk to the master!

And yet our entire replication topology is broken. Replication is not there for beauty; it serves our application code. And it’s turning stale. Even if by some chance things are still operating on the master, this still makes for a valid failover scenario.

Unreachable master

Compare the above with:

mysql-topologies-failures (4)

Our monitoring scheme cannot reach our master. But it can reach the slaves, an they’re all saying: “I’m happy!”

This gives us suspicion enough to avoid failing over. We may not actually have a problem: it’s just us that are unable to connect to the master.

Right?

There are still interesting use cases. Consider the problem of “Too many connections” on the master. You are unable to connect; the application starts throwing errors; but the slaves are happy. They were there first. They started replicating at the dawn of time, long before there was an issue. Their persistent connections are good to go.

Or the master may suffer a deadlock. A long, blocking ALTER TABLE. An accidental FLUSH TABLES WITH READ LOCK. Or whatever occasional bug we hit. Slaves are still connected; but new connections are hanging; and your monitoring query is unable to process.

And still our holistic approach can find that out: as we are able to connect to our slaves, we are also able to ask them: well what have your relay logs have to say about this? Are we progressing in replication position? Do we actually find application content in the slaves’ relay logs? We can do all this via MySQL protocol (“SHOW SLAVE STATUS”, “SHOW RELAYLOG EVENTS”).

Understanding the topology gives you greater insight into your failure case; you have increasing leevels of confidentiality in your analysis. Strike that: in your automated analysis.

Dead master and slaves

They’re all gone!

mysql-topologies-failures (5)

You cannot reach the master and you cannot reach any of its slaves. Once you are able to associate your master and slaves you can conclude you either have a complete DC power failure problem (or is this cross DC?) or you are having a network partitioning problem. Your application may or may not be affected — but at least you know where to start. Compare with:

Failed DC

mysql-topologies-failures (6)

I’m stretching it now, because when a DC fails all the red lights start flashing. Nonetheless, if M, A, B are all in one DC and C is on another, you have yet another diagnosis.

Dead master and some slaves

mysql-topologies-failures (7)

Things start getting complicated when you’re unable to get an authorized answer from everyone. What happens if the master is dead as well as one of its slaves? We previously expected all slaves to say “we cannot replicate”. For us, master being unreachable, some slaves being dead and all other complaining on IO thread is good enough indication that the master is dead.

All first tier slaves not replicating

mysql-topologies-failures (9)

Not a failover case, but certainly needs to ring the bells. All master’s direct slaves are failing replication on some SQL error or are just stopped. Our topology is turning stale.

Intermediate masters

With intermediate master the situation is not all that different. In the below:

Untitled presentation

The servers E, F, G replicating from C provide us with the holistic view on C. D provides the holistic view on A.

Reducing noise

Intermediate master failover is a much simpler operation than master failover. Changing masters require name resolve changes (of some sort), whereas moving slaves around the topology affects no one.

This implies:

  • We don’t mind over-reacting on failing over intermediate masters
  • We pay with more noise

Sure, we don’t mind failing over D elsewhere, but as D is the only slave of A, it’s enough that D hiccups that we might get an alert (“all” intermediate master’s slaves are not replicating). To that effect orchestrator treats single slave scenarios differently than multiple slaves scenarios.

Not so fun setups and failures

At Booking.com we are in transition between setups. We have some legacy configuration, we have a roadmap, two ongoing solutions, some experimental setups, and/or all of the above combined. Sorry.

Some of our masters are on SAN. We are moving away from this; for those masters on SANs we have cold standbys in an active-passive mode; so master failure -> unmount SAN -> mount SAN on cold standby -> start MySQL on cold standby -> start recovery -> watch some TV -> go shopping -> end recovery.

Only SANs fail, too. When the master fails, switching over to the cold standby is pointless if the origin of the problem is the SAN. And given that some other masters share the same SAN… whoa. As I said we’re moving away from this setup for Pseudo GTID and then for Binlog Servers.

The SAN setup also implied using VIPs for some servers. The slaves reference the SAN master via VIP, and when the cold standby start up it assumes the VIP, and the slaves know nothing about this. Same setup goes for DC masters. What happens when the VIP goes down? MySQL is running happily, but slaves are unable to connect. Does that make for a failover scenario? For intermediate masters we’re pushing it to be so, failing over to a normal local-disk based server; this improves out confidence in non-SAN setups (which we have plenty of, anyhow).

Double checking

You sample your server once every X seconds. But in a failover scenario you want to make sure your data is up to date. When orchestrator suspects a dead master (i.e. cannot reach the master) it immediately contacts its direct slaves and checks their status.

Likewise, when orchestrator sees a first tier slave with broken IO thread, it immediately contacts the master to check if everything is fine.

For intermediate masters orchestrator is not so concerned and does not issue emergency checks.

How to fail over

Different story. Some other time. But failing over makes for complex decisions, based on who the replicating slaves are; with/out log-slave-updates; with-out GTID; with/out Pseudo-GTID; are binlog servers available; which slaves are available in which data centers. Or you may be using Galera (we’re not) which answers most of the above.

Anyway we use orchestrator for that; it knows our topologies, knows how they should look like, understands how to heal them, knows MySQL replication rules, and invokes external processes to do the stuff it doesn’t understand.

]]>
https://shlomi-noach.github.io/blog/mysql/what-makes-a-mysql-server-failurerecovery-case/feed 1 7274