High availability – code.openark.org http://shlomi-noach.github.io/blog/ Blog by Shlomi Noach Tue, 22 May 2018 08:45:32 +0000 en-US hourly 1 https://wordpress.org/?v=5.3.3 32412571 MySQL master discovery methods, part 5: Service discovery & Proxy https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-5-service-discovery-proxy https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-5-service-discovery-proxy#respond Mon, 14 May 2018 08:08:32 +0000 https://shlomi-noach.github.io/blog/?p=7869 This is the fifth in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which the replication failure detection and recovery take place. I will share orchestrator specific configuration/advice, and point out where cross DC orchestrator/raft setup plays part in discovery itself, but for the most part any recovery tool such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via Service discovery and Proxy

Part 4 presented with an anti-pattern setup, where a proxy would infer the identify of the master by drawing conclusions from backend server checks. This led to split brains and undesired scenarios. The problem was the loss of context.

We re-introduce a service discovery component (illustrated in part 3), such that:

  • The app does not own the discovery, and
  • The proxy behaves in an expected and consistent way.

In a failover/service discovery/proxy setup, there is clear ownership of duties:

  • The failover tool own the failover itself and the master identity change notification.
  • The service discovery component is the source of truth as for the identity of the master of a cluster.
  • The proxy routes traffic but does not make routing decisions.
  • The app only ever connects to a single target, but should allow for a brief outage while failover takes place.

Depending on the technologies used, we can further achieve:

  • Hard cut for connections to old, demoted master M.
  • Black/hold off for incoming queries for the duration of failover.

We explain the setup using the following assumptions and scenarios:

  • All clients connect to master via cluster1-writer.example.net, which resolves to a proxy box.
  • We fail over from master M to promoted replica R.

A non planned failover illustration #1

Master M has died, the box had a power failure. R gets promoted in its place. Our recovery tool:

  • Updates service discovery component that R is the new master for cluster1.

The proxy:

  • Either actively or passively learns that R is the new master, rewires all writes to go to R.
  • If possible, kills existing connections to M.

The app:

  • Needs to know nothing. Its connections to M fail, it reconnects and gets through to R.

A non planned failover illustration #2

Master M gets network isolated for 10 seconds, during which time we failover. R gets promoted.

Everything is as before.

If the proxy kills existing connections to M, then the fact M is back alive turns meaningless. No one gets through to M. Clients were never aware of its identity anyhow, just as they are unaware of R‘s identity.

Planned failover illustration

We wish to replace the master, for maintenance reasons. We successfully and gracefully promote R.

  • In the process of promotion, M turned read-only.
  • Immediately following promotion, our failover tool updates service discovery.
  • Proxy reloads having seen the changes in service discovery.
  • Our app connects to R.

Discussion

This is a setup we use at GitHub in production. Our components are:

  • orchestrator for failover tool.
  • Consul for service discovery.
  • GLB (HAProxy) for proxy
  • Consul template running on proxy hosts:
    • listening on changes to Consul’s KV data
    • Regenerate haproxy.cfg configuration file
    • reload haproxy

As mentioned earlier, the apps need not change anything. They connect to a name that is always resolved to proxy boxes. There is never a DNS change.

At the time of failover, the service discovery component must be up and available, to catch the change. Otherwise we do not strictly require it to be up at all times.

For high availability we will have multiple proxies. Each of whom must listen on changes to K/V. Ideally the name (cluster1-writer.example.net in our example) resolves to any available proxy box.

  • This, in itself, is a high availability issue. Thankfully, managing the HA of a proxy layer is simpler than that of a MySQL layer. Proxy servers tend to be stateless and equal to each other.
  • See GLB as one example for a highly available proxy layer. Cloud providers, Kubernetes, two level layered proxies, Linux Heartbeat, are all methods to similarly achieve HA.

See also:

Sample orchestrator configuration

An orchestrator configuration would look like this:

  "ApplyMySQLPromotionAfterMasterFailover": true,
  "KVClusterMasterPrefix": "mysql/master",
  "ConsulAddress": "127.0.0.1:8500",
  "ZkAddress": "srv-a,srv-b:12181,srv-c",
  "PostMasterFailoverProcesses": [
    “/just/let/me/know about failover on {failureCluster}“,
  ],

In the above:

  • If ConsulAddress is specified, orchestrator will update given Consul setup with K/V changes.
  • At 3.0.10, ZooKeeper, via ZkAddress, is still not supported by orchestrator.
  • PostMasterFailoverProcesses is here just to point out hooks are not strictly required for the operation to run.

See orchestrator configuration documentation.

All posts in this series

]]>
https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-5-service-discovery-proxy/feed 0 7869
MySQL master discovery methods, part 4: Proxy heuristics https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-4-proxy-heuristics https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-4-proxy-heuristics#respond Thu, 10 May 2018 06:10:34 +0000 https://shlomi-noach.github.io/blog/?p=7867 Note: the method described here is an anti pattern

This is the fourth in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which the replication failure detection and recovery take place. I will share orchestrator specific configuration/advice, and point out where cross DC orchestrator/raft setup plays part in discovery itself, but for the most part any recovery tool such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via Proxy Heuristics

In Proxy Heuristics all clients connect to the master through a proxy. The proxy observes the backend MySQL servers and determines who the master is.

This setup is simple and easy, but is an anti pattern. I recommend against using this method, as explained shortly.

Clients are all configured to connect to, say, cluster1-writer.proxy.example.net:3306. The proxy will intercept incoming requests either based on hostname or by port. It is aware of all/some MySQL backend servers in that cluster, and will route traffic to the master M.

A simple heuristic that I’ve seen in use is: pick the server that has read_only=0, a very simple check.

Let’s take a look at how this works and what can go wrong.

A non planned failover illustration #1

Master M has died, the box had a power failure. R gets promoted in its place. Our recovery tool:

  • Fails over, but doesn’t need to run any hooks.

The proxy:

  • Knows both about M and R.
  • Notices M fails health checks (select @@global.read_only returns error since the box is down).
  • Notices R reports healthy and with read_only=0.
  • Routes all traffic to R.

Success, we’re happy.

Configuration tip

With an automated failover solution, use read_only=1 in my.cnf at all times. Only the failover solution will set a server to read_only=0.

With this configuration, when M restarts, MySQL starts up as read_only=1.

A non planned failover illustration #2

Master M gets network isolated for 10 seconds, during which time we failover. R gets promoted. Our tool:

  • Fails over, but doesn’t need to run any hooks.

The proxy:

  • Knows both about M and R.
  • Notices M fails health checks (select @@global.read_only returns error since the box is down).
  • Notices R reports healthy and with read_only=0.
  • Routes all traffic to R.
  • 10 seconds later M comes back to life, claiming read_only=0.
  • The proxy now sees two servers reporting as healthy and with read_only=0.
  • The proxy has no context. It does not know why both are reporting the same. It is unaware of failovers. All it sees is what the backend MySQL servers report.

Therein lies the problem: you can not trust multiple servers (MySQL backends) to deterministically pick a leader (the master) without them collaborating on some elaborate consensus communication.

A non planned failover illustration #3

Master M box is overloaded, issuing too many connections for incoming connections.

Our tool decides to failover.

  • And doesn’t need to run any hooks.

The proxy:

  • Notices M fails health checks (select @@global.read_only does not respond because of the load).
  • Notices R reports healthy and with read_only=0.
  • Routes all traffic to R.
  • Shortly followed by M recovering (since no more writes are sent its way), claiming read_only=0.
  • The proxy now sees two servers reporting as healthy and with read_only=0.

Again, the proxy has no context, and neither do M and R, for that matter. The context (the fact we failed over from M to R) was known to our failover tool, but was lost along the way.

Planned failover illustration

We wish to replace the master, for maintenance reasons. We successfully and gracefully promote R.

  • M is available and responsive, we set it to read_only=1.
  • We set R to read_only=0.
  • All new connections route to R.
  • We should also instruct our Proxy to kill all previous connections to M.

This works very nicely.

Discussion

There is a substantial risk to this method. Correlation between failover and network partitioning/load (illustrations #2 and #3) is reasonable.

The root of the problem is that we expect individual servers to resolve conflicts without speaking to each other: we expect the MySQL servers to correctly claim “I’m the master” without context.

We then add to that problem by using the proxy to “pick a side” without giving it any context, either.

Sample orchestrator configuration

By way of discouraging use of this method I do not present an orchestrator configuration file.

All posts in this series

]]>
https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-4-proxy-heuristics/feed 0 7867
MySQL master discovery methods, part 3: app & service discovery https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-3-app-service-discovery https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-3-app-service-discovery#respond Tue, 08 May 2018 08:02:19 +0000 https://shlomi-noach.github.io/blog/?p=7865 This is the third in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which the replication failure detection and recovery take place. I will share orchestrator specific configuration/advice, and point out where cross DC orchestrator/raft setup plays part in discovery itself, but for the most part any recovery tool such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

App & service discovery

Part 1 and part 2 presented solutions where the app remained ingorant of master’s identity. This part takes a complete opposite direction and gives the app ownership on master access.

We introduce a service discovery component. Commonly known are Consul, ZooKeeper, etcd, highly available stores offering key/value (K/V) access, leader election or full blown service discovery & health.

We satisfy ourselves with K/V functionality. A key would be mysql/master/cluster1 and a value would be the master’s hostname/port.

It is the app’s responsibility at all times to fetch the identity of the master of a given cluster by querying the service discovery component, thereby opening connections to the indicated master.

The service discovery component is expected to be up at all times and to contain the identity of the master for any given cluster.

A non planned failover illustration #1

Master M has died. R gets promoted in its place. Our recovery tool:

  • Updates the service discovery component, key is mysql/master/cluster1, value is R‘s hostname.

Clients:

  • Listen on K/V changes, recognize that master’s value has changed.
  • Reconfigure/refresh/reload/do what it takes to speak to new master and to drop connections to old master.

A non planned failover illustration #2

Master M gets network isolated for 10 seconds, during which time we failover. R gets promoted. Our tool (as before):

  • Updates the service discovery component, key is mysql/master/cluster1, value is R‘s hostname.

Clients (as before):

  • Listen on K/V changes, recognize that master’s value has changed.
  • Reconfigure/refresh/reload/do what it takes to speak to new master and to drop connections to old master.
  • Any changes not taking place in a timely manner imply some connections still use old master M.

Planned failover illustration

We wish to replace the master, for maintenance reasons. We successfully and gracefully promote R.

  • App should start connecting to R.

Discussion

The app is the complete owner. This calls for a few concerns:

  • How does a given app refresh and apply the change of master such that no stale connections are kept?
    • Highly concurrent apps may be more difficult to manage.
  • In a polyglot app setup, you will need all clients to use the same setup. Implement same listen/refresh logic for Ruby, golang, Java, Python, Perl and notably shell scripts.
    • The latter do not play well with such changes.
  • How can you validate that the change of master has been detected by all app nodes?

As for the service discovery:

  • What load will you be placing on your service discovery component?
    • I was familiar with a setup where there were so many apps and app nodes and app instances, such that the amount of connections was too much for the service discovery . In that setup caching layers were created, which introduced their own consistency problems.
  • How do you handle service discovery outage?
    • A reasonable approach is to keep using last known master idendity should service discovery be down. This, again, plays better wih higher level applications, but less so with scripts.

It is worth noting that this setup does not suffer from geographical limitations to the master’s identity. The master can be anywhere; the service discovery component merely points out where the master is.

Sample orchestrator configuration

An orchestrator configuration would look like this:

  "ApplyMySQLPromotionAfterMasterFailover": true,
  "KVClusterMasterPrefix": "mysql/master",
  "ConsulAddress": "127.0.0.1:8500",
  "ZkAddress": "srv-a,srv-b:12181,srv-c",
  "PostMasterFailoverProcesses": [
    “/just/let/me/know about failover on {failureCluster}“,
  ],

In the above:

  • If ConsulAddress is specified, orchestrator will update given Consul setup with K/V changes.
  • At 3.0.10, ZooKeeper, via ZkAddress, is still not supported by orchestrator.
  • PostMasterFailoverProcesses is here just to point out hooks are not strictly required for the operation to run.

See orchestrator configuration documentation.

All posts in this series

]]>
https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-3-app-service-discovery/feed 0 7865
MySQL master discovery methods, part 2: VIP & DNS https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-2-vip-dns https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-2-vip-dns#comments Mon, 07 May 2018 06:46:22 +0000 https://shlomi-noach.github.io/blog/?p=7863 This is the second in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which the replication failure detection and recovery take place. I will share orchestrator specific configuration/advice, and point out where cross DC orchestrator/raft setup plays part in discovery itself, but for the most part any recovery tool such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via VIP

In part 1 we saw that one the main drawbacks of DNS discovery is the time it takes for the apps to connect to the promoted master. This is the result of both DNS deployment time as well as client’s TTL.

A quicker method is offered: use of VIPs (Virtual IPs). As before, apps would connect to cluster1-writer.example.net, cluster2-writer.example.net, etc. However, these would resolve to specific VIPs.

Say cluster1-writer.example.net resolves to 10.10.0.1. We let this address float between servers. Each server has its own IP (say 10.20.0.XXX) but could also potentially claim the VIP 10.10.0.1.

VIPs can be assigned by switches and I will not dwell into the internals, because I’m not a network expert. However, the following holds:

  • Acquiring a VIP is a very quick operation.
  • Acquiring a VIP must take place on the acquiring host.
  • A host may be unable to acquire a VIP should another host holds the same VIP.
  • A VIP can only be assigned within a bounded space: hosts connected to the same switch; hosts in the same Data Center or availability zone.

A non planned failover illustration #1

Master M has died, the box had a power failure. R gets promoted in its place. Our recovery tool:

  • Attempts to connect to M so that it can give up the VIP. The attempt fails because M is dead.
  • Connects to R and instructs it to acquire the VIP. Since M is dead there is no objection, and R successfully grabs the VIP.
  • Any new connections immediately route to the new master R.
  • Clients with connections to M cannot connect, issue retries, immediately route to R.

A non planned failover illustration #2

Master M gets network isolated for 30 seconds, during which time we failover. R gets promoted. Our tool:

  • Attempts to connect to M so that it can give up the VIP. The attempt fails because M is network isolated.
  • Connects to R and instructs it to acquire the VIP. Since M is network isolated there is no objection, and R successfully grabs the VIP.
  • Any new connections immediately route to the new master R.
  • Clients with connections to M cannot connect, issue retries, immediately route to R.
  • 30 seconds later M reappears, but no one pays any attention.

A non planned failover illustration #3

Master M box is overloaded. It is not responsive to new connections but may slowly serves existing connections. Our tool decides to failover:

  • Attempts to connect to M so that it can give up the VIP. The attempt fails because M is very loaded.
  • Connects to R and instructs it to acquire the VIP. Unfortunately, M hasn’t given up the VIP and still shows up as owning it.
  • All existing and new connections keep on routing to M, even as R is the new master.
  • This continues until some time has passed and we are able to manually grab the VIP on R, or until we forcibly network isolate M or forcibly shut it down.

We suffer outage.

Planned failover illustration

We wish to replace the master, for maintenance reasons. We successfully and gracefully promote R.

  • M is available and responsive, we ask it to give up the VIP, which is does.
  • We ask R to grab the VIP, which it does.
  • All new connections route to R.
  • We may still see old connections routing to M. We can forcibly network isolate M to break those connections so as to cause reconnects, or restart apps.

Discussion

As with DNS discovery, the apps are never told of the change. They may be forcibly restarted though.

Grabbing a VIP is a quick operation. However, consider:

  • It is not guaranteed to succeed. I have seen it fail in various situations.
  • Since releasing/acquiring of VIP can only take place on the demoted/promoted servers, respectively, our failover tool will need to:
    • Remote SSH onto both boxes, or
    • Remote exec a command on those boxes
  • Moreover, the tool will do so sequentially. First we must connect to demoted master to give up the VIP, then to promoted master to acquire it.
  • This means the time at which the new master grabs the VIP depends on how long it takes to connect to the old master to give up the VIP. Seeing that the old master had trouble causing failover, we can expect correlation to not being able to connect to old master, or seeing slow connect time.
  • An alternative exists, in the form of Pacemaker. Consider Percona’s Replication Manager guide for more insights. Pacemaker provides a single point of access from where the VIP can be moved, and behind the scenes it will communicate to relevant nodes. This makes it simpler on the failover solution configuration.
  • We are constrained by physical location.
  • It is still possible for existing connection to keep on communicating to the demoted master, even while the VIP has been moved.

VIP & DNS combined

Per physical location, we could choose to use VIP. But should we need to failover to a server in another DC, we could choose to combine the DNS discovery, discussed in part 1.

We can expect to see faster failover time on a local physical location, and longer failover time on remote location.

Sample orchestrator configuration

What kind of remote exec method will you have? In this sample we will use remote (passwordless) SSH.

An orchestrator configuration would look like this:

  "ApplyMySQLPromotionAfterMasterFailover": true,
  "PostMasterFailoverProcesses": [
    "ssh {failedHost} 'sudo ifconfig the-vip-interface down'",
    "ssh {successorHost} 'sudo ifconfig the-vip-interface up'",
    "/do/what/you/gotta/do to apply dns change for {failureClusterAlias}-writer.example.net to {successorHost}"
  ],  

In the above:

  • Replace SSH with any remote exec method you may use.
    • But you will need to set up the access/credentials for orchestrator to run those operations.
  • Replace ifconfig with service quagga stop/start or any method you use to release/grab VIPs.

See orchestrator configuration documentation.

All posts in this series

]]>
https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-2-vip-dns/feed 3 7863
MySQL master discovery methods, part 1: DNS https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-1-dns https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-1-dns#respond Thu, 03 May 2018 10:56:46 +0000 https://shlomi-noach.github.io/blog/?p=7861 This is the first in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which the replication failure detection and recovery take place. I will share orchestrator specific configuration/advice, and point out where cross DC orchestrator/raft setup plays part in discovery itself, but for the most part any recovery tool such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via DNS

In DNS master discovery applications connect to the master via a name that gets resolved to the master’s box. By way of example, apps would target the masters of different clusters by connecting to cluster1-writer.example.net, cluster2-writer.example.net, etc. It is up for the DNS to resolve those names to IPs.

Issues for concern are:

  • You will likely have multiple DNS servers. How many? In which data centers / availability zones?
  • What is your method for distributing/deploying a name change to all your DNS servers?
  • DNS will indicate a TTL (Time To Live) such that clients can cache the IP associated with a name for a given number of seconds. What is that TTL?

As long as things are stable and going well, discovery via DNS makes sense. Trouble begins when the master fails over. Assume M used to be the master, but got demoted. Assume R used to be a replica, that got promoted and is now effectively the master of the topology.

Our failover solution has promoted R, and now needs to somehow apply the change, such that the apps connect to R instead of M. Some notes:

  • The apps need not change configuration. They should still connect to cluster1-writer.example.net, cluster2-writer.example.net, etc.
  • Our tool instructs DNS servers to make the change.
  • Clients will still resolve to old IP based on TTL.

A non planned failover illustration #1

Master M dies. R gets promoted. Our tool instructs all DNS servers on all DCs to update the IP address.

Say TTL is 60 seconds. Say update to all DNS servers takes 10 seconds. We will have between 10 and 70 seconds until all clients connect to the new master R.

During that time they will continue to attempt connecting to M. Since M is dead, those attempts will fail (thankfully).

A non planned failover illustration #2

Master M gets network isolated for 30 seconds, during which time we failover. R gets promoted. Our tool instructs all DNS servers on all DCs to update the IP address.

Again, assume TTL is 60 seconds. As before, it will take between 10 and 70 seconds for clients to learn of the new IP.

Clients who will require between 40 and 70 seconds to learn of the new IP will, however, hit an unfortunate scenario: the old master M reappears on the grid. Those clients will successfully reconnect to M and issue writes, leading to data loss (writes to M no longer replicate anywhere).

Planned failover illustration

We wish to replace the master, for maintenance reasons. We successfully and gracefully promote R. We need to change DNS records. Since this is a planned failover, we set the old master to read_only=1, or even better, we network isolated it.

And still our clients take 10 to 70 seconds to recognize the new master.

Discussion

The above numbers are just illustrative. Perhaps DNS deployment is quicker than 10 seconds. You should do your own math.

TTL is a compromise which you can tune. Setting lower TTL will mitigate the problem, but will cause more hits on the DNS servers.

For planned takeover we can first deploy a change to the TTL, to, say, 2sec, wait 60sec, then deploy the IP change, then restore TTL to 60.

You may choose to restart apps upon DNS deployment. This emulates apps’ awareness of the change.

Sample orchestrator configuration

orchestrator configuration would look like this:

  "ApplyMySQLPromotionAfterMasterFailover": true,
  "PostMasterFailoverProcesses": [
    "/do/what/you/gotta/do to apply dns change for {failureClusterAlias}-writer.example.net to {successorHost}"
  ],  

In the above:

  • ApplyMySQLPromotionAfterMasterFailover instructs orchestrator to set read_only=0; reset slave all on promoted server.
  • PostMasterFailoverProcesses really depends on your setup. But orchestrator will supply with hints to your scripts: identity of cluster, identity of successor.

See orchestrator configuration documentation.

All posts in this series

]]>
https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-1-dns/feed 0 7861
orchestrator 3.0.6: faster crash detection & recoveries, auto Pseudo-GTID, semi-sync and more https://shlomi-noach.github.io/blog/mysql/orchestrator-3-0-6-faster-crash-detection-recoveries-auto-pseudo-gtid-semi-sync-and-more https://shlomi-noach.github.io/blog/mysql/orchestrator-3-0-6-faster-crash-detection-recoveries-auto-pseudo-gtid-semi-sync-and-more#respond Mon, 29 Jan 2018 09:40:05 +0000 https://shlomi-noach.github.io/blog/?p=7833 orchestrator 3.0.6 is released and includes some exciting improvements and features. It quickly follows up on 3.0.5 released recently, and this post gives a breakdown of some notable changes:

Faster failure detection

Recall that orchestrator uses a holistic approach for failure detection: it reads state not only from the failed server (e.g. master) but also from its replicas. orchestrator now detects failure faster than before:

  • A detection cycle has been eliminated, leading to quicker resolution of a failure. On our setup, where we poll servers every 5sec, failure detection time dropped from 7-10sec to 3-5sec, keeping reliability. The reduction in time does not lead to increased false positives.
    Side note: you may see increased not-quite-failure analysis such as “I can’t see the master” (UnreachableMaster).
  • Better handling of network scenarios where packets are dropped. Instead of hanging till TCP timeout, orchestrator now observes server discovery asynchronously. We have specialized failover tests that simulate dropped packets. The change reduces detection time by some 5sec.

Faster master recoveries

Promoting a new master is a complex task which attempts to promote the best replica out of the pool of replicas. It’s not always the most up-to-date replica. The choice varies depending on replica configuration, version, and state.

With recent changes, orchestrator is able to to recognize, early on, that the replica it would like to promote as master is ideal. Assuming that is the case, orchestrator is able to immediate promote it (i.e. run hooks, set read_only=0 etc.), and run the rest of the failover logic, i.e. the rewiring of replicas under the newly promoted master, asynchronously.

This allows the promoted server to take writes sooner, even while its replicas are not yet connected. It also means external hooks are executed sooner.

Between faster detection and faster recoveries, we’re looking at some 10sec reduction in overall recovery time: from moment of crash to moment where a new master accepts writes. We stand now at < 20sec in almost all cases, and < 15s in optimal cases. Those times are measured on our failover tests.

We are working on reducing failover time unrelated to orchestrator and hope to update soon.

Automated Pseudo-GTID

As reminder, Pseudo-GTID is an alternative to GTID, without the kind of commitment you make with GTID. It provides similar “point your replica under any other server” behavior GTID allows.

There’s still many setups out there where GTID is not (yet?) deployed and enabled. However, Pseudo-GTID is often misunderstood, and though I’ve blogged and presented Pseudo-GTID many times in the past, I still find myself explaining to people the setup is simple and does not involve change to one’s topologies.

Well, it just got simpler. orchestrator is now able to automatically inject Pseudo-GTID for you.

Say the word: "AutoPseudoGTID": true, grant the necessary privilege, and your non-GTID topology is suddenly supercharged with magical Pseudo-GTID tokens that provide you with:

  • Arbitrary relocation of replicas
  • Automated or manual failovers (masters and intermediate masters)
  • Vendor freedom: runs on Oracle MySQL, Percona Server, MariaDB, or all of the above at the very same time.
  • Version freedom (still on 5.5? No problem. Oh, this gets you crash-safe replication as extra bonus, too)

Auto-Pseudo-GTID further simplifies the infrastructure in that you no longer need to take care of injecting Pseudo-GTID onto the master as well as handle master identity changes. No more event_scheduler to enable/disable nor services to start/stop.

More and more setups are moving to GTID. We may, too! But I find it peculiar that Pseudo-GTID was suggested 4 years ago, when 5.6 GTID was already released, and still many setups are not yet running GTID. If you’re not using GTID, please try Pseudo-GTID! Read more.

Semi-sync support

Semi-sync has been internally supported via a specialized patch contributed by Vitess, to flag a server as semi-sync-able and handle enablement of semi-sync upon master failover.

orchestrator now supports semi-sync more generically. You may use orchestrator to enable/disable semi-sync master/replica side, via orchestrator -c enable-semi-sync-master, orchestrator -c enable-semi-sync-replica, orchestrator -c disable-semi-sync-master, orchestrator -c disable-semi-sync-replica commands (or API equivalent).

The API will also tell you whether semi-sync is enabled on instances. Noteworthy that configured != enabled. A server can be configured with rpl_semi_sync_master_enabled=ON, but if no semi-sync replicas are found, the Rpl_semi_sync_master_status state is OFF.

More

UI changes, removal of prepared statements, documentation updates, raft updates…

orchestrator is free and open source and released under the Apache 2 license. It is authored at and used by GitHub.

I’ll be presenting orchestrator/raft in FOSDEM next week, at the MySQL and Friends Room.

]]>
https://shlomi-noach.github.io/blog/mysql/orchestrator-3-0-6-faster-crash-detection-recoveries-auto-pseudo-gtid-semi-sync-and-more/feed 0 7833
orchestrator/raft: Pre-Release 3.0 https://shlomi-noach.github.io/blog/mysql/orchestratorraft-pre-release-3-0 https://shlomi-noach.github.io/blog/mysql/orchestratorraft-pre-release-3-0#comments Thu, 03 Aug 2017 08:41:11 +0000 https://shlomi-noach.github.io/blog/?p=7740 orchestrator 3.0 Pre-Release is now available. Most notable are Raft consensus, SQLite backend support, orchestrator-client no-binary-required client script.

TL;DR

You may now set up high availability for orchestrator via raft consensus, without need to set up high availability for orchestrator‘s backend MySQL servers (such as Galera/InnoDB Cluster). In fact, you can run a orchestrator/raft setup using embedded SQLite backend DB. Read on.

orchestrator still supports the existing shared backend DB paradigm; nothing dramatic changes if you upgrade to 3.0 and do not configure raft.

orchestrator/raft

Raft is a consensus protocol, supporting leader election and consensus across a distributed system.  In an orchestrator/raft setup orchestrator nodes talk to each other via raft protocol, form consensus and elect a leader. Each orchestrator node has its own dedicated backend database. The backend databases do not speak to each other; only the orchestrator nodes speak to each other.

No MySQL replication setup needed; the backend DBs act as standalone servers. In fact, the backend server doesn’t have to be MySQL, and SQLite is supported. orchestrator now ships with SQLite embedded, no external dependency needed.

In a orchestrator/raft setup, all orchestrator nodes talk to each other. One and only one is elected as leader. To become a leader a node must be part of a quorum. On a 3 node setup, it takes 2 nodes to form a quorum. On a 5 node setup, it takes 3 nodes to form a quorum.

Only the leader will run failovers. This much is similar to the existing shared-backend DB setup. However in a orchestrator/raft setup each node is independent, and each orchestrator node runs discoveries. This means a MySQL server in your topology will be routinely visited and probed by not one orchestrator node, but by all 3 (or 5, or what have you) nodes in your raft cluster.

Any communication to orchestrator must take place through the leader. One may not tamper directly with the backend DBs anymore, since the leader is the one authoritative entity to replicate and announce changes to its peer nodes. See orchestrator-client section following.

For details, please refer to the documentation:

The orchetrator/raft setup comes to solve several issues, the most obvious is high availability for the orchestrator service: in a 3 node setup any single orchestrator node can go down and orchestrator will reliably continue probing, detecting failures and recovering from failures.

Another issue solve by orchestrator/raft is network isolation, in particularly cross-DC, also refered to as fencing. Some visualization will help describe the issue.

Consider this 3 data-center replication setup. The master, along with a few replicas, resides on DC1. Two additional DCs have intermediate masters, aka local-masters, that relay replication to local replicas.

We place 3 orchestrator nodes in a raft setup, each in a different DC. Note that traffic between orchestrator nodes is very low, and cross DC latencies still conveniently support the raft communication. Also note that backend DB writes have nothing to do with cross-DC traffic and are unaffected by latencies.

Consider what happens if DC1 gets network isolated: no traffic in or out DC1

Each orchestrator nodes operates independently, and each will see a different state. DC1’s orchestrator will see all servers in DC2, DC3 as dead, but figure the master itself is fine, along with its local DC1 replicas:

However both orchestrator nodes in DC2 and DC3 will see a different picture: they will see all DC1’s servers as dead, with local masters in DC2 and DC3 having broken replication:

Who gets to choose?

In the orchestrator/raft setup, only the leader runs failovers. The leader must be part of a quorum. Hence the leader will be an orchestrator node in either DC2 or DC3. DC1’s orchestrator will know it is isolated, that it isn’t part of the quorum, hence will step down from leadership (that’s the premise of the raft consensus protocol), hence will not run recoveries.

There will be no split brain in this scenario. The orchestrator leader, be it in DC2 or DC3, will act to recover and promote a master from within DC2 or DC3. A possible outcome would be:

What if you only have 2 data centers?

In such case it is advisable to put two orchestrator nodes, one in each of your DCs, and a third orchestrator node as a mediator, in a 3rd DC, or in a different availability zone. A cloud offering should do well:

The orchestrator/raft setup plays nice and allows one to nominate a preferred leader.

SQLite

Suggested and requested by many, is to remove orchestrator‘s own dependency on a MySQL backend. orchestrator now supports a SQLite backend.

SQLite is a transactional, relational, embedded database, and as of 3.0 it is embedded within orchestrator, no external dependency required.

SQLite doesn’t replicate, doesn’t support client/server protocol. As such, it cannot work as a shared database backend. SQLite is only available on:

  • A single node setup: good for local dev installations, testing server, CI servers (indeed, SQLite now runs in orchestrator‘s CI)
  • orchestrator/raft setup, where, as noted above, backend DBs do not communicate with each other in the first place and are each dedicated to their own orchestrator node.

It should be pointed out that SQLite is a great transactional database, however MySQL is more performant. Load on backend DB is directly (and mostly linearly) affected by the number of probed servers. If you have 50 servers in your topologies or 500 servers, that matters. The probing frequency of course also matters for the write frequency on your backend DB. I would suggest if you have thousands of backend servers, to stick with MySQL. If dozens, SQLite should be good to go. In between is a gray zone, and at any case run your own experiments.

At this time SQLite is configured to commit to file; there is a different setup where SQLite places data in-memory, which makes it faster to execute. Occasional dumps required for durability. orchestrator may support this mode in the future.

orchestrator-client

You install orchestrator as a service on a few boxes; but then how do you access it from other hosts?

  • Either curl the orchestrator API
  • Or, as most do, install orchestrator-cli package, which includes the orchestrator binary, everywhere.

The latter implies:

  • Having the orchestrator binary installed everywhere, hence updated everywhere.
  • Having the /etc/orchestrator.conf.jsondeployed everywhere, along with credentials.

The orchestrator/raft setup does not support running orchestrator in command-line mode. Reason: in this mode orchestrator talks directly to the shared backend DB. There is no shared backend DB in the orchestrator/raft setup, and all communication must go through the leader service. This is a change of paradigm.

So, back to curling the HTTP API. Enter orchestrator-client which mimics the command line interface, while running curl | jq requests against the HTTP API. orchestrator-client, however, is just a shell script.

orchestrator-client will work well on either orchestrator/raft or on your existing non-raft setups. If you like, you may replace your remote orchestrator installations and your /etc/orchestrator.conf.json deployments with this script. You will need to provide the script with a hint: the $ORCHESTRATOR_API environment variable should be set to point to the orchestrator HTTP API.

Here’s the fun part:

  • You will either have a proxy on top of your orchestrator service cluster, and you would export ORCHESTRATOR_API=http://my.orchestrator.service/api
  • Or you will provide orchestrator-client with all orchestrator node identities, as in export ORCHESTRATOR_API="https://orchestrator.host1:3000/api https://orchestrator.host2:3000/api https://orchestrator.host3:3000/api" .
    orchestrator-client will figure the identity of the leader and will forward requests to the leader. At least scripting-wise, you will not require a proxy.

Status

orchestrator 3.0 is a Pre-Release. We are running a mostly-passive orchestrator/raft setup in production. It is mostly-passive in that it is not in charge of failovers yet. Otherwise it probes and analyzes our topologies, as well as runs failure detection. We will continue to improve operational aspects of the orchestrator/raft setup (see this issue).

 

]]>
https://shlomi-noach.github.io/blog/mysql/orchestratorraft-pre-release-3-0/feed 4 7740
What’s so complicated about a master failover? https://shlomi-noach.github.io/blog/mysql/whats-so-complicated-about-a-master-failover https://shlomi-noach.github.io/blog/mysql/whats-so-complicated-about-a-master-failover#comments Thu, 29 Jun 2017 15:01:58 +0000 https://shlomi-noach.github.io/blog/?p=7726 The more work on orchestrator, the more user input and the more production experience, the more insights I get into MySQL master recoveries. I’d like to share the complexities in correctly running general-purpose master failovers; from picking up the right candidates to finalizing the promotion.

The TL;DR is: we’re often unaware of just how things can turn at the time of failover, and the impact of every single decision we make. Different environments have different requirements, and different users wish to have different policies. Understanding the scenarios can help you make the right choice.

The scenarios and considerations below are ones I picked while browsing through the orchestrator code and through Issues and questions. There are more. There are always more scenarios.

I discuss “normal replication” scenarios below; some of these will apply to synchronous replication setups (Galera, XtraDB Cluster, InnoDB Cluster) where using cross DC, where using intermediate masters, where working in an evolving environment.

orchestrator-wise, please refer to “MySQL High Availability tools” followup, the missing piece: orchestrator, an earlier post. Some notions from that post are re-iterated here.

Who to promote?

Largely covered by the missing piece post (skip this section if you’ve read said post), consider the following:

  • You run with a mixed versions setup. Your master is 5.6, most your replicas are 5.6 but you’ve upgraded a couple replicas to 5.7. You must not promote those 5.7 servers since you cannot replicate 5.7->5.6. You may lose these servers upon failover.
  • But perhaps by now you’ve upgraded most of your replicas to 5.7, in which case you prefer to promote a 5.7 server in the event the alternative is losing your 5.7 fleet.
  • You run with both STATEMENT based replication and ROW based. You must not promote a ROW based replica because a STATEMENT based server cannot replicate from it. You may lose ROW servers during the failover.
  • But perhaps by now you’ve upgraded most of your replicas to ROW, in which case you prefer to promote a ROW server in the event the alternative is losing your ROW fleet.
  • Some servers can’t be promoted because they don’t use binary logging or log_slave_updates. They could be lost in action.

Noteworthy that MHA solves the above by syncing relay logs across the replicas. I had an attempt at doing the same for orchestrator but was unsatisfied with the results and am wary of hidden assumptions. I do not expect to continue working on that.

Intermediate masters

Recovery of intermediate masters, while simpler, also adds many more questions to the table.

  • An intermediate master crashes. What is the correct handling? Should you move all of its orphaned replicas under another server? Or does this group of replicas form some pact that must stick together? Perhaps you insist on promoting one of them on top of its siblings.
  • On failure, you are likely to prefer promoted server from same DC; this will have least impact on your application.
    • But are you willing to lose a server or two to make that so? Or do you prefer switching to a different DC and not lose any server in the process?
    • A specific, large orchestrator user actually wants to failover to a different DC. Not only that, the customer then prefers flipping all other cluster masters to the other DC (a full-DC failover)
  • An intermediate master had replication filters (scenario: you were working to extract and refactor a subtree onto its own cluster, but the IM crashed before you did so)
    • What do you do? Did you have all subsequent replicas run with same filters?
    • If not, do you have the playbook to do so at time of failure?
  • An intermediate master was writable. It crashed. What do you do? Who is a legitimate replacement? How can you even reconnect the subtree with the main tree?

Candidates

  • Do you prefer some servers over others? Some servers have stronger hardware; you’d like them to be promoted if possible
    • orchestrator can juggle with that to some extent.
  • Are there servers you never want to promote? Servers used by developers; used for backups with open logical volumes; weaker hardware; a DC you don’t want to failover to; …
    • But then again, maybe it’s fine if those servers act as intermediate masters? So that they must not be promoted as masters, but are good to participate in an intermediate master failover?
  • How do you even define the promotion types for those servers?
    • We strongly prefer to do this live. Service discovery dictates the type of “promotion rule”; a recurring cronjob keeps updating orchestrator with the server’s choice of promotion rule.
    • We strongly discourage configuration based rules, unless for servers which are obviously-never-promote.

What to respond to?

What is a scenario that kicks a failover?

Relate to the missing piece post for the holistic approach orchestrator takes to make a reliable detection. But regardless, do you want to take action where:

  • The master is completely dead and everyone sees that and agrees? (resounding yes)
  • The master is dead to the application but replication seems to be working? (master is at deadlock, but replication threads seem to be happy)
  • The master is half-dead to the application? (no new connections; old connections includign replication connections keep on running!)
  • A DC is network partitioned, the master is alive with some replicas in that DC; but the majority of the replicas are in other DCs, unable to see the master?
    • Is this a question of majority? Of DC preference? Is there at all an answer?

Upon promotion

Most people expect orchestrator to RESET SLAVE ALL and SET read_only=0 upon promotion. This is possible, but the default is not to do so. Why?

  • What if your promoted server still has unapplied relay logs? This can happen in the event all replicas were lagging at the time of master failure. Do you prefer:
    • To promote, RESET SLAVE ALL and lose all those relay logs? You gain availability at the expense of losing data.
    • To wait till SQL_THREAD has consumed the logs? You keep your data at the expense of availability.
    • To abort? You let a human handle this; this is likely to take more time.
  • What do you do with delayed/slow replicas? It could take a while to connect them back to the promoted master. Do you prefer:
    • Waiting for them to connect; delay promotion
    • Advertise new master, then asynchronously work to connect them: you may have improved availability, but at reduced capacity, which is an availability issue in itself.

Flapping

You wish to avoid flapping. A scenario could be that you’re placing such load on your master that it crashes; the next server to promote as master will have the same load, will similarly crash. You do not wish to exhaust your fleet.

What makes a reasonable anti-flapping rule? Options:

  • Upon any failure in a cluster, block any other failover on that cluster for X minutes
    • What happens if a major power down issue requires two or three failovers on the same cluster?
  • Only block further master failovers, but allow intermediate master failovers as much as needed
    • There could be intermediate master exhaustion, as well
  • Only allow one of each (master and intermediate master), then block for X minutes?
  • Allow a burst of failovers for a duration of N seconds, then block for X minutes?

Detection spam

You don’t always take action on failure detection. Maybe you’re blocked via anti-flapping on an earlier failure; or have configured to not automatically failover.

Detection is the basis to, but independent of failover. You should have detection in place even if not failing over.

You wish to run detection continuously.

  • So if failover does not take place, detection will keep noticing the same problem again and again.
    • You get spammed by alerts
  • Only detect once?
    • Been there. When you really need that detection to alert you find out it alerted once 6 hours ago and you ignored it because it was not actionable at the time.
  • Only detect a change in diagnosis?
    • Been there. Diagnosis itself can flap back and forth. You get noise.
  • Block detection for X minutes?
    • What is a good tradeoff between noise and visibility?

Back to life

And old sub-tree comes back to life. Scenario: DC power failure.

This subtree claims “I’m cluster X”. What happens?

  • Your infrastructure needs to have memory and understanding that there’s already a running cluster called X.
  • Any failures within that subtree must not be interpreted as a “cluster X failure”, or else you kick a cluster X failover when there is, in fact, a running cluster X. See this PR and related links. At this time orchestrator handles this scenario correctly.
  • When do you consider a previously-dead server to be alive and well?
    • I do mean, automatically and reliably so? See same PR for orchestrator‘s take.

User control

Failover tooling must always let a human decide something is broken. It must always allow for an urgent failover, even if nothing seems to be wrong to the system.

Isn’t this just so broken? Isn’t synchronous replication the answer?

The world is broken, and distributed systems are hard.

Synchronous replication is an answer, and solves many (I think) of the above issues, creating its own issues, but I’m not an expert on that.

However noteworthy that when some people think about synchronous replication they forget about cross-DC replication and cross-DC failovers,on upgrades and experiments. The moment you put intermediate masters at play, you’re almost back to square one with many of the above questions again applicable to your use case.

]]>
https://shlomi-noach.github.io/blog/mysql/whats-so-complicated-about-a-master-failover/feed 4 7726
“MySQL High Availability tools” followup, the missing piece: orchestrator https://shlomi-noach.github.io/blog/mysql/mysql-high-availability-tools-followup-the-missing-piece-orchestrator https://shlomi-noach.github.io/blog/mysql/mysql-high-availability-tools-followup-the-missing-piece-orchestrator#comments Thu, 06 Apr 2017 12:56:45 +0000 https://shlomi-noach.github.io/blog/?p=7690 I read with interest MySQL High Availability tools – Comparing MHA, MRM and ClusterControl by SeveralNines. I thought there was a missing piece in the comparison: orchestrator, and that as result the comparion was missing scope and context.

I’d like to add my thoughts on topics addressed in the post. I’m by no means an expert on MHA, MRM or ClusterControl, and will mostly focus on how orchestrator tackles high availability issues raised in the post.

What this is

This is to add insights on the complexity of failovers. Over the duration of three years, I always think I’ve seen it all, and then get hit by yet a new crazy scenario. Doing the right thing automatically is difficult.

In this post, I’m not trying to convince you to use orchestrator (though I’d be happy if you did). To be very clear, I’m not claiming it is better than any other tool. As always, each tool has pros and cons.

This post does not claim other tools are not good. Nor that orchestrator has all the answers. At the end of the day, pick the solution that works best for you. I’m happy to use a solution that reliably solves 99% of the cases as opposed to an unreliable solution that claims to solve 99.99% of the cases.

Quick background

orchestrator is actively maintained by GitHub. It manages automated failovers at GitHub. It manages automated failovers at Booking.com, one of the largest MySQL setups on this planet. It manages automated failovers as part of Vitess. These are some names I’m free to disclose, and browsing the issues shows a few more users running failovers in production. Otherwise, it is used for topology management and visualization in a large number of companies such as Square, Etsy, Sendgrid, Godaddy and more.

Let’s now follow one-by-one the observations on the SeveralNines post.

Flapping

orchestrator supports an anti-flapping mechanism. Once an automated failover kicks off, no additional automated failover will run on the same cluster for the duration of a pre-configured <code>RecoveryPeriodBlockSeconds</code>.  We, for example, set that time to 1 hour.

However, orchestrator also supports acknowledgements. A human (or a bot, for that matter, accessing the API or just running the command line) can acknowledge a failover. Once a failover is acknowledged, the block is removed. The next incident requiring a failover is free to proceed.

Moreover, a human is always allowed to forcibly invoke a failover (e.g. via orchestrator -c graceful-master-takeover or  orchestrator -c force-master-takeover). In such case, any blocking is ignored and orchestrator immediately kicks in the failover sequence.

Lost transactions

orchestrator does not pull binlog data from the failed master and only works with the data available on the replicas. There is a potential for data loss.

Note: There was actually some work into a MHA-like synching of relay logs, and in fact most of it is available right now in orchestrator, synching relaylogs via remote SSH and without agents. See https://github.com/github/orchestrator/issues/45 for some pointers. We looked into this for a couple months but saw some dangers and issues such as non-atomic relay log entries in RBR and others. We chose to put this on-hold, and I advise to not use this functionality in orchestrator.

Lost transactions, notes on semi-sync

Vitess have contributed a semi-sync failover mechanism (discussion). So you are using semi-sync? That’s wonderful. You’ve failed over and have greatly reduced the amount of lost data. What happens with your new setup? Will your recovery re-apply semi sync? Vitess’s contribution does just that and makes sure a new relica takes the semi-sync role.

Lost transactions, notes on “most up to date replica”

There is this fallacy. It has been proven to be a fallacy many time in production, that I’ve witnessed. I’m sure this happened somewhere else in the universe that I haven’t witnessed.

The fallacy says: “when master fails, we will promote the most up-to-date replica”. The most up-to-date may well be the wrong choice. You may wish to skip and lose the most up-to-date replica. The assumption is a fallacy, because many times our production environment is not sterile.

Consider:

  • You’re running 5.6 and wish to experiment/upgrade to 5.7. You will like upgrade a replica or two, measure their replication capacity, their query latency, etc. You now have a non-sterile topology. When the master fails, you must not promote your 5.7 replica because you will not be able to (or wouldn’t want to take the chance) replicate from that 5.7 onto the rest of your fleet, which is 5.6.
    • We are looking into 5.7 upgrade for a few months now. Most careful places I know likewise take months to run such an upgrade. These are months where your topology is not sterile.

So you’d rather lose the most up-to-date replica than lose the other 10 replicas you have.

  • Likewise, you run with STATEMENT based replication. You wish to experiment with ROW based replication. Your topology is not sterile. You must not upgrade your ROW replica, because the rest of the replicas (assuming log_slave_updates) would not be able to replicate.

orchestrator understands all the above and makes the right call. It would promote the replica that would ensure the survival of your fleet, and prefers to lose data.

  • You have a special replica with replication filters. More often than desired, this happens. Perhaps you’re splitting out a new sub-cluster, functionally partitioning away some tables. Or you’re running some exporter to Hadoop and only filter specific events. Or… You must not promote that replica.

orchestrator would not promote that replica.

Let’s look at something crazy now:

  • Getting back to the 5.7 upgrade, you have now upgraded 8 out of your 10 replicas to 5.7. You now do want to promote the 5.7 replica. It’s a matter of how many replicas you’d lose if you promoted _this one_ as opposed to _that one_.

orchestrator makes that calculation. Hey, the same applies for STATEMENT and ROW based replicas. Or maybe you have both 5.7 and RBR experiments (tsk tsk tsk, but life is challanging) at the same time. orchestrator will still pick the replica whose promotion will get the majority of your fleet intact.

  • The most up-to-date replica is on a different data center.

orchestrator can, but prefer not to promote it. But then, it also supports a 2 step promotion. If possible, it would first promote that most up-to-date replica from the other DC, then let local DC replicas catch up, then reverse replication and place a local replica on top, making the master on local DC again.

I cannot stress enough how useful and important this is.

This conveniently leads us to…

Roles

Fallacy: using configuration to white-list or black-list servers.

You’d expect to overcome the above 5.7 or ROW etc. issues by carefully marking your servers as blacklisted.

We’ve been there. This works on a setup with 10 servers. I claim this doesn’t scale.

Again, I wish to be clear: if this works for you, great! Ignore the rest of this section. I suggest that as you grow, this becomes more and more a difficult problem. At the scale of a 100 servers, it’s a serious pain. Examples:

  • You set global binlog_format='ROW'. Do you then immediately follow up to reconfigure your HA service to blacklist your server?
  • You provision a new box. Do you need to go and reconfigure your HA service, adding the box as white-listed?
  • You need to run maintenance on a server. Do you rush to reconfigure HA service?

You can invest in automating the reconfiguration of servers; I’ve been there as well, to some extent. This is a difficult task on its own.

  • You’d use service discovery to assign tags to hosts
  • And then puppet would distribute configuration.
    • How fast would it do that?
  • I’m not aware and am happy to learn if any of the mentioned solutions can be reconfigured by consul. You need to do the job yourself.

Also consider how flexible your tools are: suppose you reconfigure; how easy it is for your tools to reload and pick the new config?

  • orchestrator does its best, and reloads most configuration live, but even orchestrator has some “read-only” variables.

How easy it is to restart your HA services? Can you restart them in a rolling fashion, such that at any given point you have HA up and running?

  • This is well supported by orchestrator

orchestrator recognizes those crazy 5.7ROW-replication filters topologies by magic. Well, not magic. It just observes the state of the topologies. It learns the state of the topologies, so that at time of crash it has all the info. Its logic is based on that info. It knows and understands replication rules. It computes a good promotion path, taking data centers and configurations into consideration.

Initially, orchestrator started with blacklists in configuration. They’re still supported. But today it’s more about labels.

  • One of your replicas serves as a backup server. It runs extra tasks, or configured differently (no log_slave_updates? Different buffer pool settings?) and is not good to serve as a master. You should not promote that replica.

Instead of saying “this server is a backup server and should never be promoted” in configuration — (and that’s possible to do!), orchestrator lets you dynamically announce that this server should not be promoted. Such that maybe 10 minutes ago this wasn’t a backup server, but now it is. You can advertise that fact to orchestrator. We do that via:

orchestrator -c register-candidate -i {::fqdn} --promotion-rule=${promotion_rule}

where ${promotion_rule} is candidate, neutral or must_not. We run this from cron, every couple minutes. How we choose the right rule comes from our service discovery. orchestrator is always up-to-date (up to a couple minutes) at worst of role changes (and urgent role changes get propagated immediately).

Also, each host can self-declare its role, so that orchestrator discovers the role on discover, as per DetectPromotionRuleQuery. Vitess is known to use this approach, and they contributed the code.

Network partitioning

First, about general failure detection.

orchestrator uses a holistic approach to detecting failures, and instead of justifying that it is a really good approach, I will testify that it is. orchestrator‘s accuracy in recognizing a failover scenario is very high. Booking.com has X,000 MySQL servers in production. at that scale, servers and networks break enough that there’s always something breaking. Quoting (with permission) Simon J. Mudd from Booking.com:

We have at least one failure a day which it handles automatically. that’s really neat. people don’t care about dead masters any more.

orchestrator looks not only at the master, but at its replicas. If orchestrator can’t see the master, but the replicas are happily running and replicating, then it’s just orchestrator who can’t see the master. But if orchestrator doesn’t see the master`, and all the replicas are broken, then there’s a failover scenario.

With ClusterControl, there seems to be danger of false positives: “…therefore it can take an action if there are network issues between the master and the ClusterControl host.”

As for fencing:

MHA uses multiple nodes for detecting the failure. Especially if these nodes run on different DCs, this gives MHA a better view when fencing is involved.

Last September we gathered at an improvised BoF session on Percona Live Amsterdam. We were looking at observing a failure scenario where fencing is involved.

We mostly converged onto having multiple observers, such as 3 observers on 3 different data centers, a quorum of which would decide if there’s a failure scenario or not.

The problem is difficult. If the master is seen by 2 nodes, but all of its replicas are broken, does that make a failure scenario or not? What if a couple replicas are happy but ten others are not?

orchestrator has it on the roadmap to run a quorum based failover decision. My estimation is that it will do the right thing most times and that it would be very difficult to push towards 99.99%.

Integration

orchestrator provides HTTP API as well as a command line interface. We use those at GitHub, for example, to integrate orchestrator into our chatops. We command orchestrator via chat and get information via chat.

Or we and others have automated, scheduled jobs, that use the orchestrator command line to rearrange topologies.

There is no direct integration between orchestrator and other tools. Recently there have been requests to integrate orchestrator with ProxySQL. I see multiple use cases for that. Plug: in the upcoming Percona Live conference in Santa Clara, René Cannaò and myself will co-present a BoF on potential ProxySQL-orchestrator integration. It will be an open discussion. Please come and share your thoughts! (The talk hasn’t been scheduled yet, I will update this post with the link once it’s scheduled).

Addressing the issue of read_only as an indicator to master/replica roles, please see the discussion on https://github.com/sysown/proxysql/issues/789. Hint: this isn’t trivial and many times not reliable. It can work in some cases.

Conclusion

Is it time already? I have much to add; but let’s stay focus on the SeveralNines blog. Addressing the comparison chart at the “Conclusion” section:

Replication support

orchestrator supports failovers on:

  • Oracle GTID
  • MariaDB GTID
  • Pseudo-GTID
  • Semi-sync
  • Binlog servers

GTID support is ongoing (recent example). Traditionally orchestrator is very focused on Pseudo-GTID.

I should add: Pseudo-GTID is a thing. It runs, it runs well. Pseudo-GTID provides with almost all the Oracle GTID advantages without actually using GTID, and without GTID limitations. The obvious disclaimer is that Pseudo-GTID has its own limitations, too. But I will leave the PSeudo-GTID preaching for another time, and just note that GitHub and Booking.com both run automated failovers based on Pseudo-GTID.

Flapping

Time based blocking per cluster, acknowledgements supported; human overrides permitted.

Lost transactions

No checking for transactions on master

Network Partitioning

Excellent false positive detection. Seems like other tools like MRM and ClusterControl are following up by adopting the orchestrator approach, and I’m happy for that.

Roles

State-based automatic detection and decision making; also dynamic roles via advertising; also support for self- declaring roles; also support for configuration based black lists

Integration

HTTP API, command line interfaces. No direct integration with other products. Looking into ProxySQL with no promises held.

Further

If you use non-GTID replication, MHA is the only option for you

That is incorrect. orchestrator is an excellent choice in my biased opinion. GitHub and Booking.com both run orchestrator automated failovers without using GTIDs.

Only ClusterControl … is flexible enough to handle both types of GTID under one tool

That is incorrect. orchestrator supports both types of GTID. I’m further working towards better and better supports of Oracle GTID.

…this could be very useful if you have a mixed environment while you still would like to use one single tool to ensure high availability of your replication setup

Just to give you an impression: orchestrator works on topologies mixed with Oracle and MariaDB servers, normal replication and binlog servers, STATEMENT and ROW based replication, three major versions in the same cluster (as I recall I’ve seen that; I don’t run this today), and mixed combination of the above. Truly.

Finally

Use whatever tool works best for you.

Oh, and I forgot!

Please consider attending my talk, Practical Orchestrator, where I will share practical advice and walkthough on `orchestrator setup.

]]>
https://shlomi-noach.github.io/blog/mysql/mysql-high-availability-tools-followup-the-missing-piece-orchestrator/feed 7 7690
State of automated recovery via Pseudo-GTID & Orchestrator @ Booking.com https://shlomi-noach.github.io/blog/mysql/state-of-automated-recovery-via-pseudo-gtid-orchestrator-booking-com https://shlomi-noach.github.io/blog/mysql/state-of-automated-recovery-via-pseudo-gtid-orchestrator-booking-com#respond Fri, 20 Nov 2015 09:41:13 +0000 https://shlomi-noach.github.io/blog/?p=7453 This post sums up some of my work on MySQL resilience and high availability at Booking.com by presenting the current state of automated master and intermediate master recoveries via Pseudo-GTID & Orchestrator.

Booking.com uses many different MySQL topologies, of varying vendors, configurations and workloads: Oracle MySQL, MariaDB, statement based replication, row based replication, hybrid, OLTP, OLAP, GTID (few), no GTID (most), Binlog Servers, filters, hybrid of all the above.

Topologies size varies from a single server to many-many-many. Our typical topology has a master in one datacenter, a bunch of slaves in same DC, a slave in another DC acting as an intermediate master to further bunch of slaves in the other DC. Something like this, give or take:

booking-topology-sample

However as we are building our third data center (with MySQL deployments mostly completed) the graph turns more complex.

Two high availability questions are:

  • What happens when an intermediate master dies? What of all its slaves?
  • What happens when the master dies? What of the entire topology?

This is not a technical drill down into the solution, but rather on overview of the state. For more, please refer to recent presentations in September and April.

At this time we have:

  • Pseudo-GTID deployed on all chains
  • Pseudo-GTID based automated failover for intermediate masters on all chains
  • Pseudo-GTID based automated failover for masters on roughly 30% of the chains.
    • The rest of 70% of chains are set for manual failover using Pseudo-GTID.

Pseudo-GTID is in particular used for:

  • Salvaging slaves of a dead intermediate master
  • Correctly grouping and connecting slaves of a dead master
  • Routine refactoring of topologies. This includes:
    • Manual repointing of slaves for various operations (e.g. offloading slaves from a busy box)
    • Automated refactoring (for example, used by our automated upgrading script, which consults with orchestrator, upgrades, shuffles slaves around, updates intermediate master, suffles back…)
  • (In the works), failing over binlog reader apps that audit our binary logs.

Furthermore, Booking.com is also working on Binlog Servers:

  • These take production traffic and offload masters and intermediate masters
  • Often co-serve slaves using round-robin VIP, such that failure of one Binlog Server makes for simple slave replication self-recovery.
  • Are interleaved alongside standard replication
    • At this time we have no “pure” Binlog Server topology in production; we always have normal intermediate masters and slaves
  • This hybrid state makes for greater complexity:
    • Binlog Servers are not designed to participate in a game of changing masters/intermediate master, unless successors come from their own sub-topology, which is not the case today.
      • For example, a Binlog Server that replicates directly from the master, cannot be repointed to just any new master.
      • But can still hold valuable binary log entries that other slaves may not.
    • Are not actual MySQL servers, therefore of course cannot be promoted as masters

Orchestrator & Pseudo-GTID makes this hybrid topology still resilient:

  • Orchestrator understands the limitations on the hybrid topology and can salvage slaves of 1st tier Binlog Servers via Pseudo-GTID
  • In the case where the Binlog Servers were the most up to date slaves of a failed master, orchestrator knows to first move potential candidates under the Binlog Server and then extract them out again.
  • At this time Binlog Servers are still unstable. Pseudo-GTID allows us to comfortably test them on a large setup with reduced fear of losing slaves.

Otherwise orchestrator already understands pure Binlog Server topologies and can do master promotion. When pure binlog servers topologies will be in production orchestrator will be there to watch over.

Summary

To date, Pseudo-GTID has high scores in automated failovers of our topologies; orchestrator’s holistic approach makes for reliable diagnostics; together they reduce our dependency on specific servers & hardware, physical location, latency implied by SAN devices.

]]>
https://shlomi-noach.github.io/blog/mysql/state-of-automated-recovery-via-pseudo-gtid-orchestrator-booking-com/feed 0 7453