Quick hack for GTID_OWN lack

One of the benefits of MySQL GTIDs is that each server remembers all GTID entries ever executed. Normally these would be ranges, e.g. 0041e600-f1be-11e9-9759-a0369f9435dc:1-3772242 or multi-ranges, e.g. 24a83cd3-e30c-11e9-b43d-121b89fcdde6:1-103775793, 2efbcca6-7ee1-11e8-b2d2-0270c2ed2e5a:1-356487160, 46346470-6561-11e9-9ab7-12aaa4484802:1-26301153, 757fdf0d-740e-11e8-b3f2-0a474bcf1734:1-192371670, d2f5e585-62f5-11e9-82a5-a0369f0ed504:1-10047.
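You can inspect this set directly on any server (output below uses the first example above):

mysql> select @@global.gtid_executed\G
*************************** 1. row ***************************
@@global.gtid_executed: 0041e600-f1be-11e9-9759-a0369f9435dc:1-3772242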

One of the common problems in asynchronous replication is the issue of consistent reads. I’ve just written to the master. Is the data available on a replica yet? We have iterated on this: from reading on the master, to heuristically finding up-to-date replicas based on heartbeats (see presentation and slides) via freno, and we have now settled, in some parts of our apps, on using GTIDs.

GTIDs are reliable, as any replica can give you a definitive answer to the question: have you applied a given transaction or not? Given a GTID entry, say f7b781a9-cbbd-11e9-affb-008cfa542442:12345, one may query the following on a replica:

mysql> select gtid_subset('f7b781a9-cbbd-11e9-affb-008cfa542442:12345', @@global.gtid_executed) as transaction_found;
+-------------------+
| transaction_found |
+-------------------+
|                 1 |
+-------------------+

mysql> select gtid_subset('f7b781a9-cbbd-11e9-affb-008cfa542442:123450000', @@global.gtid_executed) as transaction_found;
+-------------------+
| transaction_found |
+-------------------+
|                 0 |
+-------------------+

Getting OWN_GTID

This is all well, but, given some INSERT or UPDATE on the master, how can I tell which GTID is associated with that transaction? There’s good news and bad news.

  • Good news is, you may SET SESSION session_track_gtids = OWN_GTID. This makes the MySQL protocol return the GTID generated by your transaction.
  • Bad news is, this isn’t a standard SQL response, and the common MySQL drivers offer you no way to get that information!

At GitHub we author our own Ruby driver, and have implemented the functionality to extract OWN_GTID, much like you’d extract LAST_INSERT_ID. But how does one solve this without modifying the driver? Here’s a poor person’s solution which gives you inexact, but good enough, information. Following a write (insert, delete, create, …), run:

select gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), @@global.gtid_executed)) as master_generated_gtid;

The idea is to “clean” the executed GTID set of irrelevant entries, by filtering out all ranges that do not belong to the server you’ve just written to (the master). The inner gtid_subtract() yields the master’s not-yet-used entries; subtracting those from the full range again leaves exactly the master’s executed entries, i.e. the intersection. The number 1000000000000000 stands for “a value high enough that it will never be reached in practice” – set your own preferred value, but note that this one takes you beyond 300 years even at 100,000 transactions per second.
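To see the double subtraction at work, here is the same expression over made-up literal sets, where the aaaaaaaa… UUID plays the master and bbbbbbbb… is some other server; gtid_subtract() accepts literal strings, so this runs anywhere:

mysql> select gtid_subtract('aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-1000000000000000', gtid_subtract('aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-1000000000000000', 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-100,bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb:1-500')) as master_generated_gtid;
+--------------------------------------------+
| master_generated_gtid                      |
+--------------------------------------------+
| aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-100 |
+--------------------------------------------+

The other server’s entries (bbbbbbbb…:1-500) drop out, leaving only what the master generated.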

Continue reading » “Quick hack for GTID_OWN lack”

Un-split brain MySQL via gh-mysql-rewind

We are pleased to release gh-mysql-rewind, a tool that allows us to move MySQL back in time, automatically identify and rewind split brain changes, and restore a split-brained server into a healthy replication chain.

I recently had the pleasure of presenting gh-mysql-rewind at FOSDEM. Video and slides are available. Consider following along with the video.

Motivation

Consider a split brain scenario: a “standard” MySQL replication topology suffered network isolation, and one of the replicas was promoted as new master. Meanwhile, the old master was still receiving writes from co-located apps.

Once the network isolation is over, we have a new master and an old master, and a split-brain situation: some writes only took place on one master; others only took place on the other. What if we wanted to converge the two? What paths do we have to, say, restore the old, demoted master as a replica of the newly promoted master?

The old master is unlikely to agree to replicate from the new master. Changes have been made. AUTO_INCREMENT values have been taken. UNIQUE constraints will fail.
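Assuming GTID is enabled, one can at least enumerate the divergence with GTID functions. This is a sketch illustrating the shape of the problem, not how gh-mysql-rewind itself operates, and the placeholder value must be pasted in from the new master:

mysql> -- on the new master, first run: select @@global.gtid_executed;
mysql> -- then, on the old (demoted) master, paste that result below:
mysql> select gtid_subtract(@@global.gtid_executed, '<new-master-gtid_executed-goes-here>') as writes_unique_to_old_master;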

A few months ago, we at GitHub had exactly this scenario. An entire data center went network isolated. Automation failed over to a 2nd DC. Masters in the isolated DC meanwhile kept receiving writes. At the end of the failover we ended up with a split brain scenario – which we expected. However, an additional, unexpected constraint forced us to fail back to the original DC.

We had to make a choice: we had already operated for a long time in the 2nd DC and had taken many writes there that we were unwilling to lose. We were OK with losing (after auditing) the few seconds of writes on the isolated DC. But how do we converge the data?

Backups are the trivial way out, but they incur long recovery times. Shipping backup data over the network for dozens of servers takes time. Restore time, catching up with changes since the backup took place, and warming up the servers so that they can handle production traffic all take time.

Could we have reduced the time for recovery?

Continue reading » “Un-split brain MySQL via gh-mysql-rewind”

MySQL master discovery methods, part 5: Service discovery & Proxy

This is the fifth in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which replication failure detection and recovery take place. I will share orchestrator-specific configuration/advice, and point out where a cross-DC orchestrator/raft setup plays a part in discovery itself, but for the most part any recovery tool, such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via Service discovery and Proxy

Part 4 presented an anti-pattern setup, where a proxy would infer the identity of the master by drawing conclusions from backend server checks. This led to split brains and undesired scenarios. The problem was the loss of context.

We re-introduce a service discovery component (illustrated in part 3), such that:

  • The app does not own the discovery, and
  • The proxy behaves in an expected and consistent way.

In a failover/service discovery/proxy setup, there is clear ownership of duties:

  • The failover tool owns the failover itself and the master identity change notification.
  • The service discovery component is the source of truth as to the identity of the master of a cluster.
  • The proxy routes traffic but does not make routing decisions.
  • The app only ever connects to a single target, but should allow for a brief outage while failover takes place.

Depending on the technologies used, we can further achieve:

  • Hard cut of connections to the old, demoted master M.
  • Blocking/holding off incoming queries for the duration of the failover.

We explain the setup using the following assumptions and scenarios:

  • All clients connect to master via cluster1-writer.example.net, which resolves to a proxy box.
  • We fail over from master M to promoted replica R.

Continue reading » “MySQL master discovery methods, part 5: Service discovery & Proxy”

MySQL master discovery methods, part 4: Proxy heuristics

Note: the method described here is an anti-pattern

This is the fourth in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which replication failure detection and recovery take place. I will share orchestrator-specific configuration/advice, and point out where a cross-DC orchestrator/raft setup plays a part in discovery itself, but for the most part any recovery tool, such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via Proxy Heuristics

In Proxy Heuristics all clients connect to the master through a proxy. The proxy observes the backend MySQL servers and determines who the master is.

This setup is simple and easy, but it is an anti-pattern. I recommend against using this method, as explained shortly.

Clients are all configured to connect to, say, cluster1-writer.proxy.example.net:3306. The proxy intercepts incoming requests based either on hostname or on port. It is aware of all/some MySQL backend servers in that cluster, and routes traffic to the master M.

A simple heuristic that I’ve seen in use: pick the server that has read_only=0. It’s a very simple check.
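In practice the check boils down to a single query per backend, wrapped in whatever health-check scripting the proxy offers (a sketch; the proxy then routes writes to the single server that returns 0):

mysql> select @@global.read_only;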

Let’s take a look at how this works and what can go wrong.

Continue reading » “MySQL master discovery methods, part 4: Proxy heuristics”

MySQL master discovery methods, part 3: app & service discovery

This is the third in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which replication failure detection and recovery take place. I will share orchestrator-specific configuration/advice, and point out where a cross-DC orchestrator/raft setup plays a part in discovery itself, but for the most part any recovery tool, such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

App & service discovery

Part 1 and part 2 presented solutions where the app remained ignorant of the master’s identity. This part takes the complete opposite direction and gives the app ownership of master access.

We introduce a service discovery component. Well-known options are Consul, ZooKeeper and etcd: highly available stores offering key/value (K/V) access, leader election, or full-blown service discovery & health checks.

We satisfy ourselves with K/V functionality. A key would be mysql/master/cluster1 and a value would be the master’s hostname/port.

It is the app’s responsibility at all times to fetch the identity of the master of a given cluster by querying the service discovery component, thereby opening connections to the indicated master.

The service discovery component is expected to be up at all times and to contain the identity of the master for any given cluster.

Continue reading » “MySQL master discovery methods, part 3: app & service discovery”

MySQL master discovery methods, part 2: VIP & DNS

This is the second in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which replication failure detection and recovery take place. I will share orchestrator-specific configuration/advice, and point out where a cross-DC orchestrator/raft setup plays a part in discovery itself, but for the most part any recovery tool, such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via VIP

In part 1 we saw that one of the main drawbacks of DNS discovery is the time it takes for the apps to connect to the promoted master. This is the result of both DNS deployment time and the client’s TTL.

A quicker method is offered: use of VIPs (Virtual IPs). As before, apps would connect to cluster1-writer.example.net, cluster2-writer.example.net, etc. However, these would resolve to specific VIPs.

Say cluster1-writer.example.net resolves to 10.10.0.1. We let this address float between servers. Each server has its own IP (say 10.20.0.XXX) but could also potentially claim the VIP 10.10.0.1.

VIPs can be assigned by switches, and I will not dwell on the internals, because I’m not a network expert. However, the following holds:

  • Acquiring a VIP is a very quick operation.
  • Acquiring a VIP must take place on the acquiring host.
  • A host may be unable to acquire a VIP should another host hold the same VIP.
  • A VIP can only be assigned within a bounded space: hosts connected to the same switch; hosts in the same Data Center or availability zone.

Continue reading » “MySQL master discovery methods, part 2: VIP & DNS”

MySQL master discovery methods, part 1: DNS

This is the first in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which replication failure detection and recovery take place. I will share orchestrator-specific configuration/advice, and point out where a cross-DC orchestrator/raft setup plays a part in discovery itself, but for the most part any recovery tool, such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via DNS

In DNS master discovery, applications connect to the master via a name that resolves to the master’s box. By way of example, apps would target the masters of different clusters by connecting to cluster1-writer.example.net, cluster2-writer.example.net, etc. It is up to DNS to resolve those names to IPs.

Continue reading » “MySQL master discovery methods, part 1: DNS”

Using dbdeployer in CI tests

I was very pleased when Giuseppe Maxia (aka datacharmer) unveiled dbdeployer in his talk at pre-FOSDEM MySQL day. The announcement came just at the right time. I wish to briefly describe how we use dbdeployer (work in progress).

The case for gh-ost

A user opened an issue on gh-ost, reporting a problem on MySQL 5.5. gh-ost is tested on 5.7, where the problem does not reproduce. A discussion with Gillian Gunson raised the concern of not testing on all versions. Can we run gh-ost tests for all MySQL/Percona/MariaDB versions? Should we? How easy would it be?

gh-ost tests

gh-ost has three different test types:

  • Unit tests: these are plain golang logic tests which are very easy and quick to run.
  • Integration tests: the topic of this post; see below. Today these do not run as part of automated CI testing.
  • System tests: putting our production tables to the test, continuously migrating our production data on dedicated replicas, verifying checksums are identical and data is intact, read more.

Unit tests are already running as part of automated CI (every PR is subjected to those tests). System tests are clearly tied to our production servers. What’s the deal with the integration tests? Continue reading » “Using dbdeployer in CI tests”

orchestrator 3.0.6: faster crash detection & recoveries, auto Pseudo-GTID, semi-sync and more

orchestrator 3.0.6 is released and includes some exciting improvements and features. It quickly follows up on the recently released 3.0.5, and this post gives a breakdown of some notable changes:

Faster failure detection

Recall that orchestrator uses a holistic approach for failure detection: it reads state not only from the failed server (e.g. master) but also from its replicas. orchestrator now detects failure faster than before:

  • A detection cycle has been eliminated, leading to quicker resolution of a failure. On our setup, where we poll servers every 5sec, failure detection time dropped from 7-10sec to 3-5sec while keeping reliability: the reduction in time does not lead to increased false positives.
    Side note: you may see an increase in not-quite-failure analyses such as “I can’t see the master” (UnreachableMaster).
  • Better handling of network scenarios where packets are dropped. Instead of hanging until TCP timeout, orchestrator now observes server discovery asynchronously. We have specialized failover tests that simulate dropped packets. The change reduces detection time by some 5sec.

Faster master recoveries

Promoting a new master is a complex task: the aim is to promote the best replica out of the pool of replicas, which is not always the most up-to-date one. The choice varies depending on replica configuration, version, and state.

With recent changes, orchestrator is able to recognize, early on, that the replica it would like to promote as master is ideal. Assuming that is the case, orchestrator immediately promotes it (i.e. runs hooks, sets read_only=0, etc.), and runs the rest of the failover logic, i.e. the rewiring of replicas under the newly promoted master, asynchronously.

This allows the promoted server to take writes sooner, even while its replicas are not yet connected. It also means external hooks are executed sooner.
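In essence, the immediate promotion amounts to the equivalent of the following statements on the chosen replica (a sketch in MySQL 5.7-era syntax; orchestrator performs this internally, alongside running hooks):

STOP SLAVE;
RESET SLAVE ALL;            -- detach from the failed master
SET GLOBAL read_only = 0;   -- start accepting writes

The rewiring of the remaining replicas under the new master can then proceed asynchronously, via CHANGE MASTER TO on each of them.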

Between faster detection and faster recoveries, we’re looking at some 10sec reduction in overall recovery time: from the moment of crash to the moment a new master accepts writes. We now stand at < 20sec in almost all cases, and < 15sec in optimal cases. These times are measured in our failover tests.

We are also working on reducing the parts of failover time that are unrelated to orchestrator, and hope to share updates soon.

Automated Pseudo-GTID

As a reminder, Pseudo-GTID is an alternative to GTID, without the kind of commitment you make with GTID. It provides the similar “point your replica under any other server” behavior that GTID allows. Continue reading » “orchestrator 3.0.6: faster crash detection & recoveries, auto Pseudo-GTID, semi-sync and more”

orchestrator/raft: Pre-Release 3.0

orchestrator 3.0 Pre-Release is now available. Most notable are raft consensus, SQLite backend support, and orchestrator-client, a no-binary-required client script.

TL;DR

You may now set up high availability for orchestrator via raft consensus, without the need to set up high availability for orchestrator’s backend MySQL servers (such as Galera/InnoDB Cluster). In fact, you can run an orchestrator/raft setup using an embedded SQLite backend DB. Read on.

orchestrator still supports the existing shared backend DB paradigm; nothing dramatic changes if you upgrade to 3.0 and do not configure raft.

orchestrator/raft

Raft is a consensus protocol, supporting leader election and consensus across a distributed system. In an orchestrator/raft setup, orchestrator nodes talk to each other via the raft protocol, form consensus and elect a leader. Each orchestrator node has its own dedicated backend database. The backend databases do not speak to each other; only the orchestrator nodes speak to each other.
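By way of illustration, a raft setup is declared in each node’s configuration file. The following sketch uses the documented raft-related settings, with placeholder hostnames and paths; backend DB settings are configured separately:

  "RaftEnabled": true,
  "RaftDataDir": "/var/lib/orchestrator",
  "RaftBind": "node1.example.net",
  "RaftNodes": ["node1.example.net", "node2.example.net", "node3.example.net"]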

No MySQL replication setup needed; the backend DBs act as standalone servers. In fact, the backend server doesn’t have to be MySQL, and SQLite is supported. orchestrator now ships with SQLite embedded; no external dependency required. Continue reading » “orchestrator/raft: Pre-Release 3.0”