MySQL master discovery methods, part 1: DNS

This is the first in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree and, upon master failover, identifies and connects to the newly promoted master.

These posts are not concerned with how replication failure detection and recovery take place. I will share orchestrator-specific configuration and advice, and point out where a cross-DC orchestrator/raft setup plays a part in discovery itself, but for the most part any recovery tool, such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via DNS

In DNS master discovery, applications connect to the master via a name that resolves to the master’s box. By way of example, apps would target the masters of different clusters by connecting to cluster1-writer.example.net, cluster2-writer.example.net, etc. It is up to DNS to resolve those names to IPs.
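
As a rough illustration of what this looks like from the application side, here is a minimal Python sketch. The <cluster>-writer.example.net naming follows the example names above; the function names and the injectable resolver argument are hypothetical, introduced only for illustration.

```python
import socket

WRITER_DOMAIN = "example.net"  # domain taken from the example names above


def writer_host(cluster):
    """Build the writer name for a cluster, e.g. cluster1-writer.example.net."""
    return f"{cluster}-writer.{WRITER_DOMAIN}"


def resolve_writer(cluster, port=3306, resolver=socket.getaddrinfo):
    """Resolve the cluster's writer name to the IPs DNS currently returns.

    The app never hard-codes the master's address; it only knows the name.
    `resolver` is injectable (an assumption of this sketch) so the lookup
    can be exercised without a real DNS server.
    """
    infos = resolver(writer_host(cluster), port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

An app would call resolve_writer("cluster1") (or simply connect to writer_host("cluster1")) when opening a connection. Once the name is repointed after a failover, new resolutions return the promoted master, subject to whatever DNS caching sits in between.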


orchestrator 3.0.6: faster crash detection & recoveries, auto Pseudo-GTID, semi-sync and more

orchestrator 3.0.6 is released and includes some exciting improvements and features. It quickly follows up on 3.0.5 released recently, and this post gives a breakdown of some notable changes:

Faster failure detection

Recall that orchestrator uses a holistic approach for failure detection: it reads state not only from the failed server (e.g. master) but also from its replicas. orchestrator now detects failure faster than before:

  • A detection cycle has been eliminated, leading to quicker resolution of a failure. On our setup, where we poll servers every 5sec, failure detection time dropped from 7-10sec to 3-5sec, keeping reliability. The reduction in time does not lead to increased false positives.
    Side note: you may see increased not-quite-failure analysis such as “I can’t see the master” (UnreachableMaster).
  • Better handling of network scenarios where packets are dropped. Instead of hanging till TCP timeout, orchestrator now observes server discovery asynchronously. We have specialized failover tests that simulate dropped packets. The change reduces detection time by some 5sec.
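
The holistic approach described above can be sketched roughly as follows. This is a simplified illustration in Python, not orchestrator's actual logic: the function name and inputs are hypothetical, and the analysis names other than UnreachableMaster are used here only as labels.

```python
def analyze_master(master_reachable, replicas_replicating):
    """Classify master state from the master's own check plus its replicas' view.

    replicas_replicating: list of booleans, one per replica, True when that
    replica still reports a working replication connection to the master.
    """
    if master_reachable:
        return "NoProblem"
    if replicas_replicating and not any(replicas_replicating):
        # Nobody can see the master: the direct check and every replica agree.
        return "DeadMaster"
    # Only the direct check failed; at least one replica still replicates
    # (or there are no replicas to confirm), so this is not yet a failure.
    return "UnreachableMaster"
```

The point of the second opinion is that a failed direct check alone only yields the not-quite-failure analysis; a failover-worthy verdict requires the replicas to agree.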

Faster master recoveries

Promoting a new master is a complex task which attempts to promote the best replica out of the pool of replicas. It’s not always the most up-to-date replica. The choice varies depending on replica configuration, version, and state.

With recent changes, orchestrator is able to recognize, early on, that the replica it would like to promote as master is ideal. Assuming that is the case, orchestrator is able to promote it immediately (i.e. run hooks, set read_only=0 etc.), and run the rest of the failover logic, i.e. the rewiring of replicas under the newly promoted master, asynchronously.

This allows the promoted server to take writes sooner, even while its replicas are not yet connected. It also means external hooks are executed sooner.
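
The ordering described above, promote first and rewire in the background, can be sketched like this. It is a minimal Python illustration under stated assumptions: promote and rewire are hypothetical callbacks standing in for the real work (running hooks, setting read_only=0, repointing replication), and the thread pool is just one way to express "asynchronously".

```python
from concurrent.futures import ThreadPoolExecutor


def fail_over(candidate, replicas, promote, rewire):
    """Promote `candidate` first, then repoint the other replicas in the background.

    Returns the futures of the rewiring tasks so the caller may wait on them
    later; the promotion itself has completed by the time this returns.
    """
    promote(candidate)  # the new master can take writes from this point on
    pool = ThreadPoolExecutor(max_workers=4)
    futures = [pool.submit(rewire, r, candidate) for r in replicas if r != candidate]
    pool.shutdown(wait=False)  # do not block; already-submitted tasks still run
    return futures
```

The caller gets control back as soon as the promotion is done, which is what lets writes and external hooks proceed while replicas are still being reconnected.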

Between faster detection and faster recoveries, we’re looking at some 10sec reduction in overall recovery time: from the moment of crash to the moment a new master accepts writes. We stand now at < 20sec in almost all cases, and < 15sec in optimal cases. Those times are measured on our failover tests.

We are working on reducing failover time unrelated to orchestrator and hope to update soon.

Automated Pseudo-GTID

As a reminder, Pseudo-GTID is an alternative to GTID, without the kind of commitment you make with GTID. It provides the same kind of “point your replica under any other server” behavior that GTID allows.

orchestrator 3.0.2 GA released: raft consensus, SQLite

orchestrator 3.0.2 GA is released and available for download (see also packagecloud repository).

3.0.2 is the first stable release in the 3.0* series, introducing (recap from 3.0 pre-release announcement):

orchestrator/raft

Raft is a consensus protocol, supporting leader election and consensus across a distributed system.  In an orchestrator/raft setup orchestrator nodes talk to each other via raft protocol, form consensus and elect a leader. Each orchestrator node has its own dedicated backend database. The backend databases do not speak to each other; only the orchestrator nodes speak to each other.

No MySQL replication setup needed; the backend DBs act as standalone servers. In fact, the backend server doesn’t have to be MySQL, and SQLite is supported. orchestrator now ships with SQLite embedded, no external dependency needed.

For details, please refer to the documentation:

SQLite

Many have suggested and requested removing orchestrator’s own dependency on a MySQL backend. orchestrator now supports a SQLite backend.

SQLite is a transactional, relational, embedded database, and as of 3.0 it is embedded within orchestrator, no external dependency required.

orchestrator-client

orchestrator-client is a client shell script which mimics the command line interface, while running curl | jq requests against the HTTP API. It stands to simplify your deployments: interacting with the orchestrator service via orchestrator-client is easier and only requires you to place a shell script (this is as opposed to installing the orchestrator binary + configuration file).

orchestrator-client is the way to interact with your orchestrator/raft cluster. orchestrator-client now has its own RPM/deb release package.

You may still use the web interface and the web API; a special --ignore-raft-setup flag keeps power in your hands (use at your own risk).

State of orchestrator/raft

orchestrator/raft is a big change: Continue reading » “orchestrator 3.0.2 GA released: raft consensus, SQLite”

Speaking at August Penguin, MySQL Track, GitHub sponsored

This Thursday I’ll be presenting at August Penguin, conveniently taking place September 7th, 8th, Ramat Gan, Israel.

I will be speaking as part of the MySQL track, 2nd half of Thursday. The (Hebrew) schedule is here.

My talk is titled Reliable failovers, safe schema migrations: open source solutions to MySQL problems. I will describe some of the open source MySQL infrastructure work we run at GitHub, and how it solves reliability, availability and usability. I’ll describe some of our internal workflows and our use of chat and chatops.

I’m proud to announce GitHub sponsors the event. We won’t have a booth, but please do grab me in the hallways or over lunch to chat!

And, yes, octocat stickers will be made available 🙂

orchestrator/raft: Pre-Release 3.0

orchestrator 3.0 Pre-Release is now available. Most notable are Raft consensus, SQLite backend support, orchestrator-client no-binary-required client script.

TL;DR

You may now set up high availability for orchestrator via raft consensus, without the need to set up high availability for orchestrator’s backend MySQL servers (such as Galera/InnoDB Cluster). In fact, you can run an orchestrator/raft setup using an embedded SQLite backend DB. Read on.

orchestrator still supports the existing shared backend DB paradigm; nothing dramatic changes if you upgrade to 3.0 and do not configure raft.

orchestrator/raft

Raft is a consensus protocol, supporting leader election and consensus across a distributed system.  In an orchestrator/raft setup orchestrator nodes talk to each other via raft protocol, form consensus and elect a leader. Each orchestrator node has its own dedicated backend database. The backend databases do not speak to each other; only the orchestrator nodes speak to each other.

No MySQL replication setup needed; the backend DBs act as standalone servers. In fact, the backend server doesn’t have to be MySQL, and SQLite is supported. orchestrator now ships with SQLite embedded, no external dependency needed.

What’s so complicated about a master failover?

The more I work on orchestrator, the more user input and the more production experience I get, the more insights I gain into MySQL master recoveries. I’d like to share the complexities in correctly running general-purpose master failovers, from picking the right candidates to finalizing the promotion.

The TL;DR is: we’re often unaware of just how things can turn at the time of failover, and the impact of every single decision we make. Different environments have different requirements, and different users wish to have different policies. Understanding the scenarios can help you make the right choice.

The scenarios and considerations below are ones I picked while browsing through the orchestrator code and through Issues and questions. There are more. There are always more scenarios.

I discuss “normal replication” scenarios below; some of these also apply to synchronous replication setups (Galera, XtraDB Cluster, InnoDB Cluster) when running cross DC, when using intermediate masters, or when working in an evolving environment.

orchestrator-wise, please refer to an earlier post, “MySQL High Availability tools” followup, the missing piece: orchestrator. Some notions from that post are reiterated here.

Observations on the hashicorp/raft library, and notes on RDBMS

The hashicorp/raft library is a Go library to provide consensus via Raft protocol implementation. It is the underlying library behind Hashicorp’s Consul.

I’ve had the opportunity to work with this library on a couple of projects, namely freno and orchestrator. Here are a few observations on working with this library:

  • TL;DR on Raft: a group communication protocol; multiple nodes communicate, elect a leader. A leader leads a consensus (any subgroup of more than half the nodes of the original group, or hopefully all of them). Nodes may leave and rejoin, and will remain consistent with consensus.
  • The hashicorp/raft library is an implementation of the Raft protocol. There are other implementations, and different implementations support different features.
  • The most basic premise is leader election. This is pretty straightforward to implement; you set up nodes to communicate to each other, and they elect a leader. You may query for the leader identity via Leader(), VerifyLeader(), or observing LeaderCh.
  • You have no control over the identity of the leader. You cannot “prefer” one node to be the leader. You cannot grab leadership from an elected leader, and you cannot demote a leader unless by killing it.
  • The next premise is gossip, sending messages between the raft nodes. With hashicorp/raft, only the leader may send messages to the group. This is done via the Apply() function.
  • Messages are nothing but blobs. Your app encodes the messages into []byte and ships it via raft. Receiving ends need to decode the bytes into a meaningful message.
  • You will check the result of Apply(), an ApplyFuture. The call to Error() will wait for consensus.
  • Just what is a message consensus? It’s a guarantee that the consensus of nodes has received and registered the message.
  • Messages form the raft log.
  • Messages are guaranteed to be handled in-order across all nodes.
  • The leader is satisfied when the followers receive the messages/log, but it cares not for their interpretation of the log.
  • The leader does not collect the output, or return value, of the followers’ application of the log.
  • Consequently, your followers may not abort the message. They may not cast an opinion. They must adhere to the instruction received from the leader.
  • hashicorp/raft uses either an LMDB-based store or BoltDB for persisting your messages. Both are transactional stores.
  • Messages are expected to be idempotent: a node that, say, happens to restart, will request to join back the consensus (or to form a consensus with some other node). To do that, it will have to reapply historical messages that it may have applied in the past.
  • The number of messages (log entries) grows without bound. Snapshots are taken so as to truncate the log history. You will implement the snapshot dump & load.
  • A snapshot includes the log index up to which it covers.
  • Upon startup, your node will look for the most recent snapshot. It will read it, then resume replication from the aforementioned log index.
  • hashicorp/raft provides a file-system based snapshot implementation.
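
To make the points about opaque messages, idempotency and snapshots more concrete, here is a toy sketch in plain Python. It does not use or mirror the hashicorp/raft API; the class, the JSON encoding and the index guard are all assumptions of this illustration. Messages travel as bytes, "set key to value" is naturally idempotent, and the snapshot records the log index it covers so replay can resume from there.

```python
import json


def encode(key, value):
    """Encode a message as bytes, the only thing the log carries."""
    return json.dumps({"key": key, "value": value}).encode("utf-8")


class KVStateMachine:
    """Toy state machine fed by an ordered log of opaque byte messages."""

    def __init__(self):
        self.data = {}
        self.last_applied = 0  # index of the last log entry applied

    def apply(self, index, blob):
        """Decode and apply one log entry; re-applying an old entry is a no-op."""
        if index <= self.last_applied:
            return  # already covered, e.g. a replay after restart
        msg = json.loads(blob.decode("utf-8"))
        self.data[msg["key"]] = msg["value"]
        self.last_applied = index

    def snapshot(self):
        """Dump the state together with the log index it covers."""
        payload = {"index": self.last_applied, "data": self.data}
        return json.dumps(payload).encode("utf-8")

    @classmethod
    def restore(cls, blob):
        """Load a snapshot; replay then resumes after the recorded index."""
        snap = json.loads(blob.decode("utf-8"))
        sm = cls()
        sm.data, sm.last_applied = snap["data"], snap["index"]
        return sm
```

A restarted node would call restore() on its most recent snapshot and then feed the log through apply(); entries at or below the snapshot's index are skipped, and later ones bring it back in line with the rest of the group.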

One of my use cases is completely satisfied with the existing implementations of BoltDB and of the filesystem snapshot.

However, in another (orchestrator), my app stores its state in a relational backend. To that effect, I’ve modified the log store and snapshot store. I’m using either MySQL or SQLite as backend stores for my app. How does that affect my raft use?

Practical Orchestrator, BoF, GitHub and other talks at Percona Live 2017

Next week I will be presenting Practical Orchestrator at Percona Live, Santa Clara.

As opposed to previous orchestrator talks I gave, and which were either high level or algorithmic talks, Practical Orchestrator will be, well… practical.

The objective for this talk is that attendees leave the classroom with a good grasp of orchestrator’s powers, and know how to set up orchestrator in their environment.

We will walk through discovery, refactoring, recovery, HA. I will walk through the most important configuration settings, share advice on what makes a good deployment, and tell you how we and others run orchestrator. We’ll present a few scripting/automation examples. We will literally set up orchestrator on my computer.

It’s a 50 minute talk and it will be fast paced!

ProxySQL & Orchestrator BoF

ProxySQL is all the rage, and over the past 18 months René Cannaò and I have discussed a few times the potential for integration between ProxySQL and Orchestrator. We’ve also received several requests from the community.

We will run a BoF, a very informal session where we openly discuss our thoughts on possible integration, what makes sense and what doesn’t, and above all else would love to hear the attendees’ thoughts. We might come out of this session with some plan to pick low hanging fruit, who knows?

The current link to the BoF sessions is this. It seems terribly broken, and hopefully I’ll replace it later on.

GitHub talks

GitHub engineers will further present these talks: Continue reading » “Practical Orchestrator, BoF, GitHub and other talks at Percona Live 2017”

“MySQL High Availability tools” followup, the missing piece: orchestrator

I read with interest MySQL High Availability tools – Comparing MHA, MRM and ClusterControl by SeveralNines. I thought there was a missing piece in the comparison: orchestrator, and that as a result the comparison was missing scope and context.

I’d like to add my thoughts on topics addressed in the post. I’m by no means an expert on MHA, MRM or ClusterControl, and will mostly focus on how orchestrator tackles high availability issues raised in the post.

What this is

This is to add insights on the complexity of failovers. Over the course of three years, I have kept thinking I’d seen it all, only to get hit by yet another crazy scenario. Doing the right thing automatically is difficult.

In this post, I’m not trying to convince you to use orchestrator (though I’d be happy if you did). To be very clear, I’m not claiming it is better than any other tool. As always, each tool has pros and cons.

This post does not claim other tools are not good. Nor that orchestrator has all the answers. At the end of the day, pick the solution that works best for you. I’m happy to use a solution that reliably solves 99% of the cases as opposed to an unreliable solution that claims to solve 99.99% of the cases.

Quick background

orchestrator is actively maintained by GitHub. It manages automated failovers at GitHub. It manages automated failovers at Booking.com, one of the largest MySQL setups on this planet. It manages automated failovers as part of Vitess. These are some names I’m free to disclose, and browsing the issues shows a few more users running failovers in production. Otherwise, it is used for topology management and visualization in a large number of companies such as Square, Etsy, Sendgrid, Godaddy and more.

Let’s now follow, one by one, the observations in the SeveralNines post.

orchestrator Puppet module now available

We have just open sourced and published an orchestrator puppet module. This module is authored by Tom Krouper of GitHub’s database infrastructure team, and is what we use internally at GitHub for deploying orchestrator.

The module manages the orchestrator service, the config file (inherit to override values), etc (pun intended). Check it out!