June 2017 – code.openark.org

The more I work on orchestrator, the more user input I receive and the more production experience I gain, the more insight I get into MySQL master recoveries. I’d like to share the complexities of correctly running general-purpose master failovers, from picking the right candidates to finalizing the promotion.

The TL;DR is: we’re often unaware of just how things can turn at the time of failover, and the impact of every single decision we make. Different environments have different requirements, and different users wish to have different policies. Understanding the scenarios can help you make the right choice.

The scenarios and considerations below are ones I picked while browsing through the orchestrator code and through Issues and questions. There are more. There are always more scenarios.

I discuss “normal replication” scenarios below; some of these also apply to synchronous replication setups (Galera, XtraDB Cluster, InnoDB Cluster) when running cross-DC, when using intermediate masters, or when working in an evolving environment.

orchestrator-wise, please refer to an earlier post, “‘MySQL High Availability tools’ followup, the missing piece: orchestrator”. Some notions from that post are reiterated here. Continue reading » “What’s so complicated about a master failover?”

The hashicorp/raft library is a Go library that provides consensus by implementing the Raft protocol. It is the underlying library behind Hashicorp’s Consul.

I’ve had the opportunity to work with this library on a couple of projects, namely freno and orchestrator. Here are a few observations on working with this library:

TL;DR on Raft: a group communication protocol; multiple nodes communicate, elect a leader. A leader leads a consensus (any subgroup of more than half the nodes of the original group, or hopefully all of them). Nodes may leave and rejoin, and will remain consistent with consensus.
The hashicorp/raft library is an implementation of the Raft protocol. There are other implementations, and different implementations support different features.
The most basic premise is leader election. This is pretty straightforward to implement: you set up nodes to communicate with each other, and they elect a leader. You may query for the leader's identity via Leader(), confirm that the local node is still the leader via VerifyLeader(), or observe leadership changes on LeaderCh().
You have no control over the identity of the leader. You cannot “prefer” one node to be the leader. You cannot grab leadership from an elected leader, and you cannot demote a leader other than by killing it.
The next premise is gossip, sending messages between the raft nodes. With hashicorp/raft, only the leader may send messages to the group. This is done via the Apply() function.
Messages are nothing but blobs. Your app encodes each message into []byte and ships it via raft. Receiving ends need to decode the bytes into a meaningful message.
You will check the result of Apply(), an ApplyFuture. The call to Error() will wait for consensus.
Just what is a message consensus? It’s a guarantee that the consensus of nodes has received and registered the message.
Messages form the raft log.
Messages are guaranteed to be handled in-order across all nodes.
The leader is satisfied when the followers receive the messages/log, but it cares not for their interpretation of the log.
The leader does not collect the output, or return value, of the followers' application of the log.
Consequently, your followers may not abort the message. They may not cast an opinion. They must adhere to the instruction received from the leader.
hashicorp/raft uses either an LMDB-based store or BoltDB for persisting your messages. Both are transactional stores.
Messages are expected to be idempotent: a node that, say, happens to restart will ask to rejoin the consensus (or to form a consensus with some other node). To do that, it will have to reapply historical messages that it may have already applied in the past.
The number of messages (log entries) will grow without bound. Snapshots are taken so as to truncate the log history. You will implement the snapshot dump & load.
A snapshot includes the log index up to which it covers.
Upon startup, your node will look for the most recent snapshot. It will read it, then resume replication from the aforementioned log index.
hashicorp/raft provides a file-system based snapshot implementation.
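
The notes above on blob messages, idempotent application and snapshots can be sketched without the library itself. What follows is a minimal, standard-library-only illustration and not the hashicorp/raft API: the `command`, `logEntry`, `snapshot` and `store` types, and all of their methods, are hypothetical names made up for this example. It assumes JSON as the message encoding and a simple key/value map as the application state.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// command is a hypothetical application message; the log only ever sees []byte.
type command struct {
	Op    string `json:"op"`
	Key   string `json:"key"`
	Value string `json:"value,omitempty"`
}

// logEntry pairs an opaque payload with its position in the log.
type logEntry struct {
	Index uint64
	Data  []byte
}

// snapshot holds the full state plus the last log index it covers.
type snapshot struct {
	LastIndex uint64            `json:"last_index"`
	State     map[string]string `json:"state"`
}

// store is a toy state machine (an assumption, standing in for the app).
type store struct {
	lastApplied uint64
	state       map[string]string
}

func newStore() *store { return &store{state: map[string]string{}} }

// mustEntry encodes a command into a blob, as the sending side would.
func mustEntry(index uint64, c command) logEntry {
	b, err := json.Marshal(c)
	if err != nil {
		panic(err)
	}
	return logEntry{Index: index, Data: b}
}

// apply decodes one blob and applies it. "set" and "delete" are idempotent:
// applying the same entry twice leaves the state unchanged.
func (s *store) apply(e logEntry) error {
	var c command
	if err := json.Unmarshal(e.Data, &c); err != nil {
		return fmt.Errorf("decode entry %d: %w", e.Index, err)
	}
	switch c.Op {
	case "set":
		s.state[c.Key] = c.Value
	case "delete":
		delete(s.state, c.Key)
	default:
		return fmt.Errorf("entry %d: unknown op %q", e.Index, c.Op)
	}
	if e.Index > s.lastApplied {
		s.lastApplied = e.Index
	}
	return nil
}

// dump serializes the state together with the log index it covers.
func (s *store) dump() ([]byte, error) {
	return json.Marshal(snapshot{LastIndex: s.lastApplied, State: s.state})
}

// restore loads a snapshot into a fresh store.
func restore(b []byte) (*store, error) {
	var snap snapshot
	if err := json.Unmarshal(b, &snap); err != nil {
		return nil, err
	}
	if snap.State == nil {
		snap.State = map[string]string{}
	}
	return &store{lastApplied: snap.LastIndex, state: snap.State}, nil
}

// replay applies, in order, only the entries past the snapshot's index.
func (s *store) replay(entries []logEntry) error {
	for _, e := range entries {
		if e.Index <= s.lastApplied {
			continue
		}
		if err := s.apply(e); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	entries := []logEntry{
		mustEntry(1, command{Op: "set", Key: "leader", Value: "node-a"}),
		mustEntry(2, command{Op: "set", Key: "leader", Value: "node-b"}),
		mustEntry(3, command{Op: "delete", Key: "leader"}),
		mustEntry(4, command{Op: "set", Key: "leader", Value: "node-c"}),
	}
	s := newStore()
	for _, e := range entries[:2] {
		if err := s.apply(e); err != nil {
			panic(err)
		}
	}
	snap, err := s.dump() // the snapshot covers entries 1..2
	if err != nil {
		panic(err)
	}
	restored, err := restore(snap) // simulate a node restart
	if err != nil {
		panic(err)
	}
	if err := restored.replay(entries); err != nil {
		panic(err)
	}
	fmt.Println(restored.lastApplied, restored.state["leader"])
}
```

Choosing "set" over something like "increment" is what makes reapplication safe here; a non-idempotent operation would need the application itself to track which entries it had already applied.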

One of my use cases is completely satisfied with the existing implementations of BoltDB and of the filesystem snapshot.

However, in another (orchestrator), my app stores its state in a relational backend. To that end, I’ve modified the logstore and snapshot store. I’m using either MySQL or sqlite as backend stores for my app. How does that affect my raft use? Continue reading » “Observations on the hashicorp/raft library, and notes on RDBMS”