MySQL master discovery methods, part 3: app & service discovery

This is the third in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner in which replication failure detection and recovery take place. I will share orchestrator-specific configuration/advice, and point out where a cross-DC orchestrator/raft setup plays a part in discovery itself, but for the most part any recovery tool, such as MHA, replication-manager, severalnines or others, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

App & service discovery

Part 1 and part 2 presented solutions where the app remained ignorant of the master’s identity. This part takes the complete opposite direction and gives the app ownership of master access.

We introduce a service discovery component. Commonly known examples are Consul, ZooKeeper and etcd: highly available stores offering key/value (K/V) access, leader election, or full-blown service discovery & health checks.

We satisfy ourselves with K/V functionality. A key would be mysql/master/cluster1 and a value would be the master’s hostname/port.

It is the app’s responsibility at all times to fetch the identity of the master of a given cluster by querying the service discovery component, thereby opening connections to the indicated master.

The service discovery component is expected to be up at all times and to contain the identity of the master for any given cluster.
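
To make this concrete, here is a minimal sketch of such a fetch in Go, using the github.com/hashicorp/consul/api client. The key and cluster name follow the example above; the Consul address and the assumption that the value is a plain hostname:port string are illustration only:

  package main

  import (
      "fmt"
      "log"

      consul "github.com/hashicorp/consul/api"
  )

  func main() {
      // Connect to the local Consul agent (default address 127.0.0.1:8500).
      client, err := consul.NewClient(consul.DefaultConfig())
      if err != nil {
          log.Fatal(err)
      }
      // Fetch the master's identity for cluster1.
      kvPair, _, err := client.KV().Get("mysql/master/cluster1", nil)
      if err != nil || kvPair == nil {
          log.Fatal("cannot resolve master for cluster1: ", err)
      }
      master := string(kvPair.Value) // assumed to hold "hostname:port"
      fmt.Println("connecting to master at", master)
      // ... open MySQL connections to master from here on ...
  }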

An unplanned failover, illustration #1

Master M has died. R gets promoted in its place. Our recovery tool:

  • Updates the service discovery component: key is mysql/master/cluster1, value is R’s hostname (see the sketch below).
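
orchestrator updates Consul natively when so configured (see the sample configuration further below). For a tool without native support, a post-failover hook could write the key itself; here is a minimal sketch in Go against Consul, with hypothetical command line arguments:

  package main

  import (
      "log"
      "os"

      consul "github.com/hashicorp/consul/api"
  )

  // Hypothetical hook usage: update-master-kv <cluster> <new-master-host:port>
  func main() {
      cluster, newMaster := os.Args[1], os.Args[2]

      client, err := consul.NewClient(consul.DefaultConfig())
      if err != nil {
          log.Fatal(err)
      }
      // Overwrite the cluster's master entry with the newly promoted master.
      _, err = client.KV().Put(&consul.KVPair{
          Key:   "mysql/master/" + cluster,
          Value: []byte(newMaster),
      }, nil)
      if err != nil {
          log.Fatal(err)
      }
  }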

Clients:

  • Listen on K/V changes and recognize that the master’s value has changed (see the sketch below).
  • Reconfigure/refresh/reload/do whatever it takes to speak to the new master and to drop connections to the old master.
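
Listening on K/V changes maps to a watch in ZooKeeper/etcd, or to blocking queries in Consul. A minimal sketch of the latter in Go; the actual reconfiguration is left to the onChange callback:

  package main

  import (
      "log"

      consul "github.com/hashicorp/consul/api"
  )

  // watchMaster blocks on the cluster's master key and invokes onChange
  // whenever the stored master identity changes.
  func watchMaster(client *consul.Client, cluster string, onChange func(master string)) {
      var lastIndex uint64
      var lastMaster string
      for {
          // Blocking query: returns when the key changes or the wait time elapses.
          kvPair, meta, err := client.KV().Get("mysql/master/"+cluster,
              &consul.QueryOptions{WaitIndex: lastIndex})
          if err != nil {
              log.Print(err)
              continue // a real client would back off before retrying
          }
          lastIndex = meta.LastIndex
          if kvPair != nil && string(kvPair.Value) != lastMaster {
              lastMaster = string(kvPair.Value)
              onChange(lastMaster)
          }
      }
  }

  func main() {
      client, err := consul.NewClient(consul.DefaultConfig())
      if err != nil {
          log.Fatal(err)
      }
      watchMaster(client, "cluster1", func(master string) {
          log.Print("master changed to ", master)
          // reconfigure/refresh/reload connections here
      })
  }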

An unplanned failover, illustration #2

Master M gets network isolated for 10 seconds, during which time we fail over. R gets promoted. Our tool (as before):

  • Updates the service discovery component: key is mysql/master/cluster1, value is R’s hostname.

Clients (as before):

  • Listen on K/V changes and recognize that the master’s value has changed.
  • Reconfigure/refresh/reload/do whatever it takes to speak to the new master and to drop connections to the old master.
  • Any change that does not take place in a timely manner implies some connections are still using the old master M.

Planned failover illustration

We wish to replace the master, for maintenance reasons. We successfully and gracefully promote R.

  • App should start connecting to R.

Discussion

The app is the complete owner of master discovery. This raises a few concerns:

  • How does a given app refresh and apply the change of master such that no stale connections are kept? (See the sketch after this list.)
    • Highly concurrent apps may be more difficult to manage.
  • In a polyglot app setup, you will need all clients to follow the same scheme: implement the same listen/refresh logic in Ruby, golang, Java, Python, Perl and, notably, shell scripts.
    • The latter do not play well with such changes.
  • How can you validate that the change of master has been detected by all app nodes?
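
As one possible approach for a Go app, and only as a sketch: keep the connection pool behind an atomic value and swap it when the master changes, so that new queries reach the new master while the old pool gets closed. The driver, DSN and credentials below are placeholders:

  package main

  import (
      "database/sql"
      "fmt"
      "log"
      "sync/atomic"

      _ "github.com/go-sql-driver/mysql"
  )

  // masterPool holds the current *sql.DB; request handlers load it, the K/V
  // watcher swaps it on master change.
  var masterPool atomic.Value

  // onMasterChange opens a pool against the new master and retires the old one.
  func onMasterChange(newMaster string) error {
      dsn := fmt.Sprintf("app:app_password@tcp(%s)/appdb", newMaster) // hypothetical credentials/schema
      db, err := sql.Open("mysql", dsn)
      if err != nil {
          return err
      }
      old, _ := masterPool.Load().(*sql.DB)
      masterPool.Store(db)
      if old != nil {
          old.Close() // prevents new queries on the old master; in-flight queries complete first
      }
      return nil
  }

  func main() {
      // Typically wired to the K/V watcher, e.g. watchMaster(client, "cluster1", ...).
      if err := onMasterChange("replica-r.example.com:3306"); err != nil { // hypothetical promoted master
          log.Fatal(err)
      }
      db := masterPool.Load().(*sql.DB)
      _ = db // use db for writes from here on
  }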

As for the service discovery:

  • What load will you be placing on your service discovery component?
    • I was familiar with a setup where there were so many apps, app nodes and app instances that the number of connections was too much for the service discovery component. In that setup, caching layers were created, which introduced their own consistency problems.
  • How do you handle a service discovery outage?
    • A reasonable approach is to keep using the last known master identity should service discovery be down, as sketched below. This, again, plays better with higher-level applications, but less so with scripts.
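
A minimal sketch of that fallback, again in Go against Consul: cache the last successfully fetched identity and reuse it when the lookup fails. Names and error handling are illustration only:

  package main

  import (
      "errors"
      "log"

      consul "github.com/hashicorp/consul/api"
  )

  var lastKnownMaster string // last identity successfully fetched from service discovery

  // resolveMaster prefers a fresh answer from service discovery and falls back
  // to the last known master identity if the lookup fails.
  func resolveMaster(client *consul.Client, cluster string) (string, error) {
      kvPair, _, err := client.KV().Get("mysql/master/"+cluster, nil)
      if err == nil && kvPair != nil {
          lastKnownMaster = string(kvPair.Value)
          return lastKnownMaster, nil
      }
      if lastKnownMaster != "" {
          log.Print("service discovery unavailable, using last known master")
          return lastKnownMaster, nil
      }
      return "", errors.New("no master identity available for " + cluster)
  }

  func main() {
      client, err := consul.NewClient(consul.DefaultConfig())
      if err != nil {
          log.Fatal(err)
      }
      master, err := resolveMaster(client, "cluster1")
      if err != nil {
          log.Fatal(err)
      }
      log.Print("master is ", master)
  }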

It is worth noting that this setup does not suffer from geographical limitations to the master’s identity. The master can be anywhere; the service discovery component merely points out where the master is.

Sample orchestrator configuration

An orchestrator configuration would look like this:

  "ApplyMySQLPromotionAfterMasterFailover": true,
  "KVClusterMasterPrefix": "mysql/master",
  "ConsulAddress": "127.0.0.1:8500",
  "ZkAddress": "srv-a,srv-b:12181,srv-c",
  "PostMasterFailoverProcesses": [
    "/just/let/me/know about failover on {failureCluster}",
  ],

In the above:

  • If ConsulAddress is specified, orchestrator will update the given Consul setup with K/V changes.
  • As of orchestrator 3.0.10, ZooKeeper (via ZkAddress) is not yet supported.
  • PostMasterFailoverProcesses is shown here just to point out that hooks are not strictly required for the operation to run.

See the orchestrator configuration documentation.
