This is the first in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.
These posts are not concerned with the manner in which replication failure detection and recovery take place. I will share orchestrator-specific configuration/advice, and point out where a cross-DC orchestrator/raft setup plays a part in discovery itself, but for the most part any recovery tool, such as MHA, replication-manager, severalnines or other, is applicable.
We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).
Master discovery via DNS
In DNS master discovery, applications connect to the master via a name that gets resolved to the master's box. By way of example, apps would target the masters of different clusters by connecting to cluster1-writer.example.net, cluster2-writer.example.net, etc. It is up to the DNS to resolve those names to IPs.
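As a quick, hypothetical illustration (the IPs are made up), resolving those names with dig might look like this:

$ dig +short cluster1-writer.example.net
10.0.1.17
$ dig +short cluster2-writer.example.net
10.0.2.17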
Issues for concern are:
- You will likely have multiple DNS servers. How many? In which data centers / availability zones?
- What is your method for distributing/deploying a name change to all your DNS servers?
- DNS will indicate a TTL (Time To Live) such that clients can cache the IP associated with a name for a given number of seconds. What is that TTL? (A quick way to inspect it is shown right after this list.)
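To check the TTL a name is served with, look at the answer section of a lookup; the record below is hypothetical, and note that a caching resolver reports the remaining TTL, counting down toward zero:

$ dig +noall +answer cluster1-writer.example.net
cluster1-writer.example.net. 60 IN A 10.0.1.17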
As long as things are stable and going well, discovery via DNS makes sense. Trouble begins when the master fails over. Assume M used to be the master but got demoted. Assume R used to be a replica, which got promoted and is now effectively the master of the topology.
Our failover solution has promoted R, and now needs to somehow apply the change such that the apps connect to R instead of M. Some notes:
- The apps need not change configuration. They should still connect to cluster1-writer.example.net, cluster2-writer.example.net, etc.
- Our tool instructs DNS servers to make the change.
- Clients will still resolve to the old IP based on TTL.
An unplanned failover illustration #1
Master M dies. R gets promoted. Our tool instructs all DNS servers in all DCs to update the IP address.
Say the TTL is 60 seconds. Say the update to all DNS servers takes 10 seconds. We will have between 10 and 70 seconds until all clients connect to the new master R: in the best case a client's cached entry expires just as its DNS server picks up the change, and in the worst case a client caches the old IP right before the last DNS server updates and holds on to it for another 60 seconds.
During that time they will continue to attempt connecting to M. Since M is dead, those attempts will fail (thankfully).
An unplanned failover illustration #2
Master M gets network isolated for 30 seconds, during which time we fail over. R gets promoted. Our tool instructs all DNS servers in all DCs to update the IP address.
Again, assume the TTL is 60 seconds. As before, it will take between 10 and 70 seconds for clients to learn of the new IP.
Clients who take between 40 and 70 seconds to learn of the new IP will, however, hit an unfortunate scenario: the old master M reappears on the grid. Those clients will successfully reconnect to M and issue writes, leading to data loss (writes to M no longer replicate anywhere).
Planned failover illustration
We wish to replace the master for maintenance reasons. We successfully and gracefully promote R. We need to change DNS records. Since this is a planned failover, we set the old master to read_only=1, or even better, we network isolate it.
And still our clients take 10 to 70 seconds to recognize the new master.
Discussion
The above numbers are just illustrative. Perhaps DNS deployment is quicker than 10 seconds. You should do your own math.
TTL is a compromise which you can tune. Setting a lower TTL will mitigate the problem, but will cause more hits on the DNS servers.
For a planned takeover we can first deploy a change to the TTL, to, say, 2 seconds, wait 60 seconds (the old TTL, so that all cached entries expire and clients pick up the short TTL), then deploy the IP change, then restore the TTL to 60. A rough sketch of that sequence follows.
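The sketch below assumes the zone allows RFC 2136 dynamic updates; the name server, key file, record name and IPs are all made up, and the promotion step itself is left out:

#!/bin/bash
set -eu

# Point the writer record at a given IP with a given TTL (hypothetical zone, key and server).
set_writer_record() {   # usage: set_writer_record <ttl> <ip>
  nsupdate -k /etc/dns/update.key <<EOF
server ns1.example.net
zone example.net
update delete cluster1-writer.example.net. A
update add cluster1-writer.example.net. $1 A $2
send
EOF
}

set_writer_record 2 10.0.1.17    # 1. lower the TTL, still pointing at the old master
sleep 60                         # 2. wait out the old 60-second TTL so all caches pick up the short one
# 3. gracefully promote the new master here (tool-specific, not shown)
set_writer_record 60 10.0.1.23   # 4. point the name at the new master and restore the TTL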
You may choose to restart apps upon DNS deployment; a restart forces them to re-resolve the name, effectively making them immediately aware of the change.
Sample orchestrator configuration
orchestrator configuration would look like this:
"ApplyMySQLPromotionAfterMasterFailover": true,
"PostMasterFailoverProcesses": [
"/do/what/you/gotta/do to apply dns change for {failureClusterAlias}-writer.example.net to {successorHost}"
],
In the above:
- ApplyMySQLPromotionAfterMasterFailover instructs orchestrator to set read_only=0; reset slave all on the promoted server.
- PostMasterFailoverProcesses really depends on your setup, but orchestrator will supply hints to your scripts: the identity of the cluster and the identity of the successor.
See orchestrator configuration documentation.
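For example, PostMasterFailoverProcesses could invoke a small wrapper script that turns those hints into a DNS change. The sketch below again assumes RFC 2136 dynamic updates; the script path, key file and name server are hypothetical:

#!/bin/bash
# Hypothetical hook, configured in PostMasterFailoverProcesses as e.g.:
#   "/usr/local/bin/update-writer-dns {failureClusterAlias} {successorHost}"
set -eu

cluster_alias="$1"      # e.g. "cluster1", supplied via {failureClusterAlias}
successor_host="$2"     # the promoted server's hostname, supplied via {successorHost}
writer_name="${cluster_alias}-writer.example.net."

# Resolve the promoted master's IP (assumes the successor hostname resolves).
successor_ip="$(dig +short "${successor_host}" | head -n1)"

# Apply the change on the authoritative server (hypothetical zone, key and server).
nsupdate -k /etc/dns/update.key <<EOF
server ns1.example.net
zone example.net
update delete ${writer_name} A
update add ${writer_name} 60 A ${successor_ip}
send
EOF

Whether you use dynamic updates, your DNS provider's API or a configuration management run is up to you; the point is that orchestrator only hands your script the cluster alias and the successor, and turning that into a DNS change is your script's job.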
All posts in this series
- MySQL master discovery methods, part 1: DNS
- MySQL master discovery methods, part 2: VIP & DNS
- MySQL master discovery methods, part 3: app & service discovery
- MySQL master discovery methods, part 4: Proxy heuristics
- MySQL master discovery methods, part 5: Service discovery & Proxy
- MySQL master discovery methods, part 6: other methods