Replication – code.openark.org http://shlomi-noach.github.io/blog/ Blog by Shlomi Noach Mon, 11 May 2020 06:28:06 +0000 en-US hourly 1 https://wordpress.org/?v=5.3.3 32412571 Quick hack for GTID_OWN lack https://shlomi-noach.github.io/blog/mysql/quick-hack-for-gtid_own-lack https://shlomi-noach.github.io/blog/mysql/quick-hack-for-gtid_own-lack#respond Wed, 11 Dec 2019 08:00:00 +0000 https://shlomi-noach.github.io/blog/?p=7974 One of the benefits of MySQL GTIDs is that each server remembers all GTID entries ever executed. Normally these would be ranges, e.g. 0041e600-f1be-11e9-9759-a0369f9435dc:1-3772242 or multi-ranges, e.g. 24a83cd3-e30c-11e9-b43d-121b89fcdde6:1-103775793, 2efbcca6-7ee1-11e8-b2d2-0270c2ed2e5a:1-356487160, 46346470-6561-11e9-9ab7-12aaa4484802:1-26301153, 757fdf0d-740e-11e8-b3f2-0a474bcf1734:1-192371670, d2f5e585-62f5-11e9-82a5-a0369f0ed504:1-10047.

One of the common problems in asynchronous replication is the issue of consistent reads. I’ve just written to the master. Is the data available on a replica yet? We have iterated on this, from reading on master, to heuristically finding up-to-date replicas based on heartbeats (see presentation and slides) via freno, and now settled, on some parts of our apps, to using GTID.

GTIDs are reliable as any replica can give you a definitive answer to the question: have you applied a given transaction or not?. Given a GTID entry, say f7b781a9-cbbd-11e9-affb-008cfa542442:12345, one may query for the following on a replica:

mysql> select gtid_subset('f7b781a9-cbbd-11e9-affb-008cfa542442:12345', @@global.gtid_executed) as transaction_found;
+-------------------+
| transaction_found |
+-------------------+
|                 1 |
+-------------------+

mysql> select gtid_subset('f7b781a9-cbbd-11e9-affb-008cfa542442:123450000', @@global.gtid_executed) as transaction_found;
+-------------------+
| transaction_found |
+-------------------+
|                 0 |
+-------------------+

Getting OWN_GTID

This is all well, but, given some INSERT or UPDATE on the master, how can I tell what’s the GTID associated with that transaction? There\s good news and bad news.

  • Good news is, you may SET SESSION session_track_gtids = OWN_GTID. This makes the MySQL protocol return the GTID generated by your transaction.
  • Bad news is, this isn’t a standard SQL response, and the common MySQL drivers offer you no way to get that information!

At GitHub we author our own Ruby driver, and have implemented the functionality to extract OWN_GTID, much like you’d extract LAST_INSERT_ID. But, how does one solve that without modifying the drivers? Here’s a poor person’s solution which gives you an inexact, but good enough, info. Following a write (insert, delete, create, …), run:

select gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), @@global.gtid_executed)) as master_generated_gtid;

The idea is to “clean” the executed GTID set from irrelevant entries, by filtering out all ranges that do not belong to the server you’ve just written to (the master). The number 1000000000000000 stands for “high enough value that will never be reached in practice” – set to your own preferred value, but this value should take you beyond 300 years assuming 100,000 transactions per second.

The value you get is the range on the master itself. e.g.:

mysql> select gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), @@global.gtid_executed)) as master_generated_gtid;
+-------------------------------------------------+
| master_generated_gtid                           |
+-------------------------------------------------+
| dc103953-1598-11ea-82a7-008cfa5440e4:1-35807176 |
+-------------------------------------------------+

You may further parse the above to extract dc103953-1598-11ea-82a7-008cfa5440e4:35807176 if you want to hold on to the latest GTID entry. Now, this entry isn’t necessarily your own. Between the time of your write and the time of your GTID query, other writes will have taken place. But the entry you get is either your own or a later one. If you can find that entry on a replica, that means your write is included on the replica.

One may wonder, why do we need to extract the value at all? Why not just select @@global.gtid_executed? Why filter only the master‘s UUID? Logically, the answer is the same if you do that. But in practice, your query may be unfortunate enough to return some:

select @@global.gtid_executed \G

e71f0cdb-b8ef-11e9-9361-008cfa542442:1-83331,
e742d87f-dea7-11e9-be6d-008cfa542c9e:1-18485,
e7880c0e-ac54-11e9-865a-008cfa544064:1-7331973,
e82043c6-c7d9-11e9-9413-008cfa5440e4:1-61692,
e902678b-b046-11e9-a281-008cfa542c9e:1-83108,
e90d7ff9-e35e-11e9-a9a0-008cfa544064:1-18468,
e929a635-bb40-11e9-9c0d-008cfa5440e4:1-139348,
e9351610-ef1b-11e9-9db4-008cfa5440e4:1-33460918,
e938578d-dc41-11e9-9696-008cfa542442:1-18232,
e947f165-cd53-11e9-b7a1-008cfa5440e4:1-18480,
e9733f37-d537-11e9-8604-008cfa5440e4:1-18396,
e97a0659-e423-11e9-8433-008cfa542442:1-18237,
e98dc1f7-e0f8-11e9-9bbd-008cfa542c9e:1-18482,
ea16027a-d20e-11e9-9845-008cfa542442:1-18098,
ea1e1aa6-e74a-11e9-a7f2-008cfa544064:1-18450,
ea8bc1bd-dd06-11e9-a10c-008cfa542442:1-18203,
eae8c750-aaca-11e9-b17c-008cfa544064:1-85990,
eb1e41e9-af81-11e9-9ceb-008cfa544064:1-86220,
eb3c9b3b-b698-11e9-b67a-008cfa544064:1-18687,
ec6daf7e-b297-11e9-a8a0-008cfa542c9e:1-80652,
eca4af92-c965-11e9-a1f3-008cfa542c9e:1-18333,
ecd110b9-9647-11e9-a48f-008cfa544064:1-24213,
ed26890e-b10b-11e9-a79d-008cfa542c9e:1-83450,
ed92b3bf-c8a0-11e9-8612-008cfa542442:1-18223,
eeb60c82-9a3d-11e9-9ea5-008cfa544064:1-1943152,
eee43e06-c25d-11e9-ba23-008cfa542442:1-105102,
eef4a7fb-b438-11e9-8d4b-008cfa5440e4:1-74717,
eefdbd3b-95b3-11e9-833d-008cfa544064:1-39415,
ef087062-ba7b-11e9-92de-008cfa5440e4:1-9726172,
ef507ff0-98b3-11e9-8b15-008cfa5440e4:1-928030,
ef662471-9a3b-11e9-bd2e-008cfa542c9e:1-954800,
f002e9f7-97ee-11e9-bed0-008cfa542c9e:1-5180743,
f0233228-e9a1-11e9-a142-008cfa542c9e:1-18583,
f04780c4-a864-11e9-9f28-008cfa542c9e:1-83609,
f048acd9-b1d2-11e9-a0b6-008cfa544064:1-70663,
f0573d8c-9978-11e9-9f73-008cfa542c9e:1-85642135,
f0b0a37c-c89c-11e9-804c-008cfa5440e4:1-18488,
f0cfe1ac-e5af-11e9-bc09-008cfa542c9e:1-18552,
f0e4997c-cbc9-11e9-9179-008cfa542442:1-1655552,
f24e481c-b5c4-11e9-aff0-008cfa5440e4:1-83015,
f4578c4b-be6d-11e9-982e-008cfa5440e4:1-132701,
f48bce80-e99f-11e9-94f4-a0369f9432f4:1-18460,
f491adf1-9b04-11e9-bc71-008cfa542c9e:1-962823,
f5d3db74-a929-11e9-90e8-008cfa5440e4:1-75379,
f6696ba7-b750-11e9-b458-008cfa542c9e:1-83096,
f714cb4c-dab7-11e9-adb9-008cfa544064:1-18413,
f7b781a9-cbbd-11e9-affb-008cfa542442:1-18169,
f81f7729-b10d-11e9-b29b-008cfa542442:1-86820,
f88a3298-e903-11e9-88d0-a0369f9432f4:1-18548,
f9467b29-d78c-11e9-b1a2-008cfa5440e4:1-18492,
f9c08f5c-e4ea-11e9-a76c-008cfa544064:1-1667611,
fa633abf-cee3-11e9-9346-008cfa542442:1-18361,
fa8b0e64-bb42-11e9-9913-008cfa542442:1-140089,
fa92234c-cc90-11e9-b337-008cfa544064:1-18324,
fa9755eb-e425-11e9-907d-008cfa542c9e:1-1668270,
fb7843d5-eb38-11e9-a1ff-a0369f9432f4:1-1668957,
fb8ceae5-dd08-11e9-9ed3-008cfa5440e4:1-18526,
fbf9970e-bc07-11e9-9e4f-008cfa5440e4:1-136157,
fc0ffaee-98b1-11e9-8574-008cfa542c9e:1-940999,
fc9bf1e4-ee54-11e9-9ce9-008cfa542c9e:1-18189,
fca4672f-ac56-11e9-8a83-008cfa542442:1-82014,
fcebaa05-dab5-11e9-8356-008cfa542c9e:1-18490,
fd0c88b1-ad1b-11e9-bf3a-008cfa5440e4:1-75167,
fd394feb-e4e4-11e9-bd09-008cfa5440e4:1-18574,
fd687577-b048-11e9-b429-008cfa542442:1-83479,
fdb18995-a79f-11e9-a28d-008cfa542442:1-82351,
fdc72b7f-b696-11e9-ade9-008cfa544064:1-57674,
ff1f3b6b-c967-11e9-ae04-008cfa544064:1-18503,
ff6fe7dc-c186-11e9-9bb4-008cfa5440e4:1-103192,
fff9dd94-ed95-11e9-90b7-008cfa544064:1-911039

This can happen when you fail over to a new master, multiple times; it happens when you don’t recycle UUIDs, when you provision new hosts and let MySQL pick their UUID. Returning this amount of data per query is an excessive overhead, hence why we extract the master‘s UUID only, which is guaranteed to be limited in size.

]]>
https://shlomi-noach.github.io/blog/mysql/quick-hack-for-gtid_own-lack/feed 0 7974
Un-split brain MySQL via gh-mysql-rewind https://shlomi-noach.github.io/blog/mysql/un-split-brain-mysql-via-gh-mysql-rewind https://shlomi-noach.github.io/blog/mysql/un-split-brain-mysql-via-gh-mysql-rewind#respond Tue, 05 Mar 2019 13:51:43 +0000 https://shlomi-noach.github.io/blog/?p=7928 We are pleased to release gh-mysql-rewind, a tool that allows us to move MySQL back in time, automatically identify and rewind split brain changes, restoring a split brain server into a healthy replication chain.

I recently had the pleasure of presenting gh-mysql-rewind at FOSDEM. Video and slides are available. Consider following along with the video.

Motivation

Consider a split brain scenario: a “standard” MySQL replication topology suffered network isolation, and one of the replicas was promoted as new master. Meanwhile, the old master was still receiving writes from co-located apps.

Once the network isolation is over, we have a new master and an old master, and a split-brain situation: some writes only took place on one master; others only took place on the other. What if we wanted to converge the two? What paths do we have to, say, restore the old, demoted master, as a replica of the newly promoted master?

The old master is unlikely to agree to replicate from the new master. Changes have been made. AUTO_INCREMENT values have been taken. UNIQUE constraints will fail.

A few months ago, we at GitHub had exactly this scenario. An entire data center went network isolated. Automation failed over to a 2nd DC. Masters in the isolated DC meanwhile kept receiving writes. At the end of the failover we ended up with a split brain scenario – which we expected. However, an additional, unexpected constraint forced us to fail back to the original DC.

We had to make a choice: we’ve already operated for a long time in the 2nd DC and took many writes, that we were unwilling to lose. We were OK to lose (after auditing) the few seconds of writes on the isolated DC. But, how do we converge the data?

Backups are the trivial way out, but they incur long recovery time. Shipping backup data over the network for dozens of servers takes time. Restore time, catching up with changes since backup took place, warming up the servers so that they can handle production traffic, all take time.

Could we have reduces the time for recovery?

There are multiple ways to do that: local backups, local delayed replicas, snapshots… We have embarked on several. In this post I wish to outline gh-mysql-rewind, which programmatically identifies the rogue (aka “bad”) transactions on the network isolated master, rewinds/reverts them, applies some bookkeeping and restores the demoted master as a healthy replica under the newly promoted master, thereby prepared to be promoted if needed.

General overview

gh-mysql-rewind is a shell script. It utilizes multiple technologies, some of which do not speak to each other, to be able to do the magic. It assumes and utilizes the following:

Some breakdown follows.

GTID

MySQL GTIDs keep track of all transactions executed on a given server. GTIDs indicate which server (UUID) originated a write, and ranges of transaction sequences. In a clean state, only one writer will generate GTIDs, and on all the replicas we would see the same GTID set, originated with the writer’s UUID.

In a split brain scenario, we would see divergence. It is possible to use GTID_SUBTRACT(old_master-GTIDs, new-master-GTIDs) to identify the exact set of transactions executed on the old, demoted master, right after the failover. This is the essence of the split brain.

For example, assume that just before the network partition, GTID on the master was 00020192-1111-1111-1111-111111111111:1-5000. Assume after the network partition the new master has UUID of 00020193-2222-2222-2222-222222222222. It began to take writes, and after some time its GTID set showed 00020192-1111-1111-1111-111111111111:1-5000,00020193-2222-2222-2222-222222222222:1-200.

On the demoted master, other writes took place, leading to the GTID set 00020192-1111-1111-1111-111111111111:1-5042.

We will run…

SELECT GTID_SUBTRACT(
  '00020192-1111-1111-1111-111111111111:1-5042',
  '00020192-1111-1111-1111-111111111111:1-5000,00020193-2222-2222-2222-222222222222:1-200'
);

> '00020192-1111-1111-1111-111111111111:5001-5042'

…to identify the exact set of “bad transactions” on the demoted master.

Row Based Replication

With row based replication, and with FULL image format, each DML (INSERT, UPDATE, DELETE) writes to the binary log the complete row data before and after the operation. This means the binary log has enough information for us to revert the operation.

Flashback

Developed by Alibaba, flashback has been incorporated in MariaDB. MariaDB’s mysqlbinlog utility supports a --flashback flag, which interprets the binary log in a special way. Instead of printing out the events in the binary log in order, it prints the inverted operations in reverse order.

To illustrate, let’s assume this pseudo-code sequence of events in the binary log:

insert(1, 'a')
insert(2, 'b')
insert(3, 'c')
update(2, 'b')->(2, 'second')
update(3, 'c')->(3, 'third')
insert(4, 'd')
delete(1, 'a')

A --flashback of this binary log would produce:

insert(1, 'a')
delete(4, 'd')
update(3, 'third')->(3, 'c')
update(2, 'second')->(2, 'b')
delete(3, 'c')
delete(2, 'b')
delete(1, 'a')

Alas, MariaDB and flashback do not speak MySQL GTID language. GTIDs are one of the major points where MySQL and MariaDB have diverged beyond compatibility.

The output of MariaDB’s mysqlbinlog --flashback has neither any mention of GTIDs, nor does the tool take notice of GTIDs in the binary logs in the first place.

gh-mysql-rewind

This is where we step in. GTIDs provide the information about what went wrong. flashback has the mechanism to generate the reverse sequence of statements. gh-mysql-rewind:

  • uses GTIDs to detect what went wrong
  • correlates those GTID entries with binary log files: identifies which binary logs actually contain those GTID events
  • invokes MariaDB’s mysqlbinlog --flashback to generate the reverse of those binary logs
  • injects (dummy) GTID information into the output
  • computes ETA

This last part is worth elaborating. We have created a time machine. We have the mechanics to make it work. But as any Sci-Fi fan knows, one of the most important parts of time travel is knowing ahead where (when) you are going to land. Are you back in the Renaissance? Or are you suddenly to appear on board the French Revolution? Better dress accordingly.

In our scenario it is not enough to move MySQL back in time to some consistent state. We want to know at what time we landed, so that we can instruct the rewinded server to join the replication chain as a healthy replica. In MySQL terms, we need to make MySQL “forget” everything that ever happened after the split brain: not only in terms of data (which we already did), but in terms of GTID history.

gh-mysql-rewind will do the math to project, ahead of time, at what “time” (i.e. GTID set) our time machine arrived. It will issue a `RESET MASTER; SET GLOBAL gtid_purged=’gtid-of-the-landing-time'” to make our re-winded MySQL consistent not only with some past dataset, but also with its own perception of the point in time where that dataset existed.

Limitations

Some limitations are due to MariaDB’s incompatibility with MySQL, some are due to MySQL DDL nature, some due to the fact gh-mysql-rewind is a shell script.

  • Cannot rewind DDL. DDLs are silently ignored, and will impose a problem when trying to re-apply them.
  • JSON, POINT data types are not supported.
  • The logic rewinds the MySQL server farther into the past than strictly required. This simplifies the code considerably, but imposed superfluous time to rewind+reapply, i.e. time to recover.
  • Currently, this only works one server at a time. If a group of 10 servers were network isolated together, the operation would need to run on each of these 10 servers.
  • Runs locally on each server. Requires both MySQL’s mysqlbinlog as well as MariaDB’s mysqlbinlog.

Testing

There’s lot of moving parts to this mechanism. A mixture of technologies that don’t normally speak to each other, injection of data, prediction of ETA… How reliable is all this?

We run continuous gh-mysql-rewind testing in production to consistently prove that it works as expected. Our testing uses a non-production, dedicated, functional replica. It contaminates the data on the replica. It lets gh-mysql-rewind automatically move it back in time, it joins the replica back into the healthy chain.

That’s not enough. We actually create a scenario where we can predict, ahead of testing, what the time-of-arrival will be. We checksum the data on that replica at that time. After contaminating and effectively breaking replication, we expect gh-mysql-rewind to revert the changes back to our predicted point in time. We checksum the data again. We expect 100% match.

See the video or slides for more detail on our testing setup.

Status

At this time the tool in one of several solutions we hope to never need to employ. It is stable and tested. We are looking forward to a promising MySQL development that will provide GTID-revert capabilities using standard commands, such as SELECT undo_transaction('00020192-1111-1111-1111-111111111111:5042').

We have released gh-mysql-rewind as open source, under the MIT license. The public release is a stripped down version of our own script, which has some GitHub-specific integration. We have general ideas in incorporating this functionality into higher level tools.

gh-mysql-rewind is developed by the database-infrastructure team at GitHub.

]]>
https://shlomi-noach.github.io/blog/mysql/un-split-brain-mysql-via-gh-mysql-rewind/feed 0 7928
MySQL master discovery methods, part 5: Service discovery & Proxy https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-5-service-discovery-proxy https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-5-service-discovery-proxy#respond Mon, 14 May 2018 08:08:32 +0000 https://shlomi-noach.github.io/blog/?p=7869 This is the fifth in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which the replication failure detection and recovery take place. I will share orchestrator specific configuration/advice, and point out where cross DC orchestrator/raft setup plays part in discovery itself, but for the most part any recovery tool such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via Service discovery and Proxy

Part 4 presented with an anti-pattern setup, where a proxy would infer the identify of the master by drawing conclusions from backend server checks. This led to split brains and undesired scenarios. The problem was the loss of context.

We re-introduce a service discovery component (illustrated in part 3), such that:

  • The app does not own the discovery, and
  • The proxy behaves in an expected and consistent way.

In a failover/service discovery/proxy setup, there is clear ownership of duties:

  • The failover tool own the failover itself and the master identity change notification.
  • The service discovery component is the source of truth as for the identity of the master of a cluster.
  • The proxy routes traffic but does not make routing decisions.
  • The app only ever connects to a single target, but should allow for a brief outage while failover takes place.

Depending on the technologies used, we can further achieve:

  • Hard cut for connections to old, demoted master M.
  • Black/hold off for incoming queries for the duration of failover.

We explain the setup using the following assumptions and scenarios:

  • All clients connect to master via cluster1-writer.example.net, which resolves to a proxy box.
  • We fail over from master M to promoted replica R.

A non planned failover illustration #1

Master M has died, the box had a power failure. R gets promoted in its place. Our recovery tool:

  • Updates service discovery component that R is the new master for cluster1.

The proxy:

  • Either actively or passively learns that R is the new master, rewires all writes to go to R.
  • If possible, kills existing connections to M.

The app:

  • Needs to know nothing. Its connections to M fail, it reconnects and gets through to R.

A non planned failover illustration #2

Master M gets network isolated for 10 seconds, during which time we failover. R gets promoted.

Everything is as before.

If the proxy kills existing connections to M, then the fact M is back alive turns meaningless. No one gets through to M. Clients were never aware of its identity anyhow, just as they are unaware of R‘s identity.

Planned failover illustration

We wish to replace the master, for maintenance reasons. We successfully and gracefully promote R.

  • In the process of promotion, M turned read-only.
  • Immediately following promotion, our failover tool updates service discovery.
  • Proxy reloads having seen the changes in service discovery.
  • Our app connects to R.

Discussion

This is a setup we use at GitHub in production. Our components are:

  • orchestrator for failover tool.
  • Consul for service discovery.
  • GLB (HAProxy) for proxy
  • Consul template running on proxy hosts:
    • listening on changes to Consul’s KV data
    • Regenerate haproxy.cfg configuration file
    • reload haproxy

As mentioned earlier, the apps need not change anything. They connect to a name that is always resolved to proxy boxes. There is never a DNS change.

At the time of failover, the service discovery component must be up and available, to catch the change. Otherwise we do not strictly require it to be up at all times.

For high availability we will have multiple proxies. Each of whom must listen on changes to K/V. Ideally the name (cluster1-writer.example.net in our example) resolves to any available proxy box.

  • This, in itself, is a high availability issue. Thankfully, managing the HA of a proxy layer is simpler than that of a MySQL layer. Proxy servers tend to be stateless and equal to each other.
  • See GLB as one example for a highly available proxy layer. Cloud providers, Kubernetes, two level layered proxies, Linux Heartbeat, are all methods to similarly achieve HA.

See also:

Sample orchestrator configuration

An orchestrator configuration would look like this:

  "ApplyMySQLPromotionAfterMasterFailover": true,
  "KVClusterMasterPrefix": "mysql/master",
  "ConsulAddress": "127.0.0.1:8500",
  "ZkAddress": "srv-a,srv-b:12181,srv-c",
  "PostMasterFailoverProcesses": [
    “/just/let/me/know about failover on {failureCluster}“,
  ],

In the above:

  • If ConsulAddress is specified, orchestrator will update given Consul setup with K/V changes.
  • At 3.0.10, ZooKeeper, via ZkAddress, is still not supported by orchestrator.
  • PostMasterFailoverProcesses is here just to point out hooks are not strictly required for the operation to run.

See orchestrator configuration documentation.

All posts in this series

]]>
https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-5-service-discovery-proxy/feed 0 7869
MySQL master discovery methods, part 4: Proxy heuristics https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-4-proxy-heuristics https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-4-proxy-heuristics#respond Thu, 10 May 2018 06:10:34 +0000 https://shlomi-noach.github.io/blog/?p=7867 Note: the method described here is an anti pattern

This is the fourth in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which the replication failure detection and recovery take place. I will share orchestrator specific configuration/advice, and point out where cross DC orchestrator/raft setup plays part in discovery itself, but for the most part any recovery tool such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via Proxy Heuristics

In Proxy Heuristics all clients connect to the master through a proxy. The proxy observes the backend MySQL servers and determines who the master is.

This setup is simple and easy, but is an anti pattern. I recommend against using this method, as explained shortly.

Clients are all configured to connect to, say, cluster1-writer.proxy.example.net:3306. The proxy will intercept incoming requests either based on hostname or by port. It is aware of all/some MySQL backend servers in that cluster, and will route traffic to the master M.

A simple heuristic that I’ve seen in use is: pick the server that has read_only=0, a very simple check.

Let’s take a look at how this works and what can go wrong.

A non planned failover illustration #1

Master M has died, the box had a power failure. R gets promoted in its place. Our recovery tool:

  • Fails over, but doesn’t need to run any hooks.

The proxy:

  • Knows both about M and R.
  • Notices M fails health checks (select @@global.read_only returns error since the box is down).
  • Notices R reports healthy and with read_only=0.
  • Routes all traffic to R.

Success, we’re happy.

Configuration tip

With an automated failover solution, use read_only=1 in my.cnf at all times. Only the failover solution will set a server to read_only=0.

With this configuration, when M restarts, MySQL starts up as read_only=1.

A non planned failover illustration #2

Master M gets network isolated for 10 seconds, during which time we failover. R gets promoted. Our tool:

  • Fails over, but doesn’t need to run any hooks.

The proxy:

  • Knows both about M and R.
  • Notices M fails health checks (select @@global.read_only returns error since the box is down).
  • Notices R reports healthy and with read_only=0.
  • Routes all traffic to R.
  • 10 seconds later M comes back to life, claiming read_only=0.
  • The proxy now sees two servers reporting as healthy and with read_only=0.
  • The proxy has no context. It does not know why both are reporting the same. It is unaware of failovers. All it sees is what the backend MySQL servers report.

Therein lies the problem: you can not trust multiple servers (MySQL backends) to deterministically pick a leader (the master) without them collaborating on some elaborate consensus communication.

A non planned failover illustration #3

Master M box is overloaded, issuing too many connections for incoming connections.

Our tool decides to failover.

  • And doesn’t need to run any hooks.

The proxy:

  • Notices M fails health checks (select @@global.read_only does not respond because of the load).
  • Notices R reports healthy and with read_only=0.
  • Routes all traffic to R.
  • Shortly followed by M recovering (since no more writes are sent its way), claiming read_only=0.
  • The proxy now sees two servers reporting as healthy and with read_only=0.

Again, the proxy has no context, and neither do M and R, for that matter. The context (the fact we failed over from M to R) was known to our failover tool, but was lost along the way.

Planned failover illustration

We wish to replace the master, for maintenance reasons. We successfully and gracefully promote R.

  • M is available and responsive, we set it to read_only=1.
  • We set R to read_only=0.
  • All new connections route to R.
  • We should also instruct our Proxy to kill all previous connections to M.

This works very nicely.

Discussion

There is a substantial risk to this method. Correlation between failover and network partitioning/load (illustrations #2 and #3) is reasonable.

The root of the problem is that we expect individual servers to resolve conflicts without speaking to each other: we expect the MySQL servers to correctly claim “I’m the master” without context.

We then add to that problem by using the proxy to “pick a side” without giving it any context, either.

Sample orchestrator configuration

By way of discouraging use of this method I do not present an orchestrator configuration file.

All posts in this series

]]>
https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-4-proxy-heuristics/feed 0 7867
MySQL master discovery methods, part 3: app & service discovery https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-3-app-service-discovery https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-3-app-service-discovery#respond Tue, 08 May 2018 08:02:19 +0000 https://shlomi-noach.github.io/blog/?p=7865 This is the third in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which the replication failure detection and recovery take place. I will share orchestrator specific configuration/advice, and point out where cross DC orchestrator/raft setup plays part in discovery itself, but for the most part any recovery tool such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

App & service discovery

Part 1 and part 2 presented solutions where the app remained ingorant of master’s identity. This part takes a complete opposite direction and gives the app ownership on master access.

We introduce a service discovery component. Commonly known are Consul, ZooKeeper, etcd, highly available stores offering key/value (K/V) access, leader election or full blown service discovery & health.

We satisfy ourselves with K/V functionality. A key would be mysql/master/cluster1 and a value would be the master’s hostname/port.

It is the app’s responsibility at all times to fetch the identity of the master of a given cluster by querying the service discovery component, thereby opening connections to the indicated master.

The service discovery component is expected to be up at all times and to contain the identity of the master for any given cluster.

A non planned failover illustration #1

Master M has died. R gets promoted in its place. Our recovery tool:

  • Updates the service discovery component, key is mysql/master/cluster1, value is R‘s hostname.

Clients:

  • Listen on K/V changes, recognize that master’s value has changed.
  • Reconfigure/refresh/reload/do what it takes to speak to new master and to drop connections to old master.

A non planned failover illustration #2

Master M gets network isolated for 10 seconds, during which time we failover. R gets promoted. Our tool (as before):

  • Updates the service discovery component, key is mysql/master/cluster1, value is R‘s hostname.

Clients (as before):

  • Listen on K/V changes, recognize that master’s value has changed.
  • Reconfigure/refresh/reload/do what it takes to speak to new master and to drop connections to old master.
  • Any changes not taking place in a timely manner imply some connections still use old master M.

Planned failover illustration

We wish to replace the master, for maintenance reasons. We successfully and gracefully promote R.

  • App should start connecting to R.

Discussion

The app is the complete owner. This calls for a few concerns:

  • How does a given app refresh and apply the change of master such that no stale connections are kept?
    • Highly concurrent apps may be more difficult to manage.
  • In a polyglot app setup, you will need all clients to use the same setup. Implement same listen/refresh logic for Ruby, golang, Java, Python, Perl and notably shell scripts.
    • The latter do not play well with such changes.
  • How can you validate that the change of master has been detected by all app nodes?

As for the service discovery:

  • What load will you be placing on your service discovery component?
    • I was familiar with a setup where there were so many apps and app nodes and app instances, such that the amount of connections was too much for the service discovery . In that setup caching layers were created, which introduced their own consistency problems.
  • How do you handle service discovery outage?
    • A reasonable approach is to keep using last known master idendity should service discovery be down. This, again, plays better wih higher level applications, but less so with scripts.

It is worth noting that this setup does not suffer from geographical limitations to the master’s identity. The master can be anywhere; the service discovery component merely points out where the master is.

Sample orchestrator configuration

An orchestrator configuration would look like this:

  "ApplyMySQLPromotionAfterMasterFailover": true,
  "KVClusterMasterPrefix": "mysql/master",
  "ConsulAddress": "127.0.0.1:8500",
  "ZkAddress": "srv-a,srv-b:12181,srv-c",
  "PostMasterFailoverProcesses": [
    “/just/let/me/know about failover on {failureCluster}“,
  ],

In the above:

  • If ConsulAddress is specified, orchestrator will update given Consul setup with K/V changes.
  • At 3.0.10, ZooKeeper, via ZkAddress, is still not supported by orchestrator.
  • PostMasterFailoverProcesses is here just to point out hooks are not strictly required for the operation to run.

See orchestrator configuration documentation.

All posts in this series

]]>
https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-3-app-service-discovery/feed 0 7865
MySQL master discovery methods, part 2: VIP & DNS https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-2-vip-dns https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-2-vip-dns#comments Mon, 07 May 2018 06:46:22 +0000 https://shlomi-noach.github.io/blog/?p=7863 This is the second in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which the replication failure detection and recovery take place. I will share orchestrator specific configuration/advice, and point out where cross DC orchestrator/raft setup plays part in discovery itself, but for the most part any recovery tool such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via VIP

In part 1 we saw that one the main drawbacks of DNS discovery is the time it takes for the apps to connect to the promoted master. This is the result of both DNS deployment time as well as client’s TTL.

A quicker method is offered: use of VIPs (Virtual IPs). As before, apps would connect to cluster1-writer.example.net, cluster2-writer.example.net, etc. However, these would resolve to specific VIPs.

Say cluster1-writer.example.net resolves to 10.10.0.1. We let this address float between servers. Each server has its own IP (say 10.20.0.XXX) but could also potentially claim the VIP 10.10.0.1.

VIPs can be assigned by switches and I will not dwell into the internals, because I’m not a network expert. However, the following holds:

  • Acquiring a VIP is a very quick operation.
  • Acquiring a VIP must take place on the acquiring host.
  • A host may be unable to acquire a VIP should another host holds the same VIP.
  • A VIP can only be assigned within a bounded space: hosts connected to the same switch; hosts in the same Data Center or availability zone.

A non planned failover illustration #1

Master M has died, the box had a power failure. R gets promoted in its place. Our recovery tool:

  • Attempts to connect to M so that it can give up the VIP. The attempt fails because M is dead.
  • Connects to R and instructs it to acquire the VIP. Since M is dead there is no objection, and R successfully grabs the VIP.
  • Any new connections immediately route to the new master R.
  • Clients with connections to M cannot connect, issue retries, immediately route to R.

A non planned failover illustration #2

Master M gets network isolated for 30 seconds, during which time we failover. R gets promoted. Our tool:

  • Attempts to connect to M so that it can give up the VIP. The attempt fails because M is network isolated.
  • Connects to R and instructs it to acquire the VIP. Since M is network isolated there is no objection, and R successfully grabs the VIP.
  • Any new connections immediately route to the new master R.
  • Clients with connections to M cannot connect, issue retries, immediately route to R.
  • 30 seconds later M reappears, but no one pays any attention.

A non planned failover illustration #3

Master M box is overloaded. It is not responsive to new connections but may slowly serves existing connections. Our tool decides to failover:

  • Attempts to connect to M so that it can give up the VIP. The attempt fails because M is very loaded.
  • Connects to R and instructs it to acquire the VIP. Unfortunately, M hasn’t given up the VIP and still shows up as owning it.
  • All existing and new connections keep on routing to M, even as R is the new master.
  • This continues until some time has passed and we are able to manually grab the VIP on R, or until we forcibly network isolate M or forcibly shut it down.

We suffer outage.

Planned failover illustration

We wish to replace the master, for maintenance reasons. We successfully and gracefully promote R.

  • M is available and responsive, we ask it to give up the VIP, which is does.
  • We ask R to grab the VIP, which it does.
  • All new connections route to R.
  • We may still see old connections routing to M. We can forcibly network isolate M to break those connections so as to cause reconnects, or restart apps.

Discussion

As with DNS discovery, the apps are never told of the change. They may be forcibly restarted though.

Grabbing a VIP is a quick operation. However, consider:

  • It is not guaranteed to succeed. I have seen it fail in various situations.
  • Since releasing/acquiring of VIP can only take place on the demoted/promoted servers, respectively, our failover tool will need to:
    • Remote SSH onto both boxes, or
    • Remote exec a command on those boxes
  • Moreover, the tool will do so sequentially. First we must connect to demoted master to give up the VIP, then to promoted master to acquire it.
  • This means the time at which the new master grabs the VIP depends on how long it takes to connect to the old master to give up the VIP. Seeing that the old master had trouble causing failover, we can expect correlation to not being able to connect to old master, or seeing slow connect time.
  • An alternative exists, in the form of Pacemaker. Consider Percona’s Replication Manager guide for more insights. Pacemaker provides a single point of access from where the VIP can be moved, and behind the scenes it will communicate to relevant nodes. This makes it simpler on the failover solution configuration.
  • We are constrained by physical location.
  • It is still possible for existing connection to keep on communicating to the demoted master, even while the VIP has been moved.

VIP & DNS combined

Per physical location, we could choose to use VIP. But should we need to failover to a server in another DC, we could choose to combine the DNS discovery, discussed in part 1.

We can expect to see faster failover time on a local physical location, and longer failover time on remote location.

Sample orchestrator configuration

What kind of remote exec method will you have? In this sample we will use remote (passwordless) SSH.

An orchestrator configuration would look like this:

  "ApplyMySQLPromotionAfterMasterFailover": true,
  "PostMasterFailoverProcesses": [
    "ssh {failedHost} 'sudo ifconfig the-vip-interface down'",
    "ssh {successorHost} 'sudo ifconfig the-vip-interface up'",
    "/do/what/you/gotta/do to apply dns change for {failureClusterAlias}-writer.example.net to {successorHost}"
  ],  

In the above:

  • Replace SSH with any remote exec method you may use.
    • But you will need to set up the access/credentials for orchestrator to run those operations.
  • Replace ifconfig with service quagga stop/start or any method you use to release/grab VIPs.

See orchestrator configuration documentation.

All posts in this series

]]>
https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-2-vip-dns/feed 3 7863
MySQL master discovery methods, part 1: DNS https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-1-dns https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-1-dns#respond Thu, 03 May 2018 10:56:46 +0000 https://shlomi-noach.github.io/blog/?p=7861 This is the first in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which the replication failure detection and recovery take place. I will share orchestrator specific configuration/advice, and point out where cross DC orchestrator/raft setup plays part in discovery itself, but for the most part any recovery tool such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via DNS

In DNS master discovery applications connect to the master via a name that gets resolved to the master’s box. By way of example, apps would target the masters of different clusters by connecting to cluster1-writer.example.net, cluster2-writer.example.net, etc. It is up for the DNS to resolve those names to IPs.

Issues for concern are:

  • You will likely have multiple DNS servers. How many? In which data centers / availability zones?
  • What is your method for distributing/deploying a name change to all your DNS servers?
  • DNS will indicate a TTL (Time To Live) such that clients can cache the IP associated with a name for a given number of seconds. What is that TTL?

As long as things are stable and going well, discovery via DNS makes sense. Trouble begins when the master fails over. Assume M used to be the master, but got demoted. Assume R used to be a replica, that got promoted and is now effectively the master of the topology.

Our failover solution has promoted R, and now needs to somehow apply the change, such that the apps connect to R instead of M. Some notes:

  • The apps need not change configuration. They should still connect to cluster1-writer.example.net, cluster2-writer.example.net, etc.
  • Our tool instructs DNS servers to make the change.
  • Clients will still resolve to old IP based on TTL.

A non planned failover illustration #1

Master M dies. R gets promoted. Our tool instructs all DNS servers on all DCs to update the IP address.

Say TTL is 60 seconds. Say update to all DNS servers takes 10 seconds. We will have between 10 and 70 seconds until all clients connect to the new master R.

During that time they will continue to attempt connecting to M. Since M is dead, those attempts will fail (thankfully).

A non planned failover illustration #2

Master M gets network isolated for 30 seconds, during which time we failover. R gets promoted. Our tool instructs all DNS servers on all DCs to update the IP address.

Again, assume TTL is 60 seconds. As before, it will take between 10 and 70 seconds for clients to learn of the new IP.

Clients who will require between 40 and 70 seconds to learn of the new IP will, however, hit an unfortunate scenario: the old master M reappears on the grid. Those clients will successfully reconnect to M and issue writes, leading to data loss (writes to M no longer replicate anywhere).

Planned failover illustration

We wish to replace the master, for maintenance reasons. We successfully and gracefully promote R. We need to change DNS records. Since this is a planned failover, we set the old master to read_only=1, or even better, we network isolated it.

And still our clients take 10 to 70 seconds to recognize the new master.

Discussion

The above numbers are just illustrative. Perhaps DNS deployment is quicker than 10 seconds. You should do your own math.

TTL is a compromise which you can tune. Setting lower TTL will mitigate the problem, but will cause more hits on the DNS servers.

For planned takeover we can first deploy a change to the TTL, to, say, 2sec, wait 60sec, then deploy the IP change, then restore TTL to 60.

You may choose to restart apps upon DNS deployment. This emulates apps’ awareness of the change.

Sample orchestrator configuration

orchestrator configuration would look like this:

  "ApplyMySQLPromotionAfterMasterFailover": true,
  "PostMasterFailoverProcesses": [
    "/do/what/you/gotta/do to apply dns change for {failureClusterAlias}-writer.example.net to {successorHost}"
  ],  

In the above:

  • ApplyMySQLPromotionAfterMasterFailover instructs orchestrator to set read_only=0; reset slave all on promoted server.
  • PostMasterFailoverProcesses really depends on your setup. But orchestrator will supply with hints to your scripts: identity of cluster, identity of successor.

See orchestrator configuration documentation.

All posts in this series

]]>
https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-1-dns/feed 0 7861
Using dbdeployer in CI tests https://shlomi-noach.github.io/blog/mysql/using-dbdeployer-in-ci-tests https://shlomi-noach.github.io/blog/mysql/using-dbdeployer-in-ci-tests#respond Tue, 20 Feb 2018 07:29:58 +0000 https://shlomi-noach.github.io/blog/?p=7848 I was very pleased when Giuseppe Maxia (aka datacharmer) unveiled dbdeployer in his talk at pre-FOSDEM MySQL day. The announcement came just at the right time. I wish to briefly describe how we use dbdeployer (work in progress).

The case for gh-ost

A user opened an issue on gh-ost, and the user was using MySQL 5.5. gh-ost is being tested on 5.7 where the problem does not reproduce. A discussion with Gillian Gunson raised the concern of not testing on all versions. Can we run gh-ost tests for all MySQL/Percona/MariaDB versions? Should we? How easy would it be?

gh-ost tests

gh-ost has three different test types:

  • Unit tests: these are plain golang logic tests which are very easy and quick to run.
  • Integration tests: the topic of this post, see following. Today these do not run as part of an automated CI testing.
  • System tests: putting our production tables to the test, continuously migrating our production data on dedicated replicas, verifying checksums are identical and data is intact, read more.

Unit tests are already running as part of automated CI (every PR is subjected to those tests). Systems tests are clearly tied to our production servers. What’s the deal with the integration tests?

gh-ost integration tests

The gh-ost integration tests are a suite of scenarios which verify gh-ost‘s operation is sound. These scenarios are mostly concerned with data types, special alter statements etc. Is converting DATETIME to TIMESTAMP working properly? Are latin1 columns being updated correctly? How about renaming a column? Changing a PRIMARY KEY? Column reorder? 5.7 JSON values? And so on. Each test will recreate the table, run migration, stop replication, check the result, resume replication…

The environment for these tests is a master-replica setup, where gh-ost modifies on the table on the replica and can then checksum or compare both the original and the altered ghost table.

We develop gh-ost internally at GitHub, but it’s also an open source project. We have our own internal CI environment, but then we also wish the public to have visibility into test failures (so that a user can submit a PR and get a reliable automated feedback). We use Travis CI for the public facing tests.

To run gh-ost‘s integration tests as described above as part of our CI tests we should be able to:

  • Create a master/replica setup in CI.
  • Actually, create a master/replica setup in any CI, and namely in Travis CI.
  • Actually, create multiple master/replica setups, of varying versions and vendors, in any ci, including both our internal CI and Travis CI.

I was about to embark on a MySQL Sandbox setup, which I was not keen on. But FOSDEM was around the corner and I had other things to complete beforehand. Lucky me, dbdeplyer stepped in.

dbdeployer

dbdeployer is a rewrite, a replacement to MySQL Sandbox. I’ve been using MySQL Sandbox for many years, and my laptop is running two sandboxes at this very moment. But MySQL Sandbox has a few limitations or complications:

  • Perl. Versions of Perl. Dependencies of packages of Perl. I mean, it’s fine, we can automate that.
  • Command line flag complexity: I always get lost in the complexity of the flags.
  • Get it right or prepare for battle: if you deployed something, but not the way you wanted, there’s sometimes limbo situations where you cannot re-deploy the same sandbox again, or you should start deleting files everywhere.
  • Deploy, not remove. Adding a sandbox is one thing. How about removing it?

dbdeployer is a golang rewrite, which solves the dependency problem. It ships as a single binary and nothing more is needed. It is simple to use. While it generates the equivalence of a that of a MySQL Sandbox, it does so with less command line flags and less confusion. There’s first class handling of the MySQL binaries: you unpack MySQL tarballs, you can list what’s available. You can then create sandbox environments: replication, standalone, etc. You can then delete those.

It’s pretty simple and I have not much more to add — which is the best thing about it.

So, with dbdeployer it is easy to create a master/replica. Something like:

dbdeployer unpack path/to/5.7.21.tar.gz --unpack-version=5.7.21 --sandbox-binary ${PWD}/sandbox/binary
dbdeployer replication 5.7.21 --nodes 2 --sandbox-binary ${PWD}/sandbox/binary --sandbox-home ${PWD}/sandboxes --gtid --my-cnf-options log_slave_updates --my-cnf-options log_bin --my-cnf-options binlog_format=ROW

Where does it all fit in, and what about the MySQL binaries though?

So, should dbdeployer be part of the gh-ost repo? And where does one get those MySQL binaries from? Are they to be part of the gh-ost repo? Aren’t they a few GB to extract?

Neither dbdeployer nor MySQL binaries should be added to the gh-ost repo. And fortunately, Giuseppe also solved the MySQL binaries problem.

The scheme I’m looking at right now is as follows:

  • A new public repo, gh-ost-ci-env is created. This repo includes:
    • dbdeployer compiled binaries
    • Minimal MySQL tarballs for selected versions. Those tarballs are reasonably small: between `14MB` and `44MB` at this time.
  • gh-ost‘s CI to git clone https://github.com/github/gh-ost-ci-env.git (code)
  • gh-ost‘s CI to setup a master/replica sandbox (one, two).
  • Kick the tests.

The above is a work in progress:

  • At this time only runs a single MySQL version.
  • There is a known issue where after a test, replication may take time to resume. Currently on slower boxes (such as the Travis CI containers) this leads to failures.

Another concern I have at this time is build time. For a single MySQL version, it takes some 5-7 minutes on my local laptop to run all integration tests. It will be faster on our internal CI. It will be considerably slower on Travis CI, I can expect between 10m - 15m. Add multiple versions and we’re looking at a 1hr build. Such long build times will affect our development and delivery times, and so we will split them off the main build. I need to consider what the best approach is.

That’s all for now. I’m pretty excited for the potential of dbdeployer and will be looking into incorporating the same for orchestrator CI tests.

 

 

]]>
https://shlomi-noach.github.io/blog/mysql/using-dbdeployer-in-ci-tests/feed 0 7848
orchestrator 3.0.6: faster crash detection & recoveries, auto Pseudo-GTID, semi-sync and more https://shlomi-noach.github.io/blog/mysql/orchestrator-3-0-6-faster-crash-detection-recoveries-auto-pseudo-gtid-semi-sync-and-more https://shlomi-noach.github.io/blog/mysql/orchestrator-3-0-6-faster-crash-detection-recoveries-auto-pseudo-gtid-semi-sync-and-more#respond Mon, 29 Jan 2018 09:40:05 +0000 https://shlomi-noach.github.io/blog/?p=7833 orchestrator 3.0.6 is released and includes some exciting improvements and features. It quickly follows up on 3.0.5 released recently, and this post gives a breakdown of some notable changes:

Faster failure detection

Recall that orchestrator uses a holistic approach for failure detection: it reads state not only from the failed server (e.g. master) but also from its replicas. orchestrator now detects failure faster than before:

  • A detection cycle has been eliminated, leading to quicker resolution of a failure. On our setup, where we poll servers every 5sec, failure detection time dropped from 7-10sec to 3-5sec, keeping reliability. The reduction in time does not lead to increased false positives.
    Side note: you may see increased not-quite-failure analysis such as “I can’t see the master” (UnreachableMaster).
  • Better handling of network scenarios where packets are dropped. Instead of hanging till TCP timeout, orchestrator now observes server discovery asynchronously. We have specialized failover tests that simulate dropped packets. The change reduces detection time by some 5sec.

Faster master recoveries

Promoting a new master is a complex task which attempts to promote the best replica out of the pool of replicas. It’s not always the most up-to-date replica. The choice varies depending on replica configuration, version, and state.

With recent changes, orchestrator is able to to recognize, early on, that the replica it would like to promote as master is ideal. Assuming that is the case, orchestrator is able to immediate promote it (i.e. run hooks, set read_only=0 etc.), and run the rest of the failover logic, i.e. the rewiring of replicas under the newly promoted master, asynchronously.

This allows the promoted server to take writes sooner, even while its replicas are not yet connected. It also means external hooks are executed sooner.

Between faster detection and faster recoveries, we’re looking at some 10sec reduction in overall recovery time: from moment of crash to moment where a new master accepts writes. We stand now at < 20sec in almost all cases, and < 15s in optimal cases. Those times are measured on our failover tests.

We are working on reducing failover time unrelated to orchestrator and hope to update soon.

Automated Pseudo-GTID

As reminder, Pseudo-GTID is an alternative to GTID, without the kind of commitment you make with GTID. It provides similar “point your replica under any other server” behavior GTID allows.

There’s still many setups out there where GTID is not (yet?) deployed and enabled. However, Pseudo-GTID is often misunderstood, and though I’ve blogged and presented Pseudo-GTID many times in the past, I still find myself explaining to people the setup is simple and does not involve change to one’s topologies.

Well, it just got simpler. orchestrator is now able to automatically inject Pseudo-GTID for you.

Say the word: "AutoPseudoGTID": true, grant the necessary privilege, and your non-GTID topology is suddenly supercharged with magical Pseudo-GTID tokens that provide you with:

  • Arbitrary relocation of replicas
  • Automated or manual failovers (masters and intermediate masters)
  • Vendor freedom: runs on Oracle MySQL, Percona Server, MariaDB, or all of the above at the very same time.
  • Version freedom (still on 5.5? No problem. Oh, this gets you crash-safe replication as extra bonus, too)

Auto-Pseudo-GTID further simplifies the infrastructure in that you no longer need to take care of injecting Pseudo-GTID onto the master as well as handle master identity changes. No more event_scheduler to enable/disable nor services to start/stop.

More and more setups are moving to GTID. We may, too! But I find it peculiar that Pseudo-GTID was suggested 4 years ago, when 5.6 GTID was already released, and still many setups are not yet running GTID. If you’re not using GTID, please try Pseudo-GTID! Read more.

Semi-sync support

Semi-sync has been internally supported via a specialized patch contributed by Vitess, to flag a server as semi-sync-able and handle enablement of semi-sync upon master failover.

orchestrator now supports semi-sync more generically. You may use orchestrator to enable/disable semi-sync master/replica side, via orchestrator -c enable-semi-sync-master, orchestrator -c enable-semi-sync-replica, orchestrator -c disable-semi-sync-master, orchestrator -c disable-semi-sync-replica commands (or API equivalent).

The API will also tell you whether semi-sync is enabled on instances. Noteworthy that configured != enabled. A server can be configured with rpl_semi_sync_master_enabled=ON, but if no semi-sync replicas are found, the Rpl_semi_sync_master_status state is OFF.

More

UI changes, removal of prepared statements, documentation updates, raft updates…

orchestrator is free and open source and released under the Apache 2 license. It is authored at and used by GitHub.

I’ll be presenting orchestrator/raft in FOSDEM next week, at the MySQL and Friends Room.

]]>
https://shlomi-noach.github.io/blog/mysql/orchestrator-3-0-6-faster-crash-detection-recoveries-auto-pseudo-gtid-semi-sync-and-more/feed 0 7833
orchestrator/raft: Pre-Release 3.0 https://shlomi-noach.github.io/blog/mysql/orchestratorraft-pre-release-3-0 https://shlomi-noach.github.io/blog/mysql/orchestratorraft-pre-release-3-0#comments Thu, 03 Aug 2017 08:41:11 +0000 https://shlomi-noach.github.io/blog/?p=7740 orchestrator 3.0 Pre-Release is now available. Most notable are Raft consensus, SQLite backend support, orchestrator-client no-binary-required client script.

TL;DR

You may now set up high availability for orchestrator via raft consensus, without need to set up high availability for orchestrator‘s backend MySQL servers (such as Galera/InnoDB Cluster). In fact, you can run a orchestrator/raft setup using embedded SQLite backend DB. Read on.

orchestrator still supports the existing shared backend DB paradigm; nothing dramatic changes if you upgrade to 3.0 and do not configure raft.

orchestrator/raft

Raft is a consensus protocol, supporting leader election and consensus across a distributed system.  In an orchestrator/raft setup orchestrator nodes talk to each other via raft protocol, form consensus and elect a leader. Each orchestrator node has its own dedicated backend database. The backend databases do not speak to each other; only the orchestrator nodes speak to each other.

No MySQL replication setup needed; the backend DBs act as standalone servers. In fact, the backend server doesn’t have to be MySQL, and SQLite is supported. orchestrator now ships with SQLite embedded, no external dependency needed.

In a orchestrator/raft setup, all orchestrator nodes talk to each other. One and only one is elected as leader. To become a leader a node must be part of a quorum. On a 3 node setup, it takes 2 nodes to form a quorum. On a 5 node setup, it takes 3 nodes to form a quorum.

Only the leader will run failovers. This much is similar to the existing shared-backend DB setup. However in a orchestrator/raft setup each node is independent, and each orchestrator node runs discoveries. This means a MySQL server in your topology will be routinely visited and probed by not one orchestrator node, but by all 3 (or 5, or what have you) nodes in your raft cluster.

Any communication to orchestrator must take place through the leader. One may not tamper directly with the backend DBs anymore, since the leader is the one authoritative entity to replicate and announce changes to its peer nodes. See orchestrator-client section following.

For details, please refer to the documentation:

The orchetrator/raft setup comes to solve several issues, the most obvious is high availability for the orchestrator service: in a 3 node setup any single orchestrator node can go down and orchestrator will reliably continue probing, detecting failures and recovering from failures.

Another issue solve by orchestrator/raft is network isolation, in particularly cross-DC, also refered to as fencing. Some visualization will help describe the issue.

Consider this 3 data-center replication setup. The master, along with a few replicas, resides on DC1. Two additional DCs have intermediate masters, aka local-masters, that relay replication to local replicas.

We place 3 orchestrator nodes in a raft setup, each in a different DC. Note that traffic between orchestrator nodes is very low, and cross DC latencies still conveniently support the raft communication. Also note that backend DB writes have nothing to do with cross-DC traffic and are unaffected by latencies.

Consider what happens if DC1 gets network isolated: no traffic in or out DC1

Each orchestrator nodes operates independently, and each will see a different state. DC1’s orchestrator will see all servers in DC2, DC3 as dead, but figure the master itself is fine, along with its local DC1 replicas:

However both orchestrator nodes in DC2 and DC3 will see a different picture: they will see all DC1’s servers as dead, with local masters in DC2 and DC3 having broken replication:

Who gets to choose?

In the orchestrator/raft setup, only the leader runs failovers. The leader must be part of a quorum. Hence the leader will be an orchestrator node in either DC2 or DC3. DC1’s orchestrator will know it is isolated, that it isn’t part of the quorum, hence will step down from leadership (that’s the premise of the raft consensus protocol), hence will not run recoveries.

There will be no split brain in this scenario. The orchestrator leader, be it in DC2 or DC3, will act to recover and promote a master from within DC2 or DC3. A possible outcome would be:

What if you only have 2 data centers?

In such case it is advisable to put two orchestrator nodes, one in each of your DCs, and a third orchestrator node as a mediator, in a 3rd DC, or in a different availability zone. A cloud offering should do well:

The orchestrator/raft setup plays nice and allows one to nominate a preferred leader.

SQLite

Suggested and requested by many, is to remove orchestrator‘s own dependency on a MySQL backend. orchestrator now supports a SQLite backend.

SQLite is a transactional, relational, embedded database, and as of 3.0 it is embedded within orchestrator, no external dependency required.

SQLite doesn’t replicate, doesn’t support client/server protocol. As such, it cannot work as a shared database backend. SQLite is only available on:

  • A single node setup: good for local dev installations, testing server, CI servers (indeed, SQLite now runs in orchestrator‘s CI)
  • orchestrator/raft setup, where, as noted above, backend DBs do not communicate with each other in the first place and are each dedicated to their own orchestrator node.

It should be pointed out that SQLite is a great transactional database, however MySQL is more performant. Load on backend DB is directly (and mostly linearly) affected by the number of probed servers. If you have 50 servers in your topologies or 500 servers, that matters. The probing frequency of course also matters for the write frequency on your backend DB. I would suggest if you have thousands of backend servers, to stick with MySQL. If dozens, SQLite should be good to go. In between is a gray zone, and at any case run your own experiments.

At this time SQLite is configured to commit to file; there is a different setup where SQLite places data in-memory, which makes it faster to execute. Occasional dumps required for durability. orchestrator may support this mode in the future.

orchestrator-client

You install orchestrator as a service on a few boxes; but then how do you access it from other hosts?

  • Either curl the orchestrator API
  • Or, as most do, install orchestrator-cli package, which includes the orchestrator binary, everywhere.

The latter implies:

  • Having the orchestrator binary installed everywhere, hence updated everywhere.
  • Having the /etc/orchestrator.conf.jsondeployed everywhere, along with credentials.

The orchestrator/raft setup does not support running orchestrator in command-line mode. Reason: in this mode orchestrator talks directly to the shared backend DB. There is no shared backend DB in the orchestrator/raft setup, and all communication must go through the leader service. This is a change of paradigm.

So, back to curling the HTTP API. Enter orchestrator-client which mimics the command line interface, while running curl | jq requests against the HTTP API. orchestrator-client, however, is just a shell script.

orchestrator-client will work well on either orchestrator/raft or on your existing non-raft setups. If you like, you may replace your remote orchestrator installations and your /etc/orchestrator.conf.json deployments with this script. You will need to provide the script with a hint: the $ORCHESTRATOR_API environment variable should be set to point to the orchestrator HTTP API.

Here’s the fun part:

  • You will either have a proxy on top of your orchestrator service cluster, and you would export ORCHESTRATOR_API=http://my.orchestrator.service/api
  • Or you will provide orchestrator-client with all orchestrator node identities, as in export ORCHESTRATOR_API="https://orchestrator.host1:3000/api https://orchestrator.host2:3000/api https://orchestrator.host3:3000/api" .
    orchestrator-client will figure the identity of the leader and will forward requests to the leader. At least scripting-wise, you will not require a proxy.

Status

orchestrator 3.0 is a Pre-Release. We are running a mostly-passive orchestrator/raft setup in production. It is mostly-passive in that it is not in charge of failovers yet. Otherwise it probes and analyzes our topologies, as well as runs failure detection. We will continue to improve operational aspects of the orchestrator/raft setup (see this issue).

 

]]>
https://shlomi-noach.github.io/blog/mysql/orchestratorraft-pre-release-3-0/feed 4 7740