MySQL – code.openark.org http://shlomi-noach.github.io/blog/ Blog by Shlomi Noach Tue, 26 May 2020 17:52:52 +0000 en-US hourly 1 https://wordpress.org/?v=5.3.3 32412571 orchestrator on DB AMA: show notes https://shlomi-noach.github.io/blog/mysql/orchestrator-on-db-ama-show-notes https://shlomi-noach.github.io/blog/mysql/orchestrator-on-db-ama-show-notes#respond Tue, 26 May 2020 17:52:52 +0000 https://shlomi-noach.github.io/blog/?p=8101 Earlier today I presented orchestrator on DB AMA. Thank you to the organizers Morgan Tocker, Liz van Dijk and Frédéric Descamps for hosting me, and thank you to all who participated!

This was a no-slides, all command-line walkthrough of some of orchestrator‘s capabilities, highlighting refactoring, topology analysis, takeovers and failovers, and discussing a bit of scripting and HTTP API tips.

The recording is available on YouTube (also embedded on https://dbama.now.sh/#history).

To present orchestrator, I used the new shiny docker CI environment; it’s a single docker image running orchestrator, a 4-node MySQL replication topology (courtesy dbdeployer), heartbeat injection, Consul, consul-template and HAProxy. You can run it, too! Just clone the orchestrator repo, then run:

./script/dock system

From there, you may follow the same playbook I used in the presentation, available as orchestrator-demo-playbook.sh.

Hope you find the presentation and the playbook to be useful resources.

]]>
https://shlomi-noach.github.io/blog/mysql/orchestrator-on-db-ama-show-notes/feed 0 8101
orchestrator: what’s new in CI, testing & development https://shlomi-noach.github.io/blog/mysql/orchestrator-whats-new-in-ci-testing-development https://shlomi-noach.github.io/blog/mysql/orchestrator-whats-new-in-ci-testing-development#respond Mon, 11 May 2020 08:01:08 +0000 https://shlomi-noach.github.io/blog/?p=8077 Recent focus on development & testing yielded with new orchestrator environments and offerings for developers and with increased reliability and trust. This post illustrates the new changes, and see Developers section on the official documentation for more details.

Testing

In the past four years orchestrator was developed at GitHub, and using GitHub’s environments for testing. This is very useful for testing orchestrator‘s behavior within GitHub, interacting with its internal infrastructure, and validating failover behavior in a production environment. These tests and their results are not visible to the public, though.

Now that orchestrator is developed outside GitHub (that is, outside GitHub the company, not GitHub the platform) I wanted to improve on the testing framework, making it visible, accessible and contribute-able to the community. Thankfully, the GitHub platform has much to offer on that front and orchestrator now uses GitHub Actions more heavily for testing.

GitHub Actions provide a way to run code in a container in the context of the repository. The most common use case is to run CI tests on receiving a Pull Request. Indeed, when GitHub Actions became available, we switched out of Travis CI and into Actions for orchestrator‘s CI.

Today, orchestrator runs three different tests:

  • Build, unit testing, integration testing, code & doc validation
  • Upgrade testing
  • System testing

To highlight what each does:

Build, unit testing, integration testing

Based on the original CI (and possibly will split into distinct tests), this CI Action compiles the code, runs unit tests, runs the suite of integration tests (spins up both MySQL and SQLite databases and runs a series of tests on each backend), this CI job is the “basic” test to see that the contributed code even makes sense.

What’s new in this test is that it now produces an artifact: an orchestrator binary for Linux/amd64. This is again a feature for GitHub Actions; the artifact is kept for a couple months or so per Actions retention policy. Here‘s an example; by the time you read this the binary artifact may or may not still be there.

This means you don’t actually need a development environment on your laptop to be able to build and orchestrator binary. More on this later.

Upgrade testing

Until recently not formalized; I’d test upgrades by deploying them internally at GitHub onto a staging environment. Now upgrades are tested per Pull Request: we spin up a container, deploy orchestrator from master branch using both MySQL and SQLite backends, then checkout the PR branch, and redeploy orchestrator using the existing backends — this verifies that at least backend-database wise, there’s not upgrade errors.

At this time the test only validates the database changes are applicable; in the future this may expand onto more elaborate tests.

System testing

I’m most excited about this one. Taking ideas from our approach to testing gh-ost with dbdeployer, I created https://github.com/openark/orchestrator-ci-env, which offers a full blown testing enviroment for orchestrator, including a MySQL replication topology (courtesy dbdeployer), Consul, HAProxy and more.

This CI testing environment can also serve as a playground in your local docker setup, see shortly.

The system tests suite offers full blown cluster-wide operations such as graceful takeovers, master failovers, errant GTID transaction analysis and recovery and more. The suite utilizes the CI testing environment, breaks it, rebuilds it, validates it… Expects specific output, expects specific failure messages, specific analysis, specific outcomes.

As example, with the system tests suite, we can test the behavior of a master failover in a multi-DC, multi-region (obviously simulated) environment, where a server marked as “candidate” is lagging behind all others, with strict rules for cross-site/cross-region failovers, and still we wish to see that particular replica get promoted as master. We can test not only the topology aspect of the failover, but also the failover hooks, Consul integration and its effects, etc.

Development

There’s now multiple options for developers/contributors to build or just try out orchestrator.

Build on GitHub

As mentioned earlier, you actually don’t need a development environment. You can use orchestrator CI to build and generate a Linux/amd64 orchestrator binary, which you can download & deploy as you see fit.

I’ve signed up for the GitHub Codespaces beta program, and hope to make that available for orchestrator, as well.

Build via Docker

orchestrator offers various Docker build/run environments, accessible via the script/dock script:

  • `script/dock alpine` will build and spawn `orchestrator` on a minimal alpine linux
  • `script/dock test` will build and run the same CI tests (unit, integration) as mentioned earlier, but on your own docker environemtn
  • `script/dock pkg` will build and generate `.rpm` and `.deb` packages

CI environment: the “full orchestrator experience”

This is the orchestrator amusement park. Run script/dock system to spawn the aforementioned CI environment used in system tests, and on top of that, an orchestrator setup fully integrated with that system.

So that’s an orchestrator-MySQL topology-Consul-HAProxy setup, where orchestrator already has the credentials for, and pre-loads the MySQL topology, pre-configured to update Consul upon failover, HAProxy config populated by consul-template, heartbeat injection, and more. It resembles the HA setup at GitHub, and in the future I expect to provide alternate setups (on top).

Once in that docker environment, one can try running relocations, failovers, test orchestrator‘s behavior, etc.

Community

GitHub recently announced GitHub Discussions ; think a stackoverflow like place within one’s repo to ask questions, discuss, vote on answers. It’s expected to be available this summer. When it does, I’ll encourage the community to use it instead of today’s orchestrator-mysql Google Group and of course the many questions posted as Issues.

There’s been a bunch of PRs merged recently, with more to come later on. I’m grateful for all contributions. Please understand if I’m still slow to respond.

]]>
https://shlomi-noach.github.io/blog/mysql/orchestrator-whats-new-in-ci-testing-development/feed 0 8077
Pulling this blog out of Planet MySQL aggregator, over community concerns https://shlomi-noach.github.io/blog/mysql/pulling-this-blog-out-of-planet-mysql-aggregator-over-community-concerns https://shlomi-noach.github.io/blog/mysql/pulling-this-blog-out-of-planet-mysql-aggregator-over-community-concerns#respond Thu, 23 Apr 2020 15:26:24 +0000 https://shlomi-noach.github.io/blog/?p=8020 I’ve decided to pull this blog (https://shlomi-noach.github.io/blog/) out of the planet.mysql.com aggregator.

planet.mysql.com (formerly planetmysql.com) serves as a blog aggregator, and collects news and blog posts on various MySQL and its ecosystem topics. It collects some vendor and team blogs as well as “indie” blogs such as this one.

It has traditionally been the go-to place to catch up on the latest developments, or to read insightful posts. This blog itself has been aggregated in Planet MySQL for some eleven years.

Planet MySQL used to be owned by the MySQL community team. This recently changed with unwelcoming implications for the community.

I recently noticed how a blog post of mine, The state of Orchestrator, 2020 (spoiler: healthy), did not get aggregated in Planet MySQL. After a quick discussion and investigation, it was determined (and confirmed) it was filtered out because it contained the word “MariaDB”. It has later been confirmed that Planet MySQL now filters out posts indicating its competitors, such as MariaDB, PostgreSQL, MongoDB.

Planet MySQL is owned by Oracle and it is their decision to make. Yes, logic implies they would not want to publish a promotional post for a competitor. However, I wish to explain how this blind filtering negatively affects the community.

But, before that, I’d like to share that I first attempted to reach out to whoever is in charge of Planet MySQL at this time (my understanding is that this is a marketing team). Sadly, two attempts at reaching out to them individually, and another attempt at reaching out on behalf of a small group of individual contributors, yielded no response. The owners would not have audience with me, and would not hear me out. I find it disappointing and will let others draw morals.

Why filtering is harmful for the community

We recognize that planet.mysql.com is an important information feed. It is responsible for a massive ratio of the traffic on my blog, and no doubt for many others. Indie blog posts, or small-team blog posts, practically depend on planet.mysql.com to get visibility.

And this is particularly important if you’re an open source developer who is trying to promote an open source project in the MySQL ecosystem. Without this aggregation, you will get significantly less visibility.

But, open source projects in the MySQL ecosystem do not live in MySQL vacuum, and typically target/support MySQL, Percona Server and MariaDB. As examples:

  • DBDeployer should understand MariaDB versioning scheme

  • skeema needs to recognize MariaDB features not present in MySQL

  • ProxySQL needs to support MariaDB Galera queries

  • orchestrator needs to support MariaDB’s GTID flavor

Consider that a blog post of the form “Project version 1.2.3 now released!” is likely to mention things like “fixed MariaDB GTID setup” or “MariaDB 10.x now supported” etc. Consider just pointing out that “PROJECT X supports MySQL, MariaDB and Percona Server”.

Consider that merely mentioning “MariaDB” gets your blog post filtered out on planet.mysql.com. This has an actual impact on open source development in the MySQL ecosystem. We will lose audience and lose adoption.

I believe the MySQL ecosystem as a whole will be negatively affected as result, and this will circle back to MySQL itself. I believe this goes against the very interests of Oracle/MySQL.

I’ve been around the MySQL community for some 12 years now. From my observation, there is no doubt that MySQL would not thrive as it does today, without the tooling, blogs, presentations and general advice by the community.

This is more than an estimation. I happen to know that, internally at MySQL, they have used or are using open source projects from the community, projects whose blog posts get filtered out today because they mention “MariaDB”. I find that disappointing.

I have personally witnessed how open source developments broke existing barriers to enable companies to use MySQL at greater scale, in greater velocity, with greater stability. I was part of such companies and I’ve personally authored such tools. I’m disappointed that planet.mysql.com filters out my blog posts for those tools and without giving me audience, and extend my disappointment for all open source project maintainers.

At this time I consider planet.mysql.com to be a marketing blog, not a community feed, and do not want to participate in its biased aggregation.

]]>
https://shlomi-noach.github.io/blog/mysql/pulling-this-blog-out-of-planet-mysql-aggregator-over-community-concerns/feed 0 8020
The state of Orchestrator, 2020 (spoiler: healthy) https://shlomi-noach.github.io/blog/mysql/the-state-of-orchestrator-2020-spoiler-healthy-2 https://shlomi-noach.github.io/blog/mysql/the-state-of-orchestrator-2020-spoiler-healthy-2#respond Tue, 18 Feb 2020 19:14:12 +0000 https://shlomi-noach.github.io/blog/?p=8016 This post serves as a pointer to my previous announcement about The state of Orchestrator, 2020.

Thank you to Tom Krouper who applied his operational engineer expertise to content publishing problems.

]]>
https://shlomi-noach.github.io/blog/mysql/the-state-of-orchestrator-2020-spoiler-healthy-2/feed 0 8016
The state of Orchestrator, 2020 (spoiler: healthy) https://shlomi-noach.github.io/blog/mysql/the-state-of-orchestrator-2020-spoiler-healthy https://shlomi-noach.github.io/blog/mysql/the-state-of-orchestrator-2020-spoiler-healthy#respond Tue, 18 Feb 2020 08:09:05 +0000 https://shlomi-noach.github.io/blog/?p=7996 Yesterday was my last day at GitHub, and this post explains what this means for orchestrator. First, a quick historical review:

  • 2014: I began work on orchestrator at Outbrain, as https://github.com/outbrain/orchestrator. I authored several open source projects while working for Outbrain, and created orchestrator to solve discovery, visualization and simple refactoring needs. Outbrain was happy to have the project developed as a public, open source repo from day 1, and it was released under the Apache 2 license. Interestingly, the idea to develop orchestrator came after I attended Percona Live Santa Clara 2014 and watched “ChatOps: How GitHub Manages MySQL” by one Sam Lambert.
  • 2015: Joined Booking.com where my main focus was to redesign and solve issues with the existing high availability setup. With Booking.com’s support, I continued work on orchestrator, pursuing better failure detection and recovery processes. Booking.com was an incredible playground and testbed for orchestrator, a massive deployment of multiple MySQL/MariaDB flavors and configuration.
  • 2016 – 2020: Joined GitHub. GitHub adopted orchestrator and I developed it under GitHub’s own org, at https://github.com/github/orchestrator. It became a core component in github.com’s high availability design, running failure detection and recoveries across sites and geographical regions, with more to come. These 4+ years have been critical to orchestrator‘s development and saw its widespread use. At this time I’m aware of multiple large-scale organizations using orchestrator for high availability and failovers. Some of these are GitHub, Booking.com, Shopify, Slack, Wix, Outbrain, and more. orchestrator is the underlying failover mechanism for vitess, and is also included in Percona’s PMM. These years saw a significant increase in community adoption and contributions, in published content, such as Pythian and Percona technical blog posts, and, not surprisingly, increase in issues and feature requests.


2020

GitHub was very kind to support moving the orchestrator repo under my own https://github.com/openark org. This means all issues, pull requests, releases, forks, stars and watchers have automatically transferred to the new location: https://github.com/openark/orchestrator. The old links do a “follow me” and implicitly direct to the new location. All external links to code and docs still work. I’m grateful to GitHub for supporting this transfer.

I’d like to thank all the above companies for their support of orchestrator and of open source in general. Being able to work on the same product throughout three different companies is mind blowing and an incredible opportunity. orchestrator of course remains open source and licensed with Apache 2. Existing Copyrights are unchanged.

As for what’s next: some personal time off, please understand if there’s delays to reviews/answers. My intention is to continue developing orchestrator. Naturally, the shape of future development depends on how orchestrator meets my future work. Nothing changes in that respect: my focus on orchestrator has always been first and foremost the pressing business needs, and then community support as possible. There are some interesting ideas by prominent orchestrator users and adopters and I’ll share more thoughts in due time.

 

]]>
https://shlomi-noach.github.io/blog/mysql/the-state-of-orchestrator-2020-spoiler-healthy/feed 0 7996
Quick hack for GTID_OWN lack https://shlomi-noach.github.io/blog/mysql/quick-hack-for-gtid_own-lack https://shlomi-noach.github.io/blog/mysql/quick-hack-for-gtid_own-lack#respond Wed, 11 Dec 2019 08:00:00 +0000 https://shlomi-noach.github.io/blog/?p=7974 One of the benefits of MySQL GTIDs is that each server remembers all GTID entries ever executed. Normally these would be ranges, e.g. 0041e600-f1be-11e9-9759-a0369f9435dc:1-3772242 or multi-ranges, e.g. 24a83cd3-e30c-11e9-b43d-121b89fcdde6:1-103775793, 2efbcca6-7ee1-11e8-b2d2-0270c2ed2e5a:1-356487160, 46346470-6561-11e9-9ab7-12aaa4484802:1-26301153, 757fdf0d-740e-11e8-b3f2-0a474bcf1734:1-192371670, d2f5e585-62f5-11e9-82a5-a0369f0ed504:1-10047.

One of the common problems in asynchronous replication is the issue of consistent reads. I’ve just written to the master. Is the data available on a replica yet? We have iterated on this, from reading on master, to heuristically finding up-to-date replicas based on heartbeats (see presentation and slides) via freno, and now settled, on some parts of our apps, to using GTID.

GTIDs are reliable as any replica can give you a definitive answer to the question: have you applied a given transaction or not?. Given a GTID entry, say f7b781a9-cbbd-11e9-affb-008cfa542442:12345, one may query for the following on a replica:

mysql> select gtid_subset('f7b781a9-cbbd-11e9-affb-008cfa542442:12345', @@global.gtid_executed) as transaction_found;
+-------------------+
| transaction_found |
+-------------------+
|                 1 |
+-------------------+

mysql> select gtid_subset('f7b781a9-cbbd-11e9-affb-008cfa542442:123450000', @@global.gtid_executed) as transaction_found;
+-------------------+
| transaction_found |
+-------------------+
|                 0 |
+-------------------+

Getting OWN_GTID

This is all well, but, given some INSERT or UPDATE on the master, how can I tell what’s the GTID associated with that transaction? There\s good news and bad news.

  • Good news is, you may SET SESSION session_track_gtids = OWN_GTID. This makes the MySQL protocol return the GTID generated by your transaction.
  • Bad news is, this isn’t a standard SQL response, and the common MySQL drivers offer you no way to get that information!

At GitHub we author our own Ruby driver, and have implemented the functionality to extract OWN_GTID, much like you’d extract LAST_INSERT_ID. But, how does one solve that without modifying the drivers? Here’s a poor person’s solution which gives you an inexact, but good enough, info. Following a write (insert, delete, create, …), run:

select gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), @@global.gtid_executed)) as master_generated_gtid;

The idea is to “clean” the executed GTID set from irrelevant entries, by filtering out all ranges that do not belong to the server you’ve just written to (the master). The number 1000000000000000 stands for “high enough value that will never be reached in practice” – set to your own preferred value, but this value should take you beyond 300 years assuming 100,000 transactions per second.

The value you get is the range on the master itself. e.g.:

mysql> select gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), @@global.gtid_executed)) as master_generated_gtid;
+-------------------------------------------------+
| master_generated_gtid                           |
+-------------------------------------------------+
| dc103953-1598-11ea-82a7-008cfa5440e4:1-35807176 |
+-------------------------------------------------+

You may further parse the above to extract dc103953-1598-11ea-82a7-008cfa5440e4:35807176 if you want to hold on to the latest GTID entry. Now, this entry isn’t necessarily your own. Between the time of your write and the time of your GTID query, other writes will have taken place. But the entry you get is either your own or a later one. If you can find that entry on a replica, that means your write is included on the replica.

One may wonder, why do we need to extract the value at all? Why not just select @@global.gtid_executed? Why filter only the master‘s UUID? Logically, the answer is the same if you do that. But in practice, your query may be unfortunate enough to return some:

select @@global.gtid_executed \G

e71f0cdb-b8ef-11e9-9361-008cfa542442:1-83331,
e742d87f-dea7-11e9-be6d-008cfa542c9e:1-18485,
e7880c0e-ac54-11e9-865a-008cfa544064:1-7331973,
e82043c6-c7d9-11e9-9413-008cfa5440e4:1-61692,
e902678b-b046-11e9-a281-008cfa542c9e:1-83108,
e90d7ff9-e35e-11e9-a9a0-008cfa544064:1-18468,
e929a635-bb40-11e9-9c0d-008cfa5440e4:1-139348,
e9351610-ef1b-11e9-9db4-008cfa5440e4:1-33460918,
e938578d-dc41-11e9-9696-008cfa542442:1-18232,
e947f165-cd53-11e9-b7a1-008cfa5440e4:1-18480,
e9733f37-d537-11e9-8604-008cfa5440e4:1-18396,
e97a0659-e423-11e9-8433-008cfa542442:1-18237,
e98dc1f7-e0f8-11e9-9bbd-008cfa542c9e:1-18482,
ea16027a-d20e-11e9-9845-008cfa542442:1-18098,
ea1e1aa6-e74a-11e9-a7f2-008cfa544064:1-18450,
ea8bc1bd-dd06-11e9-a10c-008cfa542442:1-18203,
eae8c750-aaca-11e9-b17c-008cfa544064:1-85990,
eb1e41e9-af81-11e9-9ceb-008cfa544064:1-86220,
eb3c9b3b-b698-11e9-b67a-008cfa544064:1-18687,
ec6daf7e-b297-11e9-a8a0-008cfa542c9e:1-80652,
eca4af92-c965-11e9-a1f3-008cfa542c9e:1-18333,
ecd110b9-9647-11e9-a48f-008cfa544064:1-24213,
ed26890e-b10b-11e9-a79d-008cfa542c9e:1-83450,
ed92b3bf-c8a0-11e9-8612-008cfa542442:1-18223,
eeb60c82-9a3d-11e9-9ea5-008cfa544064:1-1943152,
eee43e06-c25d-11e9-ba23-008cfa542442:1-105102,
eef4a7fb-b438-11e9-8d4b-008cfa5440e4:1-74717,
eefdbd3b-95b3-11e9-833d-008cfa544064:1-39415,
ef087062-ba7b-11e9-92de-008cfa5440e4:1-9726172,
ef507ff0-98b3-11e9-8b15-008cfa5440e4:1-928030,
ef662471-9a3b-11e9-bd2e-008cfa542c9e:1-954800,
f002e9f7-97ee-11e9-bed0-008cfa542c9e:1-5180743,
f0233228-e9a1-11e9-a142-008cfa542c9e:1-18583,
f04780c4-a864-11e9-9f28-008cfa542c9e:1-83609,
f048acd9-b1d2-11e9-a0b6-008cfa544064:1-70663,
f0573d8c-9978-11e9-9f73-008cfa542c9e:1-85642135,
f0b0a37c-c89c-11e9-804c-008cfa5440e4:1-18488,
f0cfe1ac-e5af-11e9-bc09-008cfa542c9e:1-18552,
f0e4997c-cbc9-11e9-9179-008cfa542442:1-1655552,
f24e481c-b5c4-11e9-aff0-008cfa5440e4:1-83015,
f4578c4b-be6d-11e9-982e-008cfa5440e4:1-132701,
f48bce80-e99f-11e9-94f4-a0369f9432f4:1-18460,
f491adf1-9b04-11e9-bc71-008cfa542c9e:1-962823,
f5d3db74-a929-11e9-90e8-008cfa5440e4:1-75379,
f6696ba7-b750-11e9-b458-008cfa542c9e:1-83096,
f714cb4c-dab7-11e9-adb9-008cfa544064:1-18413,
f7b781a9-cbbd-11e9-affb-008cfa542442:1-18169,
f81f7729-b10d-11e9-b29b-008cfa542442:1-86820,
f88a3298-e903-11e9-88d0-a0369f9432f4:1-18548,
f9467b29-d78c-11e9-b1a2-008cfa5440e4:1-18492,
f9c08f5c-e4ea-11e9-a76c-008cfa544064:1-1667611,
fa633abf-cee3-11e9-9346-008cfa542442:1-18361,
fa8b0e64-bb42-11e9-9913-008cfa542442:1-140089,
fa92234c-cc90-11e9-b337-008cfa544064:1-18324,
fa9755eb-e425-11e9-907d-008cfa542c9e:1-1668270,
fb7843d5-eb38-11e9-a1ff-a0369f9432f4:1-1668957,
fb8ceae5-dd08-11e9-9ed3-008cfa5440e4:1-18526,
fbf9970e-bc07-11e9-9e4f-008cfa5440e4:1-136157,
fc0ffaee-98b1-11e9-8574-008cfa542c9e:1-940999,
fc9bf1e4-ee54-11e9-9ce9-008cfa542c9e:1-18189,
fca4672f-ac56-11e9-8a83-008cfa542442:1-82014,
fcebaa05-dab5-11e9-8356-008cfa542c9e:1-18490,
fd0c88b1-ad1b-11e9-bf3a-008cfa5440e4:1-75167,
fd394feb-e4e4-11e9-bd09-008cfa5440e4:1-18574,
fd687577-b048-11e9-b429-008cfa542442:1-83479,
fdb18995-a79f-11e9-a28d-008cfa542442:1-82351,
fdc72b7f-b696-11e9-ade9-008cfa544064:1-57674,
ff1f3b6b-c967-11e9-ae04-008cfa544064:1-18503,
ff6fe7dc-c186-11e9-9bb4-008cfa5440e4:1-103192,
fff9dd94-ed95-11e9-90b7-008cfa544064:1-911039

This can happen when you fail over to a new master, multiple times; it happens when you don’t recycle UUIDs, when you provision new hosts and let MySQL pick their UUID. Returning this amount of data per query is an excessive overhead, hence why we extract the master‘s UUID only, which is guaranteed to be limited in size.

]]>
https://shlomi-noach.github.io/blog/mysql/quick-hack-for-gtid_own-lack/feed 0 7974
Un-split brain MySQL via gh-mysql-rewind https://shlomi-noach.github.io/blog/mysql/un-split-brain-mysql-via-gh-mysql-rewind https://shlomi-noach.github.io/blog/mysql/un-split-brain-mysql-via-gh-mysql-rewind#respond Tue, 05 Mar 2019 13:51:43 +0000 https://shlomi-noach.github.io/blog/?p=7928 We are pleased to release gh-mysql-rewind, a tool that allows us to move MySQL back in time, automatically identify and rewind split brain changes, restoring a split brain server into a healthy replication chain.

I recently had the pleasure of presenting gh-mysql-rewind at FOSDEM. Video and slides are available. Consider following along with the video.

Motivation

Consider a split brain scenario: a “standard” MySQL replication topology suffered network isolation, and one of the replicas was promoted as new master. Meanwhile, the old master was still receiving writes from co-located apps.

Once the network isolation is over, we have a new master and an old master, and a split-brain situation: some writes only took place on one master; others only took place on the other. What if we wanted to converge the two? What paths do we have to, say, restore the old, demoted master, as a replica of the newly promoted master?

The old master is unlikely to agree to replicate from the new master. Changes have been made. AUTO_INCREMENT values have been taken. UNIQUE constraints will fail.

A few months ago, we at GitHub had exactly this scenario. An entire data center went network isolated. Automation failed over to a 2nd DC. Masters in the isolated DC meanwhile kept receiving writes. At the end of the failover we ended up with a split brain scenario – which we expected. However, an additional, unexpected constraint forced us to fail back to the original DC.

We had to make a choice: we’ve already operated for a long time in the 2nd DC and took many writes, that we were unwilling to lose. We were OK to lose (after auditing) the few seconds of writes on the isolated DC. But, how do we converge the data?

Backups are the trivial way out, but they incur long recovery time. Shipping backup data over the network for dozens of servers takes time. Restore time, catching up with changes since backup took place, warming up the servers so that they can handle production traffic, all take time.

Could we have reduces the time for recovery?

There are multiple ways to do that: local backups, local delayed replicas, snapshots… We have embarked on several. In this post I wish to outline gh-mysql-rewind, which programmatically identifies the rogue (aka “bad”) transactions on the network isolated master, rewinds/reverts them, applies some bookkeeping and restores the demoted master as a healthy replica under the newly promoted master, thereby prepared to be promoted if needed.

General overview

gh-mysql-rewind is a shell script. It utilizes multiple technologies, some of which do not speak to each other, to be able to do the magic. It assumes and utilizes the following:

Some breakdown follows.

GTID

MySQL GTIDs keep track of all transactions executed on a given server. GTIDs indicate which server (UUID) originated a write, and ranges of transaction sequences. In a clean state, only one writer will generate GTIDs, and on all the replicas we would see the same GTID set, originated with the writer’s UUID.

In a split brain scenario, we would see divergence. It is possible to use GTID_SUBTRACT(old_master-GTIDs, new-master-GTIDs) to identify the exact set of transactions executed on the old, demoted master, right after the failover. This is the essence of the split brain.

For example, assume that just before the network partition, GTID on the master was 00020192-1111-1111-1111-111111111111:1-5000. Assume after the network partition the new master has UUID of 00020193-2222-2222-2222-222222222222. It began to take writes, and after some time its GTID set showed 00020192-1111-1111-1111-111111111111:1-5000,00020193-2222-2222-2222-222222222222:1-200.

On the demoted master, other writes took place, leading to the GTID set 00020192-1111-1111-1111-111111111111:1-5042.

We will run…

SELECT GTID_SUBTRACT(
  '00020192-1111-1111-1111-111111111111:1-5042',
  '00020192-1111-1111-1111-111111111111:1-5000,00020193-2222-2222-2222-222222222222:1-200'
);

> '00020192-1111-1111-1111-111111111111:5001-5042'

…to identify the exact set of “bad transactions” on the demoted master.

Row Based Replication

With row based replication, and with FULL image format, each DML (INSERT, UPDATE, DELETE) writes to the binary log the complete row data before and after the operation. This means the binary log has enough information for us to revert the operation.

Flashback

Developed by Alibaba, flashback has been incorporated in MariaDB. MariaDB’s mysqlbinlog utility supports a --flashback flag, which interprets the binary log in a special way. Instead of printing out the events in the binary log in order, it prints the inverted operations in reverse order.

To illustrate, let’s assume this pseudo-code sequence of events in the binary log:

insert(1, 'a')
insert(2, 'b')
insert(3, 'c')
update(2, 'b')->(2, 'second')
update(3, 'c')->(3, 'third')
insert(4, 'd')
delete(1, 'a')

A --flashback of this binary log would produce:

insert(1, 'a')
delete(4, 'd')
update(3, 'third')->(3, 'c')
update(2, 'second')->(2, 'b')
delete(3, 'c')
delete(2, 'b')
delete(1, 'a')

Alas, MariaDB and flashback do not speak MySQL GTID language. GTIDs are one of the major points where MySQL and MariaDB have diverged beyond compatibility.

The output of MariaDB’s mysqlbinlog --flashback has neither any mention of GTIDs, nor does the tool take notice of GTIDs in the binary logs in the first place.

gh-mysql-rewind

This is where we step in. GTIDs provide the information about what went wrong. flashback has the mechanism to generate the reverse sequence of statements. gh-mysql-rewind:

  • uses GTIDs to detect what went wrong
  • correlates those GTID entries with binary log files: identifies which binary logs actually contain those GTID events
  • invokes MariaDB’s mysqlbinlog --flashback to generate the reverse of those binary logs
  • injects (dummy) GTID information into the output
  • computes ETA

This last part is worth elaborating. We have created a time machine. We have the mechanics to make it work. But as any Sci-Fi fan knows, one of the most important parts of time travel is knowing ahead where (when) you are going to land. Are you back in the Renaissance? Or are you suddenly to appear on board the French Revolution? Better dress accordingly.

In our scenario it is not enough to move MySQL back in time to some consistent state. We want to know at what time we landed, so that we can instruct the rewinded server to join the replication chain as a healthy replica. In MySQL terms, we need to make MySQL “forget” everything that ever happened after the split brain: not only in terms of data (which we already did), but in terms of GTID history.

gh-mysql-rewind will do the math to project, ahead of time, at what “time” (i.e. GTID set) our time machine arrived. It will issue a `RESET MASTER; SET GLOBAL gtid_purged=’gtid-of-the-landing-time'” to make our re-winded MySQL consistent not only with some past dataset, but also with its own perception of the point in time where that dataset existed.

Limitations

Some limitations are due to MariaDB’s incompatibility with MySQL, some are due to MySQL DDL nature, some due to the fact gh-mysql-rewind is a shell script.

  • Cannot rewind DDL. DDLs are silently ignored, and will impose a problem when trying to re-apply them.
  • JSON, POINT data types are not supported.
  • The logic rewinds the MySQL server farther into the past than strictly required. This simplifies the code considerably, but imposed superfluous time to rewind+reapply, i.e. time to recover.
  • Currently, this only works one server at a time. If a group of 10 servers were network isolated together, the operation would need to run on each of these 10 servers.
  • Runs locally on each server. Requires both MySQL’s mysqlbinlog as well as MariaDB’s mysqlbinlog.

Testing

There’s lot of moving parts to this mechanism. A mixture of technologies that don’t normally speak to each other, injection of data, prediction of ETA… How reliable is all this?

We run continuous gh-mysql-rewind testing in production to consistently prove that it works as expected. Our testing uses a non-production, dedicated, functional replica. It contaminates the data on the replica. It lets gh-mysql-rewind automatically move it back in time, it joins the replica back into the healthy chain.

That’s not enough. We actually create a scenario where we can predict, ahead of testing, what the time-of-arrival will be. We checksum the data on that replica at that time. After contaminating and effectively breaking replication, we expect gh-mysql-rewind to revert the changes back to our predicted point in time. We checksum the data again. We expect 100% match.

See the video or slides for more detail on our testing setup.

Status

At this time the tool in one of several solutions we hope to never need to employ. It is stable and tested. We are looking forward to a promising MySQL development that will provide GTID-revert capabilities using standard commands, such as SELECT undo_transaction('00020192-1111-1111-1111-111111111111:5042').

We have released gh-mysql-rewind as open source, under the MIT license. The public release is a stripped down version of our own script, which has some GitHub-specific integration. We have general ideas in incorporating this functionality into higher level tools.

gh-mysql-rewind is developed by the database-infrastructure team at GitHub.

]]>
https://shlomi-noach.github.io/blog/mysql/un-split-brain-mysql-via-gh-mysql-rewind/feed 0 7928
MySQL master discovery methods, part 6: other methods https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-6-other-methods https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-6-other-methods#comments Tue, 22 May 2018 08:39:46 +0000 https://shlomi-noach.github.io/blog/?p=7854 This is the sixth in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which the replication failure detection and recovery take place. I will share orchestrator specific configuration/advice, and point out where cross DC orchestrator/raft setup plays part in discovery itself, but for the most part any recovery tool such as MHA, replication-manager, severalnines or other, is applicable.

Hard coded configuration deployment

You may use your source/config repo as master service discovery method of sorts.

The master’s identity would be hard coded into your, say, git repo, to be updated and deployed to production upon failover.

This method is simple and I’ve seen it being used by companies, in production. Noteworthy:

  • This requires a dependency of production on source availability.
    • The failover tool would need to have access to your source environment.
  • This requires a dependency of production on build/deploy flow.
    • The failover tool would need to kick build, test, deploy process.
  • Code deployment time can be long.
  • Deployment must take place on all relevant hosts, and cause for a mass refresh/reload.
    • It should interrupt processes that cannot reload themselves, such as various commonly used scripts.

Synchronous replication

This series of posts is focused on asynchronous replication, but we will do well to point out a few relevant notes on sychnronous replication (Galera, XtraDB Cluster, InnoDB Cluster).

  • Synchronous replication can act in single-writer mode or in multi-writer mode.
  • In single writer mode, apps should connect to a particular master.
    • The identity of such master can be achieved by querying the MySQL members of the cluster.
  • In multi-writer mode, apps can connect to any healthy member of the cluster.
    • This still calls for a check: is the member healthy?
  • Syncronous replication is not intended to work well cross DC.

The last bullet should perhaps be highlighted. In a cross-DC setup, and for cross-DC failovers, we are back to same requirements as with asynchronous replication, and the methods illustrated in this series of posts may apply.

  • VIPs make less sense.
  • Proxy-based solution make a lot of sense.

All posts in this series

]]>
https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-6-other-methods/feed 1 7854
MySQL master discovery methods, part 5: Service discovery & Proxy https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-5-service-discovery-proxy https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-5-service-discovery-proxy#respond Mon, 14 May 2018 08:08:32 +0000 https://shlomi-noach.github.io/blog/?p=7869 This is the fifth in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which the replication failure detection and recovery take place. I will share orchestrator specific configuration/advice, and point out where cross DC orchestrator/raft setup plays part in discovery itself, but for the most part any recovery tool such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via Service discovery and Proxy

Part 4 presented with an anti-pattern setup, where a proxy would infer the identify of the master by drawing conclusions from backend server checks. This led to split brains and undesired scenarios. The problem was the loss of context.

We re-introduce a service discovery component (illustrated in part 3), such that:

  • The app does not own the discovery, and
  • The proxy behaves in an expected and consistent way.

In a failover/service discovery/proxy setup, there is clear ownership of duties:

  • The failover tool own the failover itself and the master identity change notification.
  • The service discovery component is the source of truth as for the identity of the master of a cluster.
  • The proxy routes traffic but does not make routing decisions.
  • The app only ever connects to a single target, but should allow for a brief outage while failover takes place.

Depending on the technologies used, we can further achieve:

  • Hard cut for connections to old, demoted master M.
  • Black/hold off for incoming queries for the duration of failover.

We explain the setup using the following assumptions and scenarios:

  • All clients connect to master via cluster1-writer.example.net, which resolves to a proxy box.
  • We fail over from master M to promoted replica R.

A non planned failover illustration #1

Master M has died, the box had a power failure. R gets promoted in its place. Our recovery tool:

  • Updates service discovery component that R is the new master for cluster1.

The proxy:

  • Either actively or passively learns that R is the new master, rewires all writes to go to R.
  • If possible, kills existing connections to M.

The app:

  • Needs to know nothing. Its connections to M fail, it reconnects and gets through to R.

A non planned failover illustration #2

Master M gets network isolated for 10 seconds, during which time we failover. R gets promoted.

Everything is as before.

If the proxy kills existing connections to M, then the fact M is back alive turns meaningless. No one gets through to M. Clients were never aware of its identity anyhow, just as they are unaware of R‘s identity.

Planned failover illustration

We wish to replace the master, for maintenance reasons. We successfully and gracefully promote R.

  • In the process of promotion, M turned read-only.
  • Immediately following promotion, our failover tool updates service discovery.
  • Proxy reloads having seen the changes in service discovery.
  • Our app connects to R.

Discussion

This is a setup we use at GitHub in production. Our components are:

  • orchestrator for failover tool.
  • Consul for service discovery.
  • GLB (HAProxy) for proxy
  • Consul template running on proxy hosts:
    • listening on changes to Consul’s KV data
    • Regenerate haproxy.cfg configuration file
    • reload haproxy

As mentioned earlier, the apps need not change anything. They connect to a name that is always resolved to proxy boxes. There is never a DNS change.

At the time of failover, the service discovery component must be up and available, to catch the change. Otherwise we do not strictly require it to be up at all times.

For high availability we will have multiple proxies. Each of whom must listen on changes to K/V. Ideally the name (cluster1-writer.example.net in our example) resolves to any available proxy box.

  • This, in itself, is a high availability issue. Thankfully, managing the HA of a proxy layer is simpler than that of a MySQL layer. Proxy servers tend to be stateless and equal to each other.
  • See GLB as one example for a highly available proxy layer. Cloud providers, Kubernetes, two level layered proxies, Linux Heartbeat, are all methods to similarly achieve HA.

See also:

Sample orchestrator configuration

An orchestrator configuration would look like this:

  "ApplyMySQLPromotionAfterMasterFailover": true,
  "KVClusterMasterPrefix": "mysql/master",
  "ConsulAddress": "127.0.0.1:8500",
  "ZkAddress": "srv-a,srv-b:12181,srv-c",
  "PostMasterFailoverProcesses": [
    “/just/let/me/know about failover on {failureCluster}“,
  ],

In the above:

  • If ConsulAddress is specified, orchestrator will update given Consul setup with K/V changes.
  • At 3.0.10, ZooKeeper, via ZkAddress, is still not supported by orchestrator.
  • PostMasterFailoverProcesses is here just to point out hooks are not strictly required for the operation to run.

See orchestrator configuration documentation.

All posts in this series

]]>
https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-5-service-discovery-proxy/feed 0 7869
MySQL master discovery methods, part 4: Proxy heuristics https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-4-proxy-heuristics https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-4-proxy-heuristics#respond Thu, 10 May 2018 06:10:34 +0000 https://shlomi-noach.github.io/blog/?p=7867 Note: the method described here is an anti pattern

This is the fourth in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner by which the replication failure detection and recovery take place. I will share orchestrator specific configuration/advice, and point out where cross DC orchestrator/raft setup plays part in discovery itself, but for the most part any recovery tool such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via Proxy Heuristics

In Proxy Heuristics all clients connect to the master through a proxy. The proxy observes the backend MySQL servers and determines who the master is.

This setup is simple and easy, but is an anti pattern. I recommend against using this method, as explained shortly.

Clients are all configured to connect to, say, cluster1-writer.proxy.example.net:3306. The proxy will intercept incoming requests either based on hostname or by port. It is aware of all/some MySQL backend servers in that cluster, and will route traffic to the master M.

A simple heuristic that I’ve seen in use is: pick the server that has read_only=0, a very simple check.

Let’s take a look at how this works and what can go wrong.

A non planned failover illustration #1

Master M has died, the box had a power failure. R gets promoted in its place. Our recovery tool:

  • Fails over, but doesn’t need to run any hooks.

The proxy:

  • Knows both about M and R.
  • Notices M fails health checks (select @@global.read_only returns error since the box is down).
  • Notices R reports healthy and with read_only=0.
  • Routes all traffic to R.

Success, we’re happy.

Configuration tip

With an automated failover solution, use read_only=1 in my.cnf at all times. Only the failover solution will set a server to read_only=0.

With this configuration, when M restarts, MySQL starts up as read_only=1.

A non planned failover illustration #2

Master M gets network isolated for 10 seconds, during which time we failover. R gets promoted. Our tool:

  • Fails over, but doesn’t need to run any hooks.

The proxy:

  • Knows both about M and R.
  • Notices M fails health checks (select @@global.read_only returns error since the box is down).
  • Notices R reports healthy and with read_only=0.
  • Routes all traffic to R.
  • 10 seconds later M comes back to life, claiming read_only=0.
  • The proxy now sees two servers reporting as healthy and with read_only=0.
  • The proxy has no context. It does not know why both are reporting the same. It is unaware of failovers. All it sees is what the backend MySQL servers report.

Therein lies the problem: you can not trust multiple servers (MySQL backends) to deterministically pick a leader (the master) without them collaborating on some elaborate consensus communication.

A non planned failover illustration #3

Master M box is overloaded, issuing too many connections for incoming connections.

Our tool decides to failover.

  • And doesn’t need to run any hooks.

The proxy:

  • Notices M fails health checks (select @@global.read_only does not respond because of the load).
  • Notices R reports healthy and with read_only=0.
  • Routes all traffic to R.
  • Shortly followed by M recovering (since no more writes are sent its way), claiming read_only=0.
  • The proxy now sees two servers reporting as healthy and with read_only=0.

Again, the proxy has no context, and neither do M and R, for that matter. The context (the fact we failed over from M to R) was known to our failover tool, but was lost along the way.

Planned failover illustration

We wish to replace the master, for maintenance reasons. We successfully and gracefully promote R.

  • M is available and responsive, we set it to read_only=1.
  • We set R to read_only=0.
  • All new connections route to R.
  • We should also instruct our Proxy to kill all previous connections to M.

This works very nicely.

Discussion

There is a substantial risk to this method. Correlation between failover and network partitioning/load (illustrations #2 and #3) is reasonable.

The root of the problem is that we expect individual servers to resolve conflicts without speaking to each other: we expect the MySQL servers to correctly claim “I’m the master” without context.

We then add to that problem by using the proxy to “pick a side” without giving it any context, either.

Sample orchestrator configuration

By way of discouraging use of this method I do not present an orchestrator configuration file.

All posts in this series

]]>
https://shlomi-noach.github.io/blog/mysql/mysql-master-discovery-methods-part-4-proxy-heuristics/feed 0 7867