But first, as in the past this caused some confusion: when I say I’m not using foreign keys, that does not mean I don’t JOIN tables. I still JOIN tables. I still have some id column in one table and some parent_id in another. I still use the benefits of the relational model. In a sense, I do use foreign keys. What I don’t normally use is the foreign key CONSTRAINT, i.e. the declaration of a CONSTRAINT some_fk FOREIGN KEY ... in a table’s definition.
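To make the distinction concrete, here’s a minimal sketch (table and column names are illustrative) of relying on the relational model without declaring the constraint:

```sql
-- Two related tables; parent_id references parents.id by convention only:
-- there is no FOREIGN KEY constraint in either definition.
CREATE TABLE parents (
  id INT NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE children (
  id INT NOT NULL,
  parent_id INT DEFAULT NULL,
  PRIMARY KEY (id),
  KEY parent_id_idx (parent_id)
) ENGINE=InnoDB;

-- JOINs work exactly the same with or without the constraint.
SELECT c.id, p.id
FROM children AS c
JOIN parents AS p ON p.id = c.parent_id;
```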
So here are things I consider to be broken, either specific to the MySQL implementation, or in the general concept. Some are outright deal breakers for environments I’ve worked with. Others are things to work around. In no particular order:
I think many people are unaware of this. In a way, MySQL doesn’t really support foreign keys. The InnoDB engine does. This is old history, from before InnoDB was even officially a MySQL technology, when it was developed independently as a third-party product. There was a time when MySQL sought alternative engines. There was a time when there was a plan to implement foreign keys in MySQL, above the storage engine level. But as history goes, MySQL and InnoDB became one with Oracle acquiring both, and I’m only guessing that implementing foreign keys in MySQL became a lower priority, eventually to be abandoned.
Alas, the fact that foreign keys are implemented at the storage engine level has dire consequences. The engine does not have direct access to the binary log. If you create a foreign key constraint with an ON DELETE|UPDATE action of SET NULL or CASCADE, you should be aware that cascaded operations are never written to the binary log. Consider these two tables:
CREATE TABLE `parent_table` (
  `id` int NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

CREATE TABLE `child_table` (
  `id` int NOT NULL,
  `parent_id` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `parent_id_idx` (`parent_id`),
  CONSTRAINT `child_parent_fk` FOREIGN KEY (`parent_id`) REFERENCES `parent_table` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB;

insert into parent_table values (1);
insert into child_table values (1, 1);
insert into child_table values (2, 1);
If you were to DELETE FROM parent_table WHERE id=1, then the two rows in child_table are also deleted, due to the CASCADE rule. However, only the parent_table deleted row is written to the binary log. The two child_table rows are deleted internally by the InnoDB engine. The assumption is that when a replica applies the DELETE on parent_table, the replica’s own InnoDB engine will likewise delete the two relevant child_table rows.
Fair assumption. But we lose information along the way. As Change Data Capture becomes more and more common, and as we stream changes from MySQL to other data stores, the DELETEs on child_table are never reflected and cannot be captured.
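Using the tables above, and assuming row-based binary logging:

```sql
DELETE FROM parent_table WHERE id = 1;

-- Both child rows are gone, cascaded internally by InnoDB:
SELECT COUNT(*) FROM child_table WHERE parent_id = 1;  -- returns 0

-- But the binary log only records the DELETE on parent_table.
-- A CDC pipeline tailing the binlog never sees the child_table deletes.
```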
I’ve written about this at length in the past. But even that write up is incomplete!
MySQL is pushing towards INSTANT DDL, which is a wonderful thing. With 8.0.29, even more schema change operations are supported by ALGORITHM=INSTANT. But there are still quite a lot of unsupported operations, and until such time that INSTANT DDL supports all (or at least all common) schema changes, Online Schema Change tools like gh-ost, pt-online-schema-change, and Vitess (disclaimer: I’m a Vitess maintainer and actively develop Vitess’s Online DDL) are essential when it comes to production changes.
Both Vitess and gh-ost tail the binary logs to capture changes to the table. In light of the previous section, it is impossible to run such an Online Schema Change operation on a foreign key child table that has either a SET NULL or a CASCADE rule: the changes to the table are never reflected in the binary log. pt-online-schema-change is also unable to detect those changes, as there’s nothing to invoke the triggers.
Then, please do go ahead and read The problem with MySQL foreign key constraints in Online Schema Changes, which goes deep into what it otherwise means to deal with FK constraints in Online DDL; it cannot fit in this post.
In the above table definitions, id and parent_id are int. As data grows, I might realize the choice of data type was wrong. I really should have used bigint unsigned.

Alas, it is impossible to change the data type in either parent_table or child_table:
> alter table parent_table modify column id bigint unsigned;
ERROR 3780 (HY000): Referencing column 'parent_id' and referenced column 'id' in foreign key constraint 'child_parent_fk' are incompatible.

> alter table child_table modify column parent_id bigint unsigned;
ERROR 3780 (HY000): Referencing column 'parent_id' and referenced column 'id' in foreign key constraint 'child_parent_fk' are incompatible.
It’s impossible to do that with straight DDL (never mind INSTANT), and it’s impossible to do that with Online DDL. InnoDB (not MySQL) flatly refuses to accept any change in the related columns’ data types. Well, it’s not really about changing them as it is about having an incompatibility. But then, we can’t change either. The column type changes are only made possible if we modify the child table to remove the foreign key constraint, then alter both parent and child to modify the respective column types, then re-add the foreign key constraint. That’s four different ALTER TABLE statements. Neither removing nor adding a foreign key constraint is supported by the INSTANT algorithm, so you can expect a long time in which the foreign key relationship simply does not exist!
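A sketch of that full workaround on our example tables; note that between steps 1 and 4 the relationship is entirely unenforced:

```sql
-- 1. Drop the constraint from the child table
ALTER TABLE child_table DROP FOREIGN KEY child_parent_fk;

-- 2. Modify the parent column
ALTER TABLE parent_table MODIFY COLUMN id BIGINT UNSIGNED NOT NULL;

-- 3. Modify the child column
ALTER TABLE child_table MODIFY COLUMN parent_id BIGINT UNSIGNED DEFAULT NULL;

-- 4. Re-add the constraint
ALTER TABLE child_table ADD CONSTRAINT child_parent_fk
  FOREIGN KEY (parent_id) REFERENCES parent_table (id) ON DELETE CASCADE;
```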
One of those quirks that comes with InnoDB owning the foreign key definition is that CREATE TABLE ... LIKE does not generate foreign keys. I think this is mostly an oversight. A SHOW CREATE TABLE statement does produce foreign key output, so I’m not sure why CREATE TABLE ... LIKE doesn’t. Continuing our above child_table example:
> create table child_table_copy like child_table;
Query OK, 0 rows affected (0.06 sec)

> show create table child_table_copy \G
*************************** 1. row ***************************
       Table: child_table_copy
Create Table: CREATE TABLE `child_table_copy` (
  `id` int NOT NULL,
  `parent_id` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `parent_id_idx` (`parent_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
I know this is ANSI SQL, and so I won’t fault MySQL for this. I do think this is one of those scenarios where deviating from ANSI SQL would be beneficial. A foreign key constraint has a name (if you don’t provide one, one is auto-generated for you). And that name, according to ANSI SQL, has to be unique across your schema. It means the following table conflicts with our original child_table:
CREATE TABLE `another_child_table` (
  `id` int NOT NULL,
  `parent_id` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `parent_id_idx` (`parent_id`),
  CONSTRAINT `child_parent_fk` FOREIGN KEY (`parent_id`) REFERENCES `parent_table` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB;
You can’t have two foreign key constraints both named child_parent_fk.

I never understood that limitation. See, it’s just fine that both tables have a key named parent_id_idx. No conflict there. Why do foreign keys have to have unique names?
Maybe, in ANSI SQL, foreign keys can be independent constructs, living outside the table scope. Meh, even so this could be technically solved using some sort of namespace. But, in MySQL this isn’t the case in the first place. Foreign keys are part of the table definition.
This is again just painful for Online DDL, or for any automation that tries to duplicate tables on the fly.
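So any tool or automation that duplicates child_table has to rewrite the constraint name on the fly, e.g.:

```sql
-- Identical definition, except the constraint name is changed
-- (here, suffixed) to avoid the schema-wide name collision.
CREATE TABLE `another_child_table` (
  `id` int NOT NULL,
  `parent_id` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `parent_id_idx` (`parent_id`),
  CONSTRAINT `child_parent_fk2` FOREIGN KEY (`parent_id`)
    REFERENCES `parent_table` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB;
```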
This is more of an “I wish this existed” rather than a “this is wrong”. One of the greatest benefits of foreign keys is the graph. Given a schema with foreign keys, you can formally analyze the relationships between tables. You can draw the dependency graph. It’s really educational.
What I wish for is to have a declarative-only foreign key definition. One that does not actually enforce anything. Merely indicates an association. Something like so:
CREATE TABLE `child_table` (
  `id` int NOT NULL,
  `parent_id` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `parent_id_idx` (`parent_id`),
  DECLARATIVE FOREIGN KEY (`parent_id`) REFERENCES `parent_table` (`id`)
)
The declarative foreign key could still enforce the existence of the parent table and referenced column, definition-wise, but do nothing at all to enforce relationship of data.
Anyway, just a wish.
I love that we have set foreign_key_checks. But it’s a bit inconsistent. Basically, set foreign_key_checks=0 lets you override foreign key constraints. You can do any of the following:
- INSERT data to a child table even if the parent table does not have matching values.
- With a NO ACTION/RESTRICT rule, DELETE data from a parent table even if children tables have matching rows.
- With a SET NULL/CASCADE rule, DELETE data from a parent table without even attempting to cascade the change to children tables.
- CREATE TABLE child_table that references parent_table even if parent_table does not exist.
- DROP TABLE parent_table even if child_table exists and is populated.

But, why oh why, will set foreign_key_checks=0 not let me:

- alter table parent_table modify column id bigint unsigned; (column type relationships are still enforced)
- run a RENAME TABLE statement (wishful feature, would really help Online DDL)

This one becomes obvious as your data grows. If you use foreign keys and you rely on their behavior (e.g. your app relies on a DELETE to fail if there are dependent rows in children tables), and your data set grows such that a single server does not have the write capacity, you’re in trouble.
You may attempt functional sharding. You will hopefully find two subsets of your schema’s tables that are not connected in the foreign key graph. If so, you win the day. But if it’s all connected, then you have to break some relationships. You’d have to audit your app. It previously assumed the database would take care of data integrity, and now, for some relationships, it wouldn’t.
Or you may want to have horizontal sharding. If you mean to keep foreign key constraints, that means you need to find a way to co-locate data across the entire dependency graph. Unless this was pre-designed, you will probably find this to be impossible without a major refactor.
Vitess is looking into a FOREIGN KEY implementation. It will attempt to address some of the above limitations. See https://github.com/vitessio/vitess/issues/11975 and https://github.com/vitessio/vitess/issues/12967 for some preliminary write-ups and tracking.
Both orchestrator and gh-ost are popular tools in the MySQL ecosystem. They enjoy widespread adoption and are known to be used at prominent companies. Time and again I learn of more users of these projects. I used to keep a show-off list, but I’ve since lost track.
With wide adoption comes community engagement. This comes in the form of questions (“How do I…”, “Why does this not work…”, “Is it possible to…”), issues (crashing or data integrity bugs, locking issues, performance issues, etc.), suggestions (support this or that) and finally pull requests.
At this time, there are multiple engagements per day. Between these two projects I estimate more than a full-time job’s worth of work addressing those user interactions. That’s a full-time job’s volume on top of an already existing full-time job.
Much of this work went on my employer’s time, but I have other responsibilities at work, too, and there is no room for full-time-plus work on these projects. Responding to all community requests is unsustainable and futile. Some issues are left unanswered. Some pull requests are left open.
Even more demanding than time is context. To address a user’s bug report I’d need to re-familiarize myself with 5-year-old code. That takes a toll not only in time but also in memory and context switching. As community interaction goes, a simple discussion on an issue can span multiple days. During those days I’d jump in and out of context. With multiple daily engagements this means re-familiarizing myself with different areas of the code; being able to justify a certain behavior, or have good arguments for why we should or should not change it; being able to simulate a scenario in my brain (I don’t have access to users’ environments); comprehending potential scenarios and understanding what could break as a result of what change. I don’t have, and can’t practically have, the tests to cover the myriad of scenarios, deployments, software, network and overall infrastructure in all users’ environments.
Even if I set aside designated time for community work, it still takes a toll on my daily tasks. The need to maintain a mental projection of all that’s open and all that’s to come makes it harder to free my mind and work on a new problem, to really immerse myself in thought, to create something new.
Effective immediately. I made some promises, and there’s a bunch of open issues and pull requests I intend to pursue, but going forward I’m going to disengage from further questions/requests/suggestions. I’m gonna turn off repo notifications and not get anything in my mailbox.
My intention is to step back, truly disengage, and see what happens. There’s a good chance (this happened before) that after some time I feel the itch to come back to working on these projects. Absolutely no commitments made here.
After 7 years of maintaining this project, first at Outbrain, then Booking.com, then GitHub and now at PlanetScale, I’m gonna step back and refrain from new developments, from responding to issues, from answering questions, from reviewing pull requests.
I should mention that in the past year or so, I’ve merged more community contributions than my own. That’s staggering! There are very capable contributors to this project.
In essence, the core of orchestrator hasn’t changed in a while. The main logic remains the same. I suspect orchestrator will remain effective for some time to come. I am sure some users will be alarmed by this post, and wonder whether they should keep using orchestrator or search for other solutions. I am in no position to make a suggestion. Users should carefully evaluate what’s in their best interests, what they deem to be stable and reliable software, what they deem to be supported or repairable, etc.
I co-designed and co-authored gh-ost at GitHub (announcement) as part of the database infrastructure team. We wrote gh-ost to solve a pressing issue of schema changes at GitHub, and were happy to open source it. This led to, frankly, an overwhelming response from the community, with very fast adoption. Within the first few months we received invaluable feedback, bug reports, and suggestions, all of which had a direct and positive impact on gh-ost.
I’m not working at GitHub anymore, and I’m not an official maintainer of the upstream repo anymore. I do not have the authority to merge PRs or close issues. It is as it should be: the project is owned by GitHub.
I use gh-ost as part of my current job at PlanetScale, working on OSS Vitess. Vitess utilizes gh-ost for online DDL. I am therefore an interested party in gh-ost, most specifically in ensuring it is correct and sound. For this reason, I selectively engage with users on GitHub’s repo, especially when it comes to issues I consider important for Vitess.
I do maintain a fork, where I either interact with users, or push my own changes. I collaborate with the GitHub team, contribute upstream changes I make on my fork, and pull changes downstream. The GitHub team is kind enough to accept my contributions and invest time in testing and evaluating what might be risky changes. The upstream and downstream code is mostly in sync.
Going forward I will continue to work on things critical to my current job, but otherwise I’ll be stepping away and reducing interactions. This means I will not accept pull requests or answer questions. The upstream gh-ost repo remains under GitHub’s ownership, maintained by GitHub’s engineers. It is not my place to say how the upstream project will engage with the community, and I do not presume to make suggestions.
I must say that I’m thoroughly humbled and grateful for the interactions on these projects. I hear of other OSS projects suffering abuse, but my work has seen respectful, thoughtful, empowering and inspiring user interactions. The majority of users invest time and thought in articulating an issue, or engage in respectful discussion while suggesting changes. I’ve actually “met” people through these interactions. I can only hope I paid back in the same coin.
The community also provides assistance in several forms. The simplest, and truly the most helpful, is answering questions. Some community members respond on issues, on mailing lists, or in chat rooms. Some users identify issues similar to their own, opened by other users, discuss and help each other, and share information.
Some companies and users are consistent contributors, working on issues that are both specific to their particular needs, as well as ultimately useful for the greater community.
At a previous time when I was overwhelmed with OSS/community work, two prominent companies, let’s call them S and P, stepped forward to offer actual development time: assigning their own engineers part-time for a limited period to help push things forward. I’m forever grateful for their kindness! I didn’t take those offers back then, because I didn’t have a good plan (I still don’t) for coordinating that kind of work; it felt like it would take even more effort to set up.
I don’t have a good plan for making this work, or for ensuring that it works well. I prefer that users fork orchestrator rather than bringing contributors into this repo. If a contributor does have a solid plan, you probably know where to find me.
Our discussion applies to pt-online-schema-change, gh-ost, and Vitess based migrations, or any other online schema change tool that works with a shadow/ghost table like the Facebook tools.
Online schema change tools came about as workarounds to an old problem: schema migrations in MySQL were blocking, uninterruptible, aggressive in resources, and replication unfriendly. Running a straight ALTER TABLE in production means locking your table, generating high load on the primary, and causing massive replication lag on replicas once the migration moves down the replication stream.
Yes. InnoDB supports Online DDL, where for many ALTER types your table remains unblocked throughout the migration. That’s an important improvement, but unfortunately not enough. Some migration types do not permit concurrent DML (notably changing a column’s data type, e.g. from INT to BIGINT). Migration is still aggressive and generates high load on your server. Replicas still run the migration sequentially. If your migration takes 5 hours to run concurrently on the primary, expect a 5-hour replication lag on your replica, i.e. complete loss of your fresh read capacity.
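For example, attempting an in-place column type change is rejected outright (table and column names here are illustrative):

```sql
ALTER TABLE my_table MODIFY my_column BIGINT NOT NULL, ALGORITHM=INPLACE, LOCK=NONE;
-- ERROR 1846 (0A000): ALGORITHM=INPLACE is not supported.
-- Reason: Cannot change column type INPLACE. Try ALGORITHM=COPY.
```

ALGORITHM=COPY is the blocking, table-rebuilding behavior we are trying to avoid in the first place.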
Yes. But unfortunately extremely limited. Mostly just for adding a new column. See here or again here. Instant DDL showed great promise when introduced (contributed to MySQL by the Tencent Games DBA Team) three years ago, and the hope was that MySQL would support many more types of ALTER TABLE in INSTANT DDL. At this time this has not happened yet, and we make do with what we have.
True. But you don’t need to be at Google, Facebook, or GitHub scale to feel the pain of schema changes. Any non-trivially-sized table takes time to ALTER, which results in lock/downtime. If your tables are limited to hundreds or mere thousands of small rows, you can get away with it. When your table grows, and mere dozens of MB of data is enough, ALTER becomes non-trivial in the best case, and, in my experience, an outright cause of outage in a common scenario.
In the relational model tables have relationships. A column in one table references a column in another table, so that a row in one table has a relationship with one or more rows in another table. That’s the “foreign key”. A foreign key constraint is the enforcement of that relationship: a database construct which watches over rows in different tables and ensures the relationship does not break. For example, it may prevent me from deleting a row that is in a relationship, so the related row(s) don’t become orphaned.
No, this is a technical discussion (we’re getting there, I promise). But, for context:
I’ve been working on and around schema migration for many years now, and my current work on Vitess introduces some outrageous new super powers for schema migrations, which I can’t wait to present (and if you can’t wait, either, feel free to browse the public PRs, it’s free and open source).
Every once in a while, this pops up, on twitter, on Hacker News, on internal discussions. And the question gets asked: why can’t we support foreign keys?
And so this post explains why, technically, there’s an inherent problem in supporting foreign keys in Online Schema Changes. This is not about opinions for or against foreign keys.
Yes and no. Not quite, and I’ll elaborate as we dive into the details. And, to clarify, pt-online-schema-change attempts to make the best of the situation. Back when developing gh-ost, we saw that as a non-feasible solution. pt-online-schema-change does a good job of explaining the restrictions and limitations of its foreign key support, and we will cover these, and beyond, here.
OK, let’s dive in.
Consider the following extremely simplified model. Don’t judge me on the oversimplification, we just want to address the foreign keys issue here.
CREATE TABLE country (
  id INT NOT NULL,
  name VARCHAR(255) NOT NULL,
  PRIMARY KEY (id)
);

CREATE TABLE person (
  id INT NOT NULL,
  country_id INT NOT NULL,
  name VARCHAR(255) NOT NULL,
  PRIMARY KEY(id),
  KEY country_idx (country_id),
  CONSTRAINT person_country_fk FOREIGN KEY (country_id) REFERENCES country(id) ON DELETE NO ACTION
);

CREATE TABLE company (
  id INT NOT NULL,
  country_id INT NOT NULL,
  name VARCHAR(255) NOT NULL,
  PRIMARY KEY(id),
  KEY country_idx (country_id),
  CONSTRAINT company_country_fk FOREIGN KEY (country_id) REFERENCES country(id) ON DELETE NO ACTION
);
A few notes and assumptions:

- country is a parent table in both relationships.
- person is a child table in a relationship with country.
- company is a child table in a relationship with country.
- We assume country is a small table (maybe a couple hundred rows), and that both person and company are large tables (just, large enough to be a problem).
- A foreign key constraint can only be added or dropped via an ALTER TABLE statement. This is where the turtles begin to pile up.
- If you RENAME a parent table, for example, then children’s foreign keys follow the table under its new name. This is where our pillar of turtles becomes higher.
- Our examples use ON DELETE NO ACTION (aka RESTRICT), but it doesn’t really matter to our discussion.
- You can disable foreign key checks in your session with SET FOREIGN_KEY_CHECKS=0. You can also run SET GLOBAL FOREIGN_KEY_CHECKS=0
, but this does not affect existing sessions, only ones created after your statement.

gh-ost, fb-osc, pt-online-schema-change, LHM, and Vitess’s VReplication all work by creating a “shadow” table, which I like to call the ghost table. They create the ghost table in the desired target schema, copy the original table’s data into it while also tracking and applying ongoing changes, and finally cut-over: RENAME the original table away, e.g. to _mytable_old, and RENAME the ghost table in its place, at which time it assumes production traffic.
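Tools differ in how they synchronize and lock before this point, but on MySQL the table swap itself can be a single atomic statement (names here follow the _mytable_old example):

```sql
-- Swap the two tables in one atomic operation; production traffic
-- now hits the ghost table under the original name.
RENAME TABLE mytable TO _mytable_old, _mytable_ghost TO mytable;
```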
. Or add a column. Or an index. Whichever. Let’s see what happens.
person
has a foreign key. We therefore create the ghost table with similar foreign key, a child table that references the parent country
table. Funnily, even though InnoDB’s foreign keys live inside a table scope, their names are globally unique. So we create the ghost table as follows:
CREATE TABLE _person_ghost (
  id INT NOT NULL,
  country_id INT NOT NULL,
  name VARCHAR(255) NOT NULL,
  PRIMARY KEY(id),
  KEY country_idx (country_id),
  CONSTRAINT person_country_fk2 FOREIGN KEY (country_id) REFERENCES country(id) ON DELETE NO ACTION
);
We then populate the ghost table:

- pt-online-schema-change is based on synchronous, same-transaction data copy via triggers. At any point in time, if we populate _person_ghost with a row, that row also exists in the original person table during that same transaction. This means the data we insert into _person_ghost is foreign key safe.
- gh-ost, fb-osc, and Vitess use an asynchronous approach, where they tail either the binary logs or a changelog table. It is possible that as we INSERT data into _person_ghost, that data no longer exists in person. It is possible that there’s no matching entry in country! We can overcome that by disabling foreign key checks on the session/connection that populates the ghost table. We run SET FOREIGN_KEY_CHECKS=0 and make the server (and our users!) a promise that, even while populating the table, there may be inconsistencies; we’ll figure it all out at time of cut-over.
- Finally, we cut-over, renaming _person_ghost in place of person.

What have we ended up with? Take a look:
The table person_OLD still exists, and maintains a foreign key constraint on country. Now, suppose we want to delete country number 99. We delete or update all rows in person which point to country 99. Good. We proceed to DELETE FROM country WHERE id=99. We can’t. That’s because person_OLD still has rows where country_id=99.
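With the stale constraint still in place, the delete fails with something along these lines (the `test` schema name is illustrative):

```sql
DELETE FROM country WHERE id = 99;
-- ERROR 1451 (23000): Cannot delete or update a parent row: a foreign key
-- constraint fails (`test`.`person_OLD`, CONSTRAINT `person_country_fk`
-- FOREIGN KEY (`country_id`) REFERENCES `country` (`id`))
```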
The way to drop the foreign key constraint from person_OLD is ALTER TABLE person_OLD DROP FOREIGN KEY person_country_fk. What’s that? An ALTER TABLE? Wasn’t that the thing we wanted to avoid in the first place? There was a reason we ran an online schema change! So that’s an absolute no-go.
pt-online-schema-change offers --alter-foreign-keys-method drop_swap: to get rid of the foreign key, we can drop the old table. The logic it offers is:

- DROP the original table (e.g. person)
- RENAME the ghost table in its place
the ghost table in its placeAlas, more turtles. Dropping a MySQL table is production is a cause for outage. Here’s a lengthy discussion form the gh-ost
repo. Digging my notes shows this post from 2010. This is an ancient problem where dropping a table places locks on buffer pool and on adaptive hash index, and there’s been multiple attempts to work around it. See Vitess’s table lifecycle for more.
Just a couple months ago, the MySQL 8.0.23 release notes indicated that this bug is finally solved. I can’t wait to try it out. Most of the world is not on 8.0.23 yet, and until it is, DROP is a problem.
In my personal experience, if you can’t afford to run a straight ALTER on a table, it’s likely you can’t afford to DROP it.
As the pt-online-schema-change documentation correctly points out, we cause a brief outage after we DROP the person table and before we RENAME TABLE _person_ghost TO person. This is unfortunate but, assuming DROP is instantaneous, indeed brief.
Assuming MySQL 8.0.23 with instantaneous DROP, altering a table with a child-side-only constraint is feasible. Without instantaneous DROP, the migration can be as blocking as a straight ALTER.
I regret to inform that from here things only get worse.
What happens if we naively try to ALTER TABLE country ADD COLUMN currency VARCHAR(16) NOT NULL?
We create a ghost table, we populate the ghost table, we cut-over, and… End up with:
Our naive approach fails miserably. As we RENAME TABLE country TO country_OLD, the children’s foreign keys, on person and company, followed the table entity into country_OLD. We are now in a situation where there is no active constraint on country, and we’re stuck with a legacy table that affects our production.
Other than the DROP issue discussed above, this doesn’t solve the main problem, which is that we are left with no constraint on country.
The shocking result of our naive experiment is that if we want to ALTER TABLE country, we must, concurrently somehow, also ALTER TABLE person and, concurrently somehow, ALTER TABLE company. On the children tables we need to DROP the old foreign key, and create a new foreign key that points into country_ghost.
That’s a lot to unpack.
pt-online-schema-change offers --alter-foreign-keys-method rebuild_constraints. In this method, just before we cut-over and RENAME the tables, we iterate all children and, one by one, run a straight ALTER TABLE on each of the children to DROP the old constraint and ADD the new constraint, pointing to country_ghost (imminently to be renamed to country).
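For our example schema, the per-child rebuild would look something like this sketch (the new constraint name is illustrative, since constraint names must be unique):

```sql
-- Drop the constraint pointing at the original parent, and add one
-- pointing at the ghost parent; this is a straight (blocking) ALTER.
ALTER TABLE person
  DROP FOREIGN KEY person_country_fk,
  ADD CONSTRAINT person_country_fk3
    FOREIGN KEY (country_id) REFERENCES country_ghost (id)
    ON DELETE NO ACTION;
```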
This must happen when the ghost table is in full sync with the original table, or else there can be violations. For pt-online-schema-change, which uses synchronous in-transaction trigger propagation, this works. For gh-ost, Vitess, etc., which use the asynchronous approach, this can only take place while we place a write lock on the original table.

As the pt-online-schema-change documentation correctly indicates, this makes sense only when the children are all very small tables.
This gets worse. Let’s break this down even more.
Best case is achieved when indeed all children tables are very small. Still, we need to place a lock, and either sequentially or concurrently ALTER multiple such small tables.

In my experience, on databases that aren’t trivially small, the opposite is more common: children tables are much larger than parent tables, and running a straight ALTER on children is just not feasible.
Even the best case scenario poses the complexity of recovering/rolling back from error. For example, in a normal online schema change, we set timeouts for DDLs, like the final RENAME. If something doesn’t work out, we time out the DDL, take a step back, and try cutting-over again later on. But our situation is much more complex now. While we hold a write lock, we must run multiple DDLs on the children, repointing their foreign keys from the original country table to country_ghost. What if one of those DDLs fails? We are left in a limbo state. Some of the DDLs may have succeeded. We’d need to either revert them, introducing even more DDLs (remember, we’re still holding locks), or retry the failing DDL. Those are a lot of DDLs to synchronize at the same time, even when they’re at all feasible.
In our scenario, person and company are large tables. A straight ALTER is just not feasible. We began this discussion assuming there’s a problem with ALTER in the first place.

Also, for asynchronous online schema changes the situation is much more complex, since we need to place more locks.
There’s an alluring thought. We bite, and illustrate what it would take to run an online schema change on each of the large children, concurrently to, and coordinated with, an online schema change on the parent.

We want the children to point their FK to country_ghost. So we must kick off the migration on each child after the parent’s migration creates the ghost table, and certainly before cut-over.

Initially, the parent’s ghost table is empty, or barely populated. Isn’t that a problem? Pointing to a parent table which is not even populated? Fortunately for us, we again remember we can disable foreign key checks as our OSC tool populates the child table. Sure, everything is broken at first, but we promise the server and the user that we will figure it all out at cut-over time.
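A sketch of that promise, assuming a child ghost table whose foreign key points at country_ghost:

```sql
-- In the migration's session only; other sessions keep full enforcement.
SET SESSION FOREIGN_KEY_CHECKS=0;

-- This now succeeds even if country_ghost has no row with id=99 yet.
INSERT INTO _person_ghost (id, country_id, name)
VALUES (1234, 99, 'example person');
```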
So far, looks like we have a plan. We need to catch the notification that the country_ghost table is created, and kick off an online migration on person and on company.
We absolutely can’t cut-over country before person and company are complete. That’s why we embarked on altering the children in the first place. We must have the children’s foreign keys point to country_ghost before cutting it over.
But now we need to also consider: when is it safe to cut-over person and company? It is only safe to cut-over when referential integrity is guaranteed. We remember that throughout the parent’s migration there’s no such guarantee, surely not while the table gets populated. And for asynchronous-based migrations, not even after that, because the ghost table always “lags” a bit behind the original table.
The only way to provide a referential integrity guarantee for asynchronous-based migrations is to place a write lock on the parent table (country). We bite. We lock the table for writes, and sync up country_ghost until we’re satisfied both are in complete sync. Now is logically a safe time to cut-over the children.
But notice: this is a single, unique point in time, at which we must cut-over all children, or none. And it gets worse.
In the best case scenario, we place a lock on country, sync up country_ghost, hold the lock, then iterate over all children and cut-over each. All children’s operations succeed. We cut-over the parent.
But this best case scenario depends on getting the best case scenario on each of the children, individually. Remember, an ALTER on a child table means we have to DROP the child’s old table. Recall the impact that has in production. Now multiply by n children. The ALTER on country, while holding a write lock, will need to survive a DROP of both person_OLD and company_OLD. This is the best case.
We have no room for problems. Suppose person cuts over, and we DROP person_OLD. But then company fails to cut-over: there’s a DDL timeout.
We can’t roll back. person is now committed to country_ghost. We can try cutting over company again, and again, and again; but it may keep failing. During these recurring attempts we must keep the lock on country. And try company again. Did it succeed? Phew. We can cut-over country and finally remove the lock.
But what if something really fails? Pro tip: it most certainly happens.
If person made it and company does not (its migration breaks, fails, panics, gets killed, goes into seemingly infinite deadlocks, is unable to cut-over, whichever), we’re left in an inconsistent and impossible scenario. person is committed to country_ghost, but company is still committed to country. We have to keep that lock on country and run a new migration on company! And again, and again. Meanwhile, country is locked. Oh yes, and meanwhile person is also locked. You can’t write to person because you can’t verify that related rows exist in country, because country has a WRITE lock.
I can’t stress this enough: the lock must not be released until all children tables are migrated. So, for our next turtle, what happens on a failover? We get referential integrity corruption, because locks don’t work across servers.
Remember that an OSC works by creating a ghost table and populating it until it is in sync with the original table. This effectively means requiring extra disk space at roughly the same volume as the original table.
In a perfect world, we’d have all the disk space we ever needed. In my experience we’re far from living in a perfect world. I’ve had migrations where we weren’t sure we had the disk space for a single table change.
If we are to ALTER a parent, and as a by-product ALTER all of its children at the same time, we’d need enough free disk space for the combined volume of all affected tables.
In fact, running out of disk space is one of the common reasons for failing an online schema change operation. Consider how low the tolerance is for parent-side schema migration errors. Consider that running out of disk space isn’t something that just gets solved by retrying the cut-over again, and again, … the disk space is not there.
Three migrations running concurrently will not run faster than three migrations running sequentially; that’s my experience, backed by production experiments. In my experience they actually end up taking longer, because they’re all fighting for the same resources, context switching takes its toll, and back-off intervals pile up. Maybe there’s some scenario where they could run slightly faster?
Altering our 200-row country table ends up taking hours and hours due to the large person and company tables. The time for a migration is roughly the sum of the times of all dependent migrations!
Hmmm. Maybe on country we should just run a straight ALTER. I think so; that wins! But it only wins in our particular scenario, as we see next.
The operational complexity of Online Schema Changes for parent-side foreign keys is IMO not feasible. We need to assume all child-side operations are feasible first (I’m looking at you, DROP TABLE), and we have almost zero tolerance for things going wrong. Coordinating multiple migrations is complex, and a failover at the wrong time may cause corruption.
Truly, everything discussed thus far was a simplified situation. We introduce more turtles to our story. Let’s add this table:
CREATE TABLE person_company (
  id INT NOT NULL AUTO_INCREMENT,
  person_id INT NOT NULL,
  company_id INT NOT NULL,
  start_at TIMESTAMP NOT NULL,
  end_at TIMESTAMP NULL,
  PRIMARY KEY (id),
  KEY person_idx (person_id),
  KEY company_idx (company_id),
  CONSTRAINT person_company_person_fk FOREIGN KEY (person_id) REFERENCES person (id) ON DELETE NO ACTION,
  CONSTRAINT person_company_company_fk FOREIGN KEY (company_id) REFERENCES company (id) ON DELETE NO ACTION
);
person_company is a child of person and of company. It’s actually enough that it’s a child of one of them. What’s important is that now person is both a child table and a parent table. So is company. This is a pretty common scenario in schema designs.
ALTER a table that is both a parent and a child?

We introduce no new logic here; we “just” have to combine the logic for both. Given person_company exists, if we wanted to ALTER TABLE person we’d need to:

- Handle person as a child table (implies the DROP issue and outage)
- Handle person as a parent (implies altering person_company and synchronizing the cut-over)

So how do we alter country now?

To ALTER TABLE country, we’d need to:

- Initiate the country OSC; wait till country_ghost is created
- Initiate the person OSC; wait till person_ghost is created, and
- Initiate the company OSC; wait till company_ghost is created
- Initiate the person_company OSC
- Place a lock on country. While this lock is in place:
  - Complete the person migration. Place a lock on person, and
  - Complete the company migration. Place a lock on company.
  - Cut-over person_company. DROP person_company_OLD! Unlock person_company!
  - Cut-over company. DROP company_OLD! Unlock company!
  - Cut-over person. DROP person_OLD! Unlock person!
  - Cut-over country!

And we have near zero tolerance to any failure in the above, and we can’t afford a failover during that time…
It would all be better if we could just run ALTER TABLE in MySQL and have it be truly online, with throttling, and on replicas, too. This doesn’t exist, and our alternative is mostly Online Schema Change tools, where, IMO, handling foreign key constraints on large tables is not feasible.
There’s an alternative to Online Schema Change, which is to run the ALTER on replicas. That comes with its own set of problems, and for this blog post I’ve just run out of fumes. For another time!
This was a no-slides, all-command-line walkthrough of some of orchestrator’s capabilities, highlighting refactoring, topology analysis, takeovers and failovers, and discussing a bit of scripting and HTTP API tips.
The recording is available on YouTube (also embedded on https://dbama.now.sh/#history).
To present orchestrator, I used the new shiny docker CI environment; it’s a single docker image running orchestrator, a 4-node MySQL replication topology (courtesy of dbdeployer), heartbeat injection, Consul, consul-template and HAProxy. You can run it, too! Just clone the orchestrator repo, then run:
./script/dock system
From there, you may follow the same playbook I used in the presentation, available as orchestrator-demo-playbook.sh.
Hope you find the presentation and the playbook to be useful resources.
In the past four years orchestrator was developed at GitHub, using GitHub’s environments for testing. This is very useful for testing orchestrator’s behavior within GitHub, interacting with its internal infrastructure, and validating failover behavior in a production environment. These tests and their results are not visible to the public, though.
Now that orchestrator is developed outside GitHub (that is, outside GitHub the company, not GitHub the platform), I wanted to improve the testing framework, making it visible, accessible and contributable to the community. Thankfully, the GitHub platform has much to offer on that front, and orchestrator now uses GitHub Actions more heavily for testing.
GitHub Actions provide a way to run code in a container in the context of the repository. The most common use case is to run CI tests on receiving a Pull Request. Indeed, when GitHub Actions became available, we switched from Travis CI to Actions for orchestrator’s CI.
Today, orchestrator runs three different tests:
To highlight what each does:
Based on the original CI (and possibly to be split into distinct tests), this CI Action compiles the code, runs unit tests, and runs the suite of integration tests (spinning up both MySQL and SQLite databases and running a series of tests on each backend). This CI job is the “basic” test to see that the contributed code even makes sense.
What’s new in this test is that it now produces an artifact: an orchestrator binary for Linux/amd64. This is again a feature of GitHub Actions; the artifact is kept for a couple of months or so, per the Actions retention policy. Here’s an example; by the time you read this, the binary artifact may or may not still be there.
This means you don’t actually need a development environment on your laptop to be able to build an orchestrator binary. More on this later.
Until recently this was not formalized; I’d test upgrades by deploying them internally at GitHub onto a staging environment. Now upgrades are tested per Pull Request: we spin up a container, deploy orchestrator from the master branch using both MySQL and SQLite backends, then check out the PR branch and redeploy orchestrator using the existing backends; this verifies that, at least backend-database wise, there are no upgrade errors.
At this time the test only validates that the database changes are applicable; in the future this may expand into more elaborate tests.
I’m most excited about this one. Taking ideas from our approach to testing gh-ost with dbdeployer, I created https://github.com/openark/orchestrator-ci-env, which offers a full-blown testing environment for orchestrator, including a MySQL replication topology (courtesy of dbdeployer), Consul, HAProxy and more.
This CI testing environment can also serve as a playground in your local docker setup, see shortly.
The system tests suite offers full blown cluster-wide operations such as graceful takeovers, master failovers, errant GTID transaction analysis and recovery and more. The suite utilizes the CI testing environment, breaks it, rebuilds it, validates it… Expects specific output, expects specific failure messages, specific analysis, specific outcomes.
As example, with the system tests suite, we can test the behavior of a master failover in a multi-DC, multi-region (obviously simulated) environment, where a server marked as “candidate” is lagging behind all others, with strict rules for cross-site/cross-region failovers, and still we wish to see that particular replica get promoted as master. We can test not only the topology aspect of the failover, but also the failover hooks, Consul integration and its effects, etc.
There are now multiple options for developers/contributors to build or just try out orchestrator.
As mentioned earlier, you don’t actually need a development environment. You can use orchestrator’s CI to build and generate a Linux/amd64 orchestrator binary, which you can download and deploy as you see fit.
I’ve signed up for the GitHub Codespaces beta program, and hope to make that available for orchestrator, as well.
orchestrator offers various Docker build/run environments, accessible via the script/dock script:

- alpine linux

This is the orchestrator amusement park. Run script/dock system to spawn the aforementioned CI environment used in system tests, and, on top of that, an orchestrator setup fully integrated with that system.
So that’s an orchestrator-MySQL topology-Consul-HAProxy setup, where orchestrator already has the credentials, pre-loads the MySQL topology, is pre-configured to update Consul upon failover, and has its HAProxy config populated by consul-template, with heartbeat injection, and more. It resembles the HA setup at GitHub, and in the future I expect to provide alternate setups (on top).
Once in that docker environment, one can try running relocations and failovers, test orchestrator’s behavior, etc.
GitHub recently announced GitHub Discussions: think of a Stack Overflow-like place within one’s repo to ask questions, discuss, and vote on answers. It’s expected to be available this summer. When it is, I’ll encourage the community to use it instead of today’s orchestrator-mysql Google Group and, of course, the many questions posted as Issues.
There’s been a bunch of PRs merged recently, with more to come later on. I’m grateful for all contributions. Please understand if I’m still slow to respond.
planet.mysql.com (formerly planetmysql.com) serves as a blog aggregator, collecting news and blog posts on MySQL and its ecosystem. It aggregates some vendor and team blogs, as well as “indie” blogs such as this one.
It has traditionally been the go-to place to catch up on the latest developments, or to read insightful posts. This blog itself has been aggregated in Planet MySQL for some eleven years.
Planet MySQL used to be owned by the MySQL community team. This recently changed with unwelcoming implications for the community.
I recently noticed how a blog post of mine, The state of Orchestrator, 2020 (spoiler: healthy), did not get aggregated in Planet MySQL. After a quick discussion and investigation, it was determined (and confirmed) it was filtered out because it contained the word “MariaDB”. It has later been confirmed that Planet MySQL now filters out posts indicating its competitors, such as MariaDB, PostgreSQL, MongoDB.
Planet MySQL is owned by Oracle and it is their decision to make. Yes, logic implies they would not want to publish a promotional post for a competitor. However, I wish to explain how this blind filtering negatively affects the community.
But, before that, I’d like to share that I first attempted to reach out to whoever is in charge of Planet MySQL at this time (my understanding is that this is a marketing team). Sadly, two attempts at reaching out to them individually, and another attempt at reaching out on behalf of a small group of individual contributors, yielded no response. The owners would not grant me an audience, and would not hear me out. I find it disappointing and will let others draw the morals.
We recognize that planet.mysql.com is an important information feed. It is responsible for a large share of the traffic on my blog, and no doubt on many others. Indie blog posts, or small-team blog posts, practically depend on planet.mysql.com for visibility.
And this is particularly important if you’re an open source developer who is trying to promote an open source project in the MySQL ecosystem. Without this aggregation, you will get significantly less visibility.
But, open source projects in the MySQL ecosystem do not live in MySQL vacuum, and typically target/support MySQL, Percona Server and MariaDB. As examples:
- skeema needs to recognize MariaDB features not present in MySQL
- ProxySQL needs to support MariaDB Galera queries
- orchestrator needs to support MariaDB’s GTID flavor
Consider that a blog post of the form “Project version 1.2.3 now released!” is likely to mention things like “fixed MariaDB GTID setup” or “MariaDB 10.x now supported” etc. Consider just pointing out that “PROJECT X supports MySQL, MariaDB and Percona Server”.
Consider that merely mentioning “MariaDB” gets your blog post filtered out on planet.mysql.com. This has an actual impact on open source development in the MySQL ecosystem. We will lose audience and lose adoption.
I believe the MySQL ecosystem as a whole will be negatively affected as a result, and this will circle back to MySQL itself. I believe this goes against the very interests of Oracle/MySQL.
I’ve been around the MySQL community for some 12 years now. From my observation, there is no doubt that MySQL would not thrive as it does today, without the tooling, blogs, presentations and general advice by the community.
This is more than an estimation. I happen to know that, internally at MySQL, they have used or are using open source projects from the community, projects whose blog posts get filtered out today because they mention “MariaDB”. I find that disappointing.
I have personally witnessed how open source developments broke existing barriers, enabling companies to use MySQL at greater scale, with greater velocity, and with greater stability. I was part of such companies and I’ve personally authored such tools. I’m disappointed that planet.mysql.com filters out my blog posts for those tools without granting me an audience, and I extend my disappointment on behalf of all open source project maintainers.
At this time I consider planet.mysql.com to be a marketing blog, not a community feed, and do not want to participate in its biased aggregation.
Thank you to Tom Krouper, who applied his operational engineering expertise to content publishing problems.
orchestrator. First, a quick historical review:
- At Outbrain, I created orchestrator, as https://github.com/outbrain/orchestrator. I authored several open source projects while working for Outbrain, and created orchestrator to solve discovery, visualization and simple refactoring needs. Outbrain was happy to have the project developed as a public, open source repo from day 1, and it was released under the Apache 2 license. Interestingly, the idea to develop orchestrator came after I attended Percona Live Santa Clara 2014 and watched “ChatOps: How GitHub Manages MySQL” by one Sam Lambert.
- At Booking.com, I continued developing orchestrator, pursuing better failure detection and recovery processes. Booking.com was an incredible playground and testbed for orchestrator: a massive deployment of multiple MySQL/MariaDB flavors and configurations.
- At GitHub, I developed orchestrator under GitHub’s own org, at https://github.com/github/orchestrator. It became a core component in github.com’s high availability design, running failure detection and recoveries across sites and geographical regions, with more to come. These 4+ years have been critical to orchestrator’s development and saw its widespread use. At this time I’m aware of multiple large-scale organizations using orchestrator for high availability and failovers. Some of these are GitHub, Booking.com, Shopify, Slack, Wix, Outbrain, and more. orchestrator is the underlying failover mechanism for Vitess, and is also included in Percona’s PMM. These years saw a significant increase in community adoption and contributions, in published content such as Pythian and Percona technical blog posts, and, not surprisingly, an increase in issues and feature requests.
repo under my own https://github.com/openark org. This means all issues, pull requests, releases, forks, stars and watchers have automatically transferred to the new location: https://github.com/openark/orchestrator. The old links do a “follow me” and implicitly direct to the new location. All external links to code and docs still work. I’m grateful to GitHub for supporting this transfer.
I’d like to thank all the above companies for their support of orchestrator
and of open source in general. Being able to work on the same product throughout three different companies is mind blowing and an incredible opportunity. orchestrator
of course remains open source and licensed with Apache 2. Existing Copyrights are unchanged.
As for what’s next: some personal time off; please understand if there are delays to reviews/answers. My intention is to continue developing orchestrator. Naturally, the shape of future development depends on how orchestrator meets my future work. Nothing changes in that respect: my focus on orchestrator has always been first and foremost the pressing business needs, and then community support as possible. There are some interesting ideas by prominent orchestrator users and adopters, and I’ll share more thoughts in due time.
A GTID set can be a single range, e.g. 0041e600-f1be-11e9-9759-a0369f9435dc:1-3772242, or multiple ranges, e.g. 24a83cd3-e30c-11e9-b43d-121b89fcdde6:1-103775793, 2efbcca6-7ee1-11e8-b2d2-0270c2ed2e5a:1-356487160, 46346470-6561-11e9-9ab7-12aaa4484802:1-26301153, 757fdf0d-740e-11e8-b3f2-0a474bcf1734:1-192371670, d2f5e585-62f5-11e9-82a5-a0369f0ed504:1-10047.
One of the common problems in asynchronous replication is the issue of consistent reads. I’ve just written to the master. Is the data available on a replica yet? We have iterated on this: from reading on the master, to heuristically finding up-to-date replicas based on heartbeats (see presentation and slides) via freno, and we have now settled, in some parts of our apps, on using GTIDs.
GTIDs are reliable, as any replica can give you a definitive answer to the question: have you applied a given transaction or not? Given a GTID entry, say f7b781a9-cbbd-11e9-affb-008cfa542442:12345, one may query the following on a replica:
mysql> select gtid_subset('f7b781a9-cbbd-11e9-affb-008cfa542442:12345', @@global.gtid_executed) as transaction_found;
+-------------------+
| transaction_found |
+-------------------+
| 1 |
+-------------------+
mysql> select gtid_subset('f7b781a9-cbbd-11e9-affb-008cfa542442:123450000', @@global.gtid_executed) as transaction_found;
+-------------------+
| transaction_found |
+-------------------+
| 0 |
+-------------------+
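Not shown in the post, but closely related: MySQL (5.7.5 and later) also offers a blocking variant, WAIT_FOR_EXECUTED_GTID_SET, which waits until the replica has applied a given GTID set:

```sql
-- Wait up to 5 seconds for the replica to apply the given transaction;
-- returns 0 if the set was applied within the timeout, 1 on timeout.
SELECT WAIT_FOR_EXECUTED_GTID_SET('f7b781a9-cbbd-11e9-affb-008cfa542442:12345', 5);
```
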
This is all well, but, given some INSERT or UPDATE on the master, how can I tell what’s the GTID associated with that transaction? There’s good news and bad news.
The good news: you can SET SESSION session_track_gtids = OWN_GTID. This makes the MySQL protocol return the GTID generated by your transaction. The bad news: your driver needs to support extracting that information. At GitHub we author our own Ruby driver, and have implemented the functionality to extract OWN_GTID, much like you’d extract LAST_INSERT_ID. But how does one solve that without modifying the drivers? Here’s a poor person’s solution which gives you inexact, but good enough, info. Following a write (insert, delete, create, …), run:
select gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), @@global.gtid_executed)) as master_generated_gtid;
The idea is to “clean” the executed GTID set of irrelevant entries, by filtering out all ranges that do not belong to the server you’ve just written to (the master). The number 1000000000000000 stands for “a high enough value that will never be reached in practice”. Set your own preferred value, but this value should take you beyond 300 years assuming 100,000 transactions per second.
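A quick sanity check of that arithmetic (a throwaway sketch):

```python
# How long would it take to exhaust the 1000000000000000 ceiling used in
# the query above, at a sustained 100,000 transactions per second?
ceiling = 1_000_000_000_000_000
tps = 100_000
years = (ceiling / tps) / (365 * 24 * 3600)
print(round(years))  # 317
```
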
The value you get is the range on the master itself, e.g.:
mysql> select gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), @@global.gtid_executed)) as master_generated_gtid;
+-------------------------------------------------+
| master_generated_gtid |
+-------------------------------------------------+
| dc103953-1598-11ea-82a7-008cfa5440e4:1-35807176 |
+-------------------------------------------------+
You may further parse the above to extract dc103953-1598-11ea-82a7-008cfa5440e4:35807176, if you want to hold on to the latest GTID entry. Now, this entry isn’t necessarily your own. Between the time of your write and the time of your GTID query, other writes will have taken place. But the entry you get is either your own or a later one. If you can find that entry on a replica, that means your write is included on the replica.
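If you parse that value client-side, a minimal sketch might look like this (pure string handling; the helper name is mine, and it assumes the well-formed single-UUID result produced by the query above):

```python
def latest_gtid_entry(master_gtid_set: str) -> str:
    """Turn 'uuid:1-35807176' into 'uuid:35807176' (the latest entry)."""
    uuid, ranges = master_gtid_set.strip().split(":", 1)
    # the last range is either 'N' or 'M-N'; keep its upper bound
    upper = ranges.split(":")[-1].split("-")[-1]
    return f"{uuid}:{upper}"

print(latest_gtid_entry("dc103953-1598-11ea-82a7-008cfa5440e4:1-35807176"))
# dc103953-1598-11ea-82a7-008cfa5440e4:35807176
```
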
One may wonder: why do we need to extract the value at all? Why not just select @@global.gtid_executed? Why filter on the master’s UUID only? Logically, the answer is the same if you do that. But in practice, your query may be unfortunate enough to return something like:
select @@global.gtid_executed \G
e71f0cdb-b8ef-11e9-9361-008cfa542442:1-83331,
e742d87f-dea7-11e9-be6d-008cfa542c9e:1-18485,
e7880c0e-ac54-11e9-865a-008cfa544064:1-7331973,
e82043c6-c7d9-11e9-9413-008cfa5440e4:1-61692,
e902678b-b046-11e9-a281-008cfa542c9e:1-83108,
e90d7ff9-e35e-11e9-a9a0-008cfa544064:1-18468,
e929a635-bb40-11e9-9c0d-008cfa5440e4:1-139348,
e9351610-ef1b-11e9-9db4-008cfa5440e4:1-33460918,
e938578d-dc41-11e9-9696-008cfa542442:1-18232,
e947f165-cd53-11e9-b7a1-008cfa5440e4:1-18480,
e9733f37-d537-11e9-8604-008cfa5440e4:1-18396,
e97a0659-e423-11e9-8433-008cfa542442:1-18237,
e98dc1f7-e0f8-11e9-9bbd-008cfa542c9e:1-18482,
ea16027a-d20e-11e9-9845-008cfa542442:1-18098,
ea1e1aa6-e74a-11e9-a7f2-008cfa544064:1-18450,
ea8bc1bd-dd06-11e9-a10c-008cfa542442:1-18203,
eae8c750-aaca-11e9-b17c-008cfa544064:1-85990,
eb1e41e9-af81-11e9-9ceb-008cfa544064:1-86220,
eb3c9b3b-b698-11e9-b67a-008cfa544064:1-18687,
ec6daf7e-b297-11e9-a8a0-008cfa542c9e:1-80652,
eca4af92-c965-11e9-a1f3-008cfa542c9e:1-18333,
ecd110b9-9647-11e9-a48f-008cfa544064:1-24213,
ed26890e-b10b-11e9-a79d-008cfa542c9e:1-83450,
ed92b3bf-c8a0-11e9-8612-008cfa542442:1-18223,
eeb60c82-9a3d-11e9-9ea5-008cfa544064:1-1943152,
eee43e06-c25d-11e9-ba23-008cfa542442:1-105102,
eef4a7fb-b438-11e9-8d4b-008cfa5440e4:1-74717,
eefdbd3b-95b3-11e9-833d-008cfa544064:1-39415,
ef087062-ba7b-11e9-92de-008cfa5440e4:1-9726172,
ef507ff0-98b3-11e9-8b15-008cfa5440e4:1-928030,
ef662471-9a3b-11e9-bd2e-008cfa542c9e:1-954800,
f002e9f7-97ee-11e9-bed0-008cfa542c9e:1-5180743,
f0233228-e9a1-11e9-a142-008cfa542c9e:1-18583,
f04780c4-a864-11e9-9f28-008cfa542c9e:1-83609,
f048acd9-b1d2-11e9-a0b6-008cfa544064:1-70663,
f0573d8c-9978-11e9-9f73-008cfa542c9e:1-85642135,
f0b0a37c-c89c-11e9-804c-008cfa5440e4:1-18488,
f0cfe1ac-e5af-11e9-bc09-008cfa542c9e:1-18552,
f0e4997c-cbc9-11e9-9179-008cfa542442:1-1655552,
f24e481c-b5c4-11e9-aff0-008cfa5440e4:1-83015,
f4578c4b-be6d-11e9-982e-008cfa5440e4:1-132701,
f48bce80-e99f-11e9-94f4-a0369f9432f4:1-18460,
f491adf1-9b04-11e9-bc71-008cfa542c9e:1-962823,
f5d3db74-a929-11e9-90e8-008cfa5440e4:1-75379,
f6696ba7-b750-11e9-b458-008cfa542c9e:1-83096,
f714cb4c-dab7-11e9-adb9-008cfa544064:1-18413,
f7b781a9-cbbd-11e9-affb-008cfa542442:1-18169,
f81f7729-b10d-11e9-b29b-008cfa542442:1-86820,
f88a3298-e903-11e9-88d0-a0369f9432f4:1-18548,
f9467b29-d78c-11e9-b1a2-008cfa5440e4:1-18492,
f9c08f5c-e4ea-11e9-a76c-008cfa544064:1-1667611,
fa633abf-cee3-11e9-9346-008cfa542442:1-18361,
fa8b0e64-bb42-11e9-9913-008cfa542442:1-140089,
fa92234c-cc90-11e9-b337-008cfa544064:1-18324,
fa9755eb-e425-11e9-907d-008cfa542c9e:1-1668270,
fb7843d5-eb38-11e9-a1ff-a0369f9432f4:1-1668957,
fb8ceae5-dd08-11e9-9ed3-008cfa5440e4:1-18526,
fbf9970e-bc07-11e9-9e4f-008cfa5440e4:1-136157,
fc0ffaee-98b1-11e9-8574-008cfa542c9e:1-940999,
fc9bf1e4-ee54-11e9-9ce9-008cfa542c9e:1-18189,
fca4672f-ac56-11e9-8a83-008cfa542442:1-82014,
fcebaa05-dab5-11e9-8356-008cfa542c9e:1-18490,
fd0c88b1-ad1b-11e9-bf3a-008cfa5440e4:1-75167,
fd394feb-e4e4-11e9-bd09-008cfa5440e4:1-18574,
fd687577-b048-11e9-b429-008cfa542442:1-83479,
fdb18995-a79f-11e9-a28d-008cfa542442:1-82351,
fdc72b7f-b696-11e9-ade9-008cfa544064:1-57674,
ff1f3b6b-c967-11e9-ae04-008cfa544064:1-18503,
ff6fe7dc-c186-11e9-9bb4-008cfa5440e4:1-103192,
fff9dd94-ed95-11e9-90b7-008cfa544064:1-911039
This can happen when you fail over to a new master, multiple times; it happens when you don’t recycle UUIDs, when you provision new hosts and let MySQL pick their UUIDs. Returning this amount of data per query is an excessive overhead, which is why we extract the master’s UUID only, whose entry is guaranteed to be limited in size.
I recently had the pleasure of presenting gh-mysql-rewind at FOSDEM. Video and slides are available. Consider following along with the video.
Consider a split brain scenario: a “standard” MySQL replication topology suffered network isolation, and one of the replicas was promoted as new master. Meanwhile, the old master was still receiving writes from co-located apps.
Once the network isolation is over, we have a new master and an old master, and a split-brain situation: some writes only took place on one master; others only took place on the other. What if we wanted to converge the two? What paths do we have to, say, restore the old, demoted master, as a replica of the newly promoted master?
The old master is unlikely to agree to replicate from the new master. Changes have been made. AUTO_INCREMENT values have been taken. UNIQUE constraints will fail.
A few months ago, we at GitHub had exactly this scenario. An entire data center went network isolated. Automation failed over to a 2nd DC. Masters in the isolated DC meanwhile kept receiving writes. At the end of the failover we ended up with a split brain scenario – which we expected. However, an additional, unexpected constraint forced us to fail back to the original DC.
We had to make a choice: we’ve already operated for a long time in the 2nd DC and took many writes, that we were unwilling to lose. We were OK to lose (after auditing) the few seconds of writes on the isolated DC. But, how do we converge the data?
Backups are the trivial way out, but they incur long recovery time. Shipping backup data over the network for dozens of servers takes time. Restore time, catching up with changes since backup took place, warming up the servers so that they can handle production traffic, all take time.
Could we have reduced the recovery time?
There are multiple ways to do that: local backups, local delayed replicas, snapshots… We have embarked on several. In this post I wish to outline gh-mysql-rewind, which programmatically identifies the rogue (aka “bad”) transactions on the network isolated master, rewinds/reverts them, applies some bookkeeping and restores the demoted master as a healthy replica under the newly promoted master, thereby prepared to be promoted if needed.
gh-mysql-rewind is a shell script. It utilizes multiple technologies, some of which do not speak to each other, to be able to do its magic. It assumes and utilizes the following:

- MySQL GTIDs
- Row based replication (binlog_format=ROW)
- binlog_row_image=FULL
- MariaDB’s mysqlbinlog with flashback support

Some breakdown follows.
MySQL GTIDs keep track of all transactions executed on a given server. GTIDs indicate which server (UUID) originated a write, and ranges of transaction sequences. In a clean state, only one writer will generate GTIDs, and on all the replicas we would see the same GTID set, originated with the writer’s UUID.
In a split brain scenario, we would see divergence. It is possible to use GTID_SUBTRACT(old_master-GTIDs, new-master-GTIDs) to identify the exact set of transactions executed on the old, demoted master, right after the failover. This is the essence of the split brain.
For example, assume that just before the network partition, the GTID set on the master was 00020192-1111-1111-1111-111111111111:1-5000. Assume that after the network partition the new master has a UUID of 00020193-2222-2222-2222-222222222222. It began to take writes, and after some time its GTID set showed 00020192-1111-1111-1111-111111111111:1-5000,00020193-2222-2222-2222-222222222222:1-200.

On the demoted master, other writes took place, leading to the GTID set 00020192-1111-1111-1111-111111111111:1-5042.
We will run…
SELECT GTID_SUBTRACT(
'00020192-1111-1111-1111-111111111111:1-5042',
'00020192-1111-1111-1111-111111111111:1-5000,00020193-2222-2222-2222-222222222222:1-200'
);
> '00020192-1111-1111-1111-111111111111:5001-5042'
…to identify the exact set of “bad transactions” on the demoted master.
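For intuition only, the subtraction for this single-UUID example can be sketched in code (GTID_SUBTRACT itself handles arbitrary multi-UUID, multi-range sets; this toy helper does not):

```python
def subtract_prefix(minuend: str, already_seen_upper: int) -> str:
    """Sketch: 'uuid:1-M' minus 'uuid:1-N' (N < M) yields 'uuid:N+1-M'."""
    uuid, rng = minuend.split(":")
    lo, hi = (int(x) for x in rng.split("-"))
    assert lo == 1 and already_seen_upper < hi
    return f"{uuid}:{already_seen_upper + 1}-{hi}"

print(subtract_prefix("00020192-1111-1111-1111-111111111111:1-5042", 5000))
# 00020192-1111-1111-1111-111111111111:5001-5042
```
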
With row based replication, and with FULL image format, each DML (INSERT, UPDATE, DELETE) writes to the binary log the complete row data before and after the operation. This means the binary log has enough information for us to revert the operation.
Developed by Alibaba, flashback has been incorporated into MariaDB. MariaDB’s mysqlbinlog utility supports a --flashback flag, which interprets the binary log in a special way: instead of printing out the events in the binary log in order, it prints the inverted operations in reverse order.
To illustrate, let’s assume this pseudo-code sequence of events in the binary log:
insert(1, 'a')
insert(2, 'b')
insert(3, 'c')
update(2, 'b')->(2, 'second')
update(3, 'c')->(3, 'third')
insert(4, 'd')
delete(1, 'a')
A --flashback
of this binary log would produce:
insert(1, 'a')
delete(4, 'd')
update(3, 'third')->(3, 'c')
update(2, 'second')->(2, 'b')
delete(3, 'c')
delete(2, 'b')
delete(1, 'a')
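The transformation flashback applies can be mimicked in a few lines: invert each event, then emit them in reverse order (a sketch over the pseudo-events above, not MariaDB’s actual implementation):

```python
def invert(event):
    """Invert a single row event: insert<->delete; update swaps images."""
    op, args = event
    if op == "insert":
        return ("delete", args)
    if op == "delete":
        return ("insert", args)
    before, after = args  # update: swap before/after row images
    return ("update", (after, before))

binlog = [
    ("insert", (1, "a")),
    ("insert", (2, "b")),
    ("insert", (3, "c")),
    ("update", ((2, "b"), (2, "second"))),
    ("update", ((3, "c"), (3, "third"))),
    ("insert", (4, "d")),
    ("delete", (1, "a")),
]

# flashback: inverted operations, emitted in reverse order
flashback = [invert(e) for e in reversed(binlog)]
for op, args in flashback:
    print(op, args)
```
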
Alas, MariaDB and flashback do not speak the MySQL GTID language. GTIDs are one of the major points where MySQL and MariaDB have diverged beyond compatibility. The output of MariaDB’s mysqlbinlog --flashback has no mention of GTIDs, nor does the tool take notice of GTIDs in the binary logs in the first place.
This is where we step in. GTIDs provide the information about what went wrong. flashback has the mechanism to generate the reverse sequence of statements. gh-mysql-rewind ties the two together, using MariaDB's mysqlbinlog --flashback to generate the reverse of those binary logs.
This last part is worth elaborating. We have created a time machine. We have the mechanics to make it work. But as any sci-fi fan knows, one of the most important parts of time travel is knowing ahead of time where (or rather, when) you are going to land. Are you back in the Renaissance? Or are you about to appear in the midst of the French Revolution? Better dress accordingly.
In our scenario it is not enough to move MySQL back in time to some consistent state. We want to know at what time we landed, so that we can instruct the rewound server to rejoin the replication chain as a healthy replica. In MySQL terms, we need to make MySQL "forget" everything that ever happened after the split brain: not only in terms of data (which we already did), but also in terms of GTID history.
gh-mysql-rewind will do the math to project, ahead of time, at what "time" (i.e. GTID set) our time machine will arrive. It will issue a RESET MASTER; SET GLOBAL gtid_purged='gtid-of-the-landing-time' to make our rewound MySQL consistent not only with some past dataset, but also with its own perception of the point in time where that dataset existed.
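Projecting the landing time is GTID set arithmetic: the landing set is the demoted master's history minus the errant transactions, i.e. the shared history both servers agree on. A minimal sketch using the example values from this post (the variable names are hypothetical, not the tool's code):

```python
# Illustrative sketch: compute the "landing" GTID set and the statement to
# apply it. Uses the example values from this post; names are hypothetical.

uuid = "00020192-1111-1111-1111-111111111111"
demoted_end = 5042    # demoted master executed 1-5042 on the shared UUID
errant_start = 5001   # errant transactions identified earlier: 5001-5042

# Landing set: everything up to (but excluding) the first errant transaction.
landing_set = f"{uuid}:1-{errant_start - 1}"
statement = f"RESET MASTER; SET GLOBAL gtid_purged='{landing_set}'"
print(statement)
# -> RESET MASTER; SET GLOBAL gtid_purged='00020192-1111-1111-1111-111111111111:1-5000'
```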
Some limitations are due to MariaDB's incompatibility with MySQL, some are due to the nature of MySQL DDL, and some are due to the fact that gh-mysql-rewind is a shell script. Among them: the JSON and POINT data types are not supported, and the tool requires both MySQL's mysqlbinlog as well as MariaDB's mysqlbinlog.
There are a lot of moving parts to this mechanism. A mixture of technologies that don't normally speak to each other, injection of data, prediction of ETA… How reliable is all this?
We run continuous gh-mysql-rewind testing in production to consistently prove that it works as expected. Our testing uses a non-production, dedicated, functional replica. It contaminates the data on the replica, lets gh-mysql-rewind automatically move it back in time, and joins the replica back into the healthy chain.
That's not enough. We actually create a scenario where we can predict, ahead of testing, what the time-of-arrival will be. We checksum the data on that replica at that time. After contaminating and effectively breaking replication, we expect gh-mysql-rewind to revert the changes back to our predicted point in time. We checksum the data again. We expect a 100% match.
See the video or slides for more detail on our testing setup.
At this time the tool is one of several solutions we hope to never need to employ. It is stable and tested. We are looking forward to a promising MySQL development that would provide GTID-revert capabilities using standard commands, such as SELECT undo_transaction('00020192-1111-1111-1111-111111111111:5042').
We have released gh-mysql-rewind as open source, under the MIT license. The public release is a stripped-down version of our own script, which has some GitHub-specific integration. We have general ideas about incorporating this functionality into higher-level tools.
gh-mysql-rewind is developed by the database-infrastructure team at GitHub.