But first, as in the past this caused some confusion: when I say I’m not using foreign keys, that does not mean I don’t JOIN tables. I still JOIN tables. I still have some id column in one table and some parent_id in another. I still use the benefits of the relational model. In a sense, I do use foreign keys. What I don’t normally use is the foreign key CONSTRAINT, i.e. the declaration of a CONSTRAINT some_fk FOREIGN KEY ... in a table’s definition.
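To make the distinction concrete, here’s a minimal sketch (table and column names are illustrative) of relying on the relational model without declaring the constraint:

```sql
-- Two related tables; parent_id references parents.id by convention only:
-- there is no FOREIGN KEY constraint in either definition.
CREATE TABLE parents (
  id INT NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE children (
  id INT NOT NULL,
  parent_id INT DEFAULT NULL,
  PRIMARY KEY (id),
  KEY parent_id_idx (parent_id)
) ENGINE=InnoDB;

-- JOINs work exactly the same with or without the constraint.
SELECT c.id, p.id
FROM children AS c
JOIN parents AS p ON p.id = c.parent_id;
```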
So here are things I consider to be broken, either specific to the MySQL implementation, or in the general concept. Some are outright deal breakers for environments I’ve worked with. Others are things to work around. In no particular order:
I think many people are unaware of this. In a way, MySQL doesn’t really support foreign keys. The InnoDB engine does. This is old history, from before InnoDB was even officially a MySQL technology, when it was developed independently as a third-party product. There was a time when MySQL sought alternative engines. There was a time when there was a plan to implement foreign keys in MySQL, above the storage engine level. But as history goes, MySQL and InnoDB became one with Oracle acquiring both, and I’m only guessing that implementing foreign keys in MySQL became a lower priority, eventually to be abandoned.
Alas, the fact that foreign keys are implemented at the storage engine level has dire consequences. The engine does not have direct access to the binary log. If you create a foreign key constraint with an ON DELETE|UPDATE action of SET NULL or CASCADE, you should be aware that cascaded operations are never written to the binary log. Consider these two tables:
CREATE TABLE `parent_table` (
  `id` int NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

CREATE TABLE `child_table` (
  `id` int NOT NULL,
  `parent_id` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `parent_id_idx` (`parent_id`),
  CONSTRAINT `child_parent_fk` FOREIGN KEY (`parent_id`) REFERENCES `parent_table` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB;

insert into parent_table values (1);
insert into child_table values (1, 1);
insert into child_table values (2, 1);
If you were to DELETE FROM parent_table WHERE id=1, then the two rows in child_table are also deleted, due to the CASCADE rule. However, only the parent_table deleted row is written to the binary log. The two child_table rows are deleted internally by the InnoDB engine. The assumption is that when a replica applies the DELETE on parent_table, the replica’s own InnoDB engine will likewise delete the two relevant child_table rows.
Fair assumption. But we lose information along the way. As Change Data Capture becomes more and more common, and as we stream changes from MySQL to other data stores, the DELETEs on child_table are never reflected and cannot be captured.
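Using the tables above, and assuming row-based binary logging:

```sql
DELETE FROM parent_table WHERE id = 1;

-- Both child rows are gone, cascaded internally by InnoDB:
SELECT COUNT(*) FROM child_table WHERE parent_id = 1;  -- returns 0

-- But the binary log only records the DELETE on parent_table.
-- A CDC pipeline tailing the binlog never sees the child_table deletes.
```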
I’ve written about this at length in the past. But even that write up is incomplete!
MySQL is pushing towards INSTANT DDL, which is a wonderful thing. With 8.0.29, even more schema change operations are supported by ALGORITHM=INSTANT. But there are still quite a lot of unsupported operations, and until such time that INSTANT DDL supports all (or at least all common) schema changes, Online Schema Change tools like gh-ost, pt-online-schema-change, and Vitess (disclaimer: I’m a Vitess maintainer and actively develop Vitess’s Online DDL) are essential when it comes to production changes.
Both Vitess and gh-ost tail the binary logs to capture changes to the table. In light of the previous section, it is impossible to run such an Online Schema Change operation on a foreign key child table that has either a SET NULL or a CASCADE rule: the changes to the table are never reflected in the binary log. pt-online-schema-change is also unable to detect those changes, as there’s nothing to invoke the triggers.
Then, please do go ahead and read The problem with MySQL foreign key constraints in Online Schema Changes, which goes deep into what it otherwise means to deal with FK constraints in Online DDL; it cannot fit in this post.
In the above table definitions, id and parent_id are int. As data grows, I might realize the choice of data type was wrong. I really should have used bigint unsigned.

Alas, it is impossible to change the data type in either parent_table or child_table:
> alter table parent_table modify column id bigint unsigned;
ERROR 3780 (HY000): Referencing column 'parent_id' and referenced column 'id' in foreign key constraint 'child_parent_fk' are incompatible.

> alter table child_table modify column parent_id bigint unsigned;
ERROR 3780 (HY000): Referencing column 'parent_id' and referenced column 'id' in foreign key constraint 'child_parent_fk' are incompatible.
It’s impossible to do that with straight DDL (never mind INSTANT), and it’s impossible to do that with Online DDL. InnoDB (not MySQL) flatly refuses to accept any change in the related columns’ data types. Well, it’s not really about changing them as it is about having an incompatibility. But then, we can’t change either. The column type changes are only made possible if we modify the child table to remove the foreign key constraint, then alter both parent and child to modify the respective column types, then re-add the foreign key constraint. That’s four different ALTER TABLE statements. Neither removing nor adding a foreign key constraint is supported by the INSTANT algorithm, so you can expect a long time in which the foreign key relationship simply does not exist!
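A sketch of that full workaround on our example tables; note that between steps 1 and 4 the relationship is entirely unenforced:

```sql
-- 1. Drop the constraint from the child table
ALTER TABLE child_table DROP FOREIGN KEY child_parent_fk;

-- 2. Modify the parent column
ALTER TABLE parent_table MODIFY COLUMN id BIGINT UNSIGNED NOT NULL;

-- 3. Modify the child column
ALTER TABLE child_table MODIFY COLUMN parent_id BIGINT UNSIGNED DEFAULT NULL;

-- 4. Re-add the constraint
ALTER TABLE child_table ADD CONSTRAINT child_parent_fk
  FOREIGN KEY (parent_id) REFERENCES parent_table (id) ON DELETE CASCADE;
```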
One of those quirks that comes with InnoDB owning the foreign key definition is that CREATE TABLE ... LIKE does not generate foreign keys. I think this is mostly an oversight. A SHOW CREATE TABLE statement does produce foreign key output, so I’m not sure why CREATE TABLE ... LIKE doesn’t. Continuing our above child_table example:
> create table child_table_copy like child_table;
Query OK, 0 rows affected (0.06 sec)

> show create table child_table_copy \G
*************************** 1. row ***************************
       Table: child_table_copy
Create Table: CREATE TABLE `child_table_copy` (
  `id` int NOT NULL,
  `parent_id` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `parent_id_idx` (`parent_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
I know this is ANSI SQL, and so I won’t fault MySQL for this. I do think this is one of those scenarios where deviating from ANSI SQL would be beneficial. A foreign key constraint has a name (if you don’t provide one, one is auto-generated for you). And that name, according to ANSI SQL, has to be unique across your schema. It means the following table conflicts with our original child_table:
CREATE TABLE `another_child_table` (
  `id` int NOT NULL,
  `parent_id` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `parent_id_idx` (`parent_id`),
  CONSTRAINT `child_parent_fk` FOREIGN KEY (`parent_id`) REFERENCES `parent_table` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB;
You can’t have two foreign key constraints both named child_parent_fk.

I never understood that limitation. See, it’s just fine that both tables have a key named parent_id_idx. No conflict there. Why do foreign keys have to have unique names?
Maybe, in ANSI SQL, foreign keys can be independent constructs, living outside the table scope. Meh, even so this could be technically solved using some sort of namespace. But, in MySQL this isn’t the case in the first place. Foreign keys are part of the table definition.
This is again just painful for Online DDL, or for any automation that tries to duplicate tables on the fly.
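So any tool or automation that duplicates child_table has to rewrite the constraint name on the fly, e.g.:

```sql
-- Identical definition, except the constraint name is changed
-- (here, suffixed) to avoid the schema-wide name collision.
CREATE TABLE `another_child_table` (
  `id` int NOT NULL,
  `parent_id` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `parent_id_idx` (`parent_id`),
  CONSTRAINT `child_parent_fk2` FOREIGN KEY (`parent_id`)
    REFERENCES `parent_table` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB;
```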
This is more of an “I wish this existed” rather than a “this is wrong”. One of the greatest benefits of foreign keys is the graph. Given a schema with foreign keys, you can formally analyze the relationships between tables. You can draw the dependency graph. It’s really educational.
What I wish for is to have a declarative-only foreign key definition. One that does not actually enforce anything. Merely indicates an association. Something like so:
CREATE TABLE `child_table` (
  `id` int NOT NULL,
  `parent_id` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `parent_id_idx` (`parent_id`),
  DECLARATIVE FOREIGN KEY (`parent_id`) REFERENCES `parent_table` (`id`)
)
The declarative foreign key could still enforce the existence of the parent table and referenced column, definition-wise, but do nothing at all to enforce relationship of data.
Anyway, just a wish.
I love that we have set foreign_key_checks. But it’s a bit inconsistent. Basically, set foreign_key_checks=0 lets you override foreign key constraints. You can do any of the following:
- INSERT data to a child table even if the parent table does not have matching values.
- With a NO ACTION/RESTRICT rule, DELETE data from a parent table even if children tables have matching rows.
- With a SET NULL/CASCADE rule, DELETE data from a parent table without even attempting to cascade the change to children tables.
- CREATE TABLE child_table that references parent_table even if parent_table does not exist.
- DROP TABLE parent_table even if child_table exists and is populated.

But, why oh why, will set foreign_key_checks=0 not let me:

- alter table parent_table modify column id bigint unsigned; (column type relationships are still enforced)
- run a RENAME TABLE statement (wishful feature, would really help Online DDL)

This one becomes obvious as your data grows. If you use foreign keys and you rely on their behavior (e.g. your app relies on a DELETE to fail if there are dependent rows in children tables), and your data set grows such that a single server does not have the write capacity, you’re in trouble.
You may attempt functional sharding. You will hopefully find two subsets of your schema’s tables that are not connected in the foreign key graph. If so, you win the day. But if it’s all connected, then you have to break some relationships. You’d have to audit your app. It previously assumed the database would take care of data integrity, and now, for some relationships, it wouldn’t.
Or you may want to have horizontal sharding. If you mean to keep foreign key constraints, that means you need to find a way to co-locate data across the entire dependency graph. Unless this was pre-designed, you will probably find this to be impossible without a major refactor.
Vitess is looking into a FOREIGN KEY implementation. It will attempt to address some of the above limitations. See https://github.com/vitessio/vitess/issues/11975 and https://github.com/vitessio/vitess/issues/12967 for some preliminary write-ups and tracking.
Both orchestrator and gh-ost are popular tools in the MySQL ecosystem. They enjoy widespread adoption and are known to be used at prominent companies. Time and again I learn of more users of these projects. I used to keep a show-off list, but I’ve since lost track.
With wide adoption comes community engagement. This comes in the form of questions (“How do I…”, “Why does this not work…”, “Is it possible to…”), issues (crashing or data integrity bugs, locking issues, performance issues, etc.), suggestions (support this or that) and finally pull requests.
At this time, there are multiple engagements per day. Between these two projects I estimate more than a full-time job’s worth of work addressing those user interactions. That’s a full-time job’s volume on top of an already existing full-time job.
Much of this work went on my employer’s time, but I have other responsibilities at work, too, and there is no room for full-time-plus work on these projects. Responding to all community requests is unsustainable and futile. Some issues are left unanswered. Some pull requests are left open.
Even more demanding than time is context. To address a user’s bug report I’d need to re-familiarize myself with 5-year-old code. That takes a toll not only in time but also in memory and context switching. As community interaction goes, a simple discussion on an issue can span multiple days. During those days I’d jump in and out of context. With multiple daily engagements this means re-familiarizing myself with different areas of the code; being able to justify a certain behavior, or have good arguments for why we should or should not change it; being able to simulate a scenario in my brain (I don’t have access to users’ environments); comprehending potential scenarios and understanding what could break as a result of what change. I don’t have, and can’t practically have, the tests to cover the myriad of scenarios, deployments, software, network and overall infrastructure in all users’ environments.
Even if I set aside designated time for community work, it still takes a toll on my daily tasks. The need to maintain a mental projection of all that’s open and all that’s to come makes it harder to free my mind and work on a new problem, to really immerse myself in thought, to create something new.
Effective immediately. I made some promises, and there’s a bunch of open issues and pull requests I intend to pursue, but going forward I’m going to disengage from further questions/requests/suggestions. I’m gonna turn off repo notifications and not get anything in my mailbox.
My intention is to step back, truly disengage, and see what happens. There’s a good chance (this happened before) that after some time I feel the itch to come back to working on these projects. Absolutely no commitments made here.
After 7 years of maintaining this project, first at Outbrain, then Booking.com, then GitHub and now at PlanetScale, I’m gonna step back and refrain from new developments, from responding to issues, from answering questions, from reviewing pull requests.
I should mention that in the past year or so, I’ve merged more community contributions than my own. That’s staggering! There are very capable contributors to this project.
In essence, the core of orchestrator hasn’t changed in a while. The main logic remains the same. I suspect orchestrator will remain effective for some time to come. I am sure some users will be alarmed by this post, and wonder whether they should keep using orchestrator or search for other solutions. I am in no position to make a suggestion. Users should carefully evaluate what’s in their best interests, what they deem to be stable and reliable software, what they deem to be supported or repairable, etc.
I co-designed and co-authored gh-ost at GitHub (announcement) as part of the database infrastructure team. We wrote gh-ost to solve a pressing issue of schema changes at GitHub, and were happy to open source it. This led to, frankly, an overwhelming response from the community, with very fast adoption. Within the first few months we received invaluable feedback, bug reports, and suggestions, all of which had a direct and positive impact on gh-ost.
I’m not working at GitHub anymore, and I’m not an official maintainer of the upstream repo anymore. I do not have the authority to merge PRs or close issues. It is as it should be: the project is owned by GitHub.
I use gh-ost as part of my current job at PlanetScale, working on OSS Vitess. Vitess utilizes gh-ost for online DDL. I am therefore an interested party in gh-ost, most specifically in ensuring it is correct and sound. For this reason, I selectively engage with users on GitHub’s repo, especially when it comes to issues I consider important for Vitess.
I do maintain a fork, where I either interact with users, or push my own changes. I collaborate with the GitHub team, contribute upstream changes I make on my fork, and pull changes downstream. The GitHub team is kind enough to accept my contributions and invest time in testing and evaluating what might be risky changes. The upstream and downstream code is mostly in sync.
Going forward I will continue to work on things critical to my current job, but otherwise I’ll be stepping away and reducing interactions. This means I will not accept pull requests or answer questions. The upstream gh-ost repo remains under GitHub’s ownership, maintained by GitHub’s engineers. It is not my place to say how the upstream project will engage with the community, and I do not presume to make suggestions.
I must say that I’m thoroughly humbled and grateful for the interactions on these projects. I hear of other OSS projects suffering abuse, but my work has seen respectful, thoughtful, empowering and inspiring user interactions. The majority of users invest time and thought in articulating an issue, or engage in respectful discussion while suggesting changes. I’ve actually “met” people through these interactions. I can only hope I paid back in the same coin.
The community also provides assistance in several forms. The simplest, and truly the most helpful, is answering questions. Some community members respond on issues, on mailing lists, or in chat rooms. Some users identify issues similar to their own, opened by other users, discuss and help each other, and share information.
Some companies and users are consistent contributors, working on issues that are both specific to their particular needs, as well as ultimately useful for the greater community.
At a previous time when I was overwhelmed with OSS/community work, two prominent companies, let’s call them S and P, stepped forward to offer actual development time: assigning their own engineers part-time for a limited period to help push things forward. I’m forever grateful for their kindness! I didn’t take those offers back then, because I didn’t have a good plan (I still don’t) for coordinating that kind of work; it felt like it would take even more effort to set up.
I don’t have a good plan for making this work, or for ensuring that it works well. I prefer that users fork orchestrator rather than bringing contributors into this repo. If a contributor does have a solid plan, you probably know where to find me.
Our discussion applies to pt-online-schema-change, gh-ost, and Vitess based migrations, or any other online schema change tool that works with a shadow/ghost table like the Facebook tools.
Online schema change tools came about as workarounds to an old problem: schema migrations in MySQL were blocking, uninterruptible, aggressive in resources, and replication unfriendly. Running a straight ALTER TABLE in production means locking your table, generating high load on the primary, and causing massive replication lag on replicas once the migration moves down the replication stream.
Yes. InnoDB supports Online DDL, where for many ALTER types your table remains unblocked throughout the migration. That’s an important improvement, but unfortunately not enough. Some migration types do not permit concurrent DML (notably changing a column’s data type, e.g. from INT to BIGINT). Migration is still aggressive and generates high load on your server. Replicas still run the migration sequentially. If your migration takes 5 hours to run concurrently on the primary, expect a 5-hour replication lag on your replica, i.e. complete loss of your fresh read capacity.
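For example, attempting an in-place column type change is rejected outright (table and column names here are illustrative):

```sql
ALTER TABLE my_table MODIFY my_column BIGINT NOT NULL, ALGORITHM=INPLACE, LOCK=NONE;
-- ERROR 1846 (0A000): ALGORITHM=INPLACE is not supported.
-- Reason: Cannot change column type INPLACE. Try ALGORITHM=COPY.
```

ALGORITHM=COPY is the blocking, table-rebuilding behavior we are trying to avoid in the first place.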
Yes. But unfortunately extremely limited. Mostly just for adding a new column. See here or again here. Instant DDL showed great promise when introduced (contributed to MySQL by the Tencent Games DBA Team) three years ago, and the hope was that MySQL would support many more types of ALTER TABLE in INSTANT DDL. At this time this has not happened yet, and we make do with what we have.
True. But you don’t need to be at Google, Facebook, or GitHub scale to feel the pain of schema changes. Any non-trivially-sized table takes time to ALTER, which results in lock/downtime. If your tables are limited to hundreds or mere thousands of small rows, you can get away with it. When your table grows, and mere dozens of MB of data is enough, ALTER becomes non-trivial in the best case, and, in my experience, an outright cause of outage in a common scenario.
In the relational model tables have relationships. A column in one table references a column in another table, so that a row in one table has a relationship with one or more rows in another table. That’s the “foreign key”. A foreign key constraint is the enforcement of that relationship: a database construct which watches over rows in different tables and ensures the relationship does not break. For example, it may prevent me from deleting a row that is in a relationship, so the related row(s) don’t become orphaned.
No, this is a technical discussion (we’re getting there, I promise). But, for context:
I’ve been working on and around schema migration for many years now, and my current work on Vitess introduces some outrageous new super powers for schema migrations, which I can’t wait to present (and if you can’t wait, either, feel free to browse the public PRs, it’s free and open source).
Every once in a while, this pops up, on twitter, on Hacker News, on internal discussions. And the question gets asked: why can’t we support foreign keys?
And so this post explains why, technically, there’s an inherent problem in supporting foreign keys in Online Schema Changes. This is not about opinions for or against foreign keys.
Yes and no. Not quite, and I’ll elaborate as we dive into the details. And, to clarify, pt-online-schema-change attempts to make the best of the situation. Back when developing gh-ost, we saw that as a non-feasible solution. pt-online-schema-change does a good job of explaining the restrictions and limitations of its foreign key support, and we will cover these, and beyond, here.
OK, let’s dive in.
Consider the following extremely simplified model. Don’t judge me on the oversimplification, we just want to address the foreign keys issue here.
CREATE TABLE country (
  id INT NOT NULL,
  name VARCHAR(255) NOT NULL,
  PRIMARY KEY (id)
);

CREATE TABLE person (
  id INT NOT NULL,
  country_id INT NOT NULL,
  name VARCHAR(255) NOT NULL,
  PRIMARY KEY(id),
  KEY country_idx (country_id),
  CONSTRAINT person_country_fk FOREIGN KEY (country_id) REFERENCES country(id) ON DELETE NO ACTION
);

CREATE TABLE company (
  id INT NOT NULL,
  country_id INT NOT NULL,
  name VARCHAR(255) NOT NULL,
  PRIMARY KEY(id),
  KEY country_idx (country_id),
  CONSTRAINT company_country_fk FOREIGN KEY (country_id) REFERENCES country(id) ON DELETE NO ACTION
);
A few notes and assumptions:

- country is a parent table in both relationships.
- person is a child table in a relationship with country.
- company is a child table in a relationship with country.
- We assume country is a small table (maybe a couple hundred rows), and that both person and company are large tables (just, large enough to be a problem).
- A foreign key constraint can only be added or dropped via an ALTER TABLE statement. This is where the turtles begin to pile up.
- If you RENAME a parent table, for example, then children’s foreign keys follow the table under its new name. This is where our pillar of turtles becomes higher.
- Our examples use ON DELETE NO ACTION (aka RESTRICT), but it doesn’t really matter to our discussion.
- You can disable foreign key checks in your session with SET FOREIGN_KEY_CHECKS=0. You can also run SET GLOBAL FOREIGN_KEY_CHECKS=0
, but this does not affect existing sessions, only ones created after your statement.

gh-ost, fb-osc, pt-online-schema-change, LHM, and Vitess’s VReplication all work by creating a “shadow” table, which I like to call the ghost table. They create the ghost table in the desired target schema, copy the original table’s data into it while also tracking and applying ongoing changes, and finally cut-over: RENAME the original table away, e.g. to _mytable_old, and RENAME the ghost table in its place, at which time it assumes production traffic.
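Tools differ in how they synchronize and lock before this point, but on MySQL the table swap itself can be a single atomic statement (names here follow the _mytable_old example):

```sql
-- Swap the two tables in one atomic operation; production traffic
-- now hits the ghost table under the original name.
RENAME TABLE mytable TO _mytable_old, _mytable_ghost TO mytable;
```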
. Or add a column. Or an index. Whichever. Let’s see what happens.
person
has a foreign key. We therefore create the ghost table with similar foreign key, a child table that references the parent country
table. Funnily, even though InnoDB’s foreign keys live inside a table scope, their names are globally unique. So we create the ghost table as follows:
CREATE TABLE _person_ghost (
  id INT NOT NULL,
  country_id INT NOT NULL,
  name VARCHAR(255) NOT NULL,
  PRIMARY KEY(id),
  KEY country_idx (country_id),
  CONSTRAINT person_country_fk2 FOREIGN KEY (country_id) REFERENCES country(id) ON DELETE NO ACTION
);
We then populate the ghost table:

- pt-online-schema-change is based on synchronous, same-transaction data copy via triggers. At any point in time, if we populate _person_ghost with a row, that row also exists in the original person table during that same transaction. This means the data we insert into _person_ghost is foreign key safe.
- gh-ost, fb-osc, and Vitess use an asynchronous approach, where they tail either the binary logs or a changelog table. It is possible that as we INSERT data into _person_ghost, that data no longer exists in person. It is possible that there’s no matching entry in country! We can overcome that by disabling foreign key checks on the session/connection that populates the ghost table. We run SET FOREIGN_KEY_CHECKS=0 and make the server (and our users!) a promise that, even while populating the table, there may be inconsistencies; we’ll figure it all out at time of cut-over.
- Finally, we cut-over, renaming _person_ghost in place of person.

What have we ended up with? Take a look:
The table person_OLD still exists, and maintains a foreign key constraint on country. Now, suppose we want to delete country number 99. We delete or update all rows in person which point to country 99. Good. We proceed to DELETE FROM country WHERE id=99. We can’t. That’s because person_OLD still has rows where country_id=99.
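With the stale constraint still in place, the delete fails with something along these lines (the `test` schema name is illustrative):

```sql
DELETE FROM country WHERE id = 99;
-- ERROR 1451 (23000): Cannot delete or update a parent row: a foreign key
-- constraint fails (`test`.`person_OLD`, CONSTRAINT `person_country_fk`
-- FOREIGN KEY (`country_id`) REFERENCES `country` (`id`))
```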
The way to drop the foreign key constraint from person_OLD is ALTER TABLE person_OLD DROP FOREIGN KEY person_country_fk. What’s that? An ALTER TABLE? Wasn’t that the thing we wanted to avoid in the first place? There was a reason we ran an online schema change! So that’s an absolute no-go.
pt-online-schema-change offers --alter-foreign-keys-method drop_swap: to get rid of the foreign key, we can drop the old table. The logic it offers is:

- DROP the original table (e.g. person)
- RENAME the ghost table in its place
the ghost table in its placeAlas, more turtles. Dropping a MySQL table is production is a cause for outage. Here’s a lengthy discussion form the gh-ost
repo. Digging my notes shows this post from 2010. This is an ancient problem where dropping a table places locks on buffer pool and on adaptive hash index, and there’s been multiple attempts to work around it. See Vitess’s table lifecycle for more.
Just a couple months ago, the MySQL 8.0.23 release notes indicated that this bug is finally solved. I can’t wait to try it out. Most of the world is not on 8.0.23 yet, and until it is, DROP is a problem.
In my personal experience, if you can’t afford to run a straight ALTER on a table, it’s likely you can’t afford to DROP it.
As the pt-online-schema-change documentation correctly points out, we cause a brief outage after we DROP the person table and before we RENAME TABLE _person_ghost TO person. This is unfortunate but, assuming DROP is instantaneous, indeed brief.
Assuming MySQL 8.0.23 with instantaneous DROP, altering a table with a child-side-only constraint is feasible. Without instantaneous DROP, the migration can be as blocking as a straight ALTER.
I regret to inform that from here things only get worse.
What happens if we naively try to ALTER TABLE country ADD COLUMN currency VARCHAR(16) NOT NULL?
We create a ghost table, we populate the ghost table, we cut-over, and… End up with:
Our naive approach fails miserably. As we RENAME TABLE country TO country_OLD, the children’s foreign keys, on person and company, followed the table entity into country_OLD. We are now in a situation where there is no active constraint on country, and we’re stuck with a legacy table that affects our production.
Other than the DROP issue discussed above, this doesn’t solve the main problem, which is that we are left with no constraint on country.
The shocking result of our naive experiment is that if we want to ALTER TABLE country, we must, concurrently somehow, also ALTER TABLE person and, concurrently somehow, ALTER TABLE company. On the children tables we need to DROP the old foreign key, and create a new foreign key that points into country_ghost.
That’s a lot to unpack.
pt-online-schema-change offers --alter-foreign-keys-method rebuild_constraints. In this method, just before we cut-over and RENAME the tables, we iterate all children and, one by one, run a straight ALTER TABLE on each of the children to DROP the old constraint and ADD the new constraint, pointing to country_ghost (imminently to be renamed to country).
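For our example schema, the per-child rebuild would look something like this sketch (the new constraint name is illustrative, since constraint names must be unique):

```sql
-- Drop the constraint pointing at the original parent, and add one
-- pointing at the ghost parent; this is a straight (blocking) ALTER.
ALTER TABLE person
  DROP FOREIGN KEY person_country_fk,
  ADD CONSTRAINT person_country_fk3
    FOREIGN KEY (country_id) REFERENCES country_ghost (id)
    ON DELETE NO ACTION;
```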
This must happen when the ghost table is in full sync with the original table, or else there can be violations. For pt-online-schema-change, which uses synchronous in-transaction trigger propagation, this works. For gh-ost, Vitess, etc., which use the asynchronous approach, this can only take place while we place a write lock on the original table.

As the pt-online-schema-change documentation correctly indicates, this makes sense only when the children are all very small tables.
This gets worse. Let’s break this down even more.
Best case is achieved when indeed all children tables are very small. Still, we need to place a lock, and either sequentially or concurrently ALTER multiple such small tables.

In my experience, on databases that aren’t trivially small, the opposite is more common: children tables are much larger than parent tables, and running a straight ALTER on children is just not feasible.
Even the best case scenario poses the complexity of recovering/rolling back from error. For example, in a normal online schema change, we set timeouts for DDLs, like the final RENAME. If something doesn’t work out, we time out the DDL, take a step back, and try cutting-over again later on. But our situation is much more complex now. While we hold a write lock, we must run multiple DDLs on the children, repointing their foreign keys from the original country table to country_ghost. What if one of those DDLs fails? We are left in a limbo state. Some of the DDLs may have succeeded. We’d need to either revert them, introducing even more DDLs (remember, we’re still holding locks), or retry the failing DDL. Those are a lot of DDLs to synchronize at the same time, even when they’re at all feasible.
In our scenario, person and company are large tables. A straight ALTER is just not feasible. We began this discussion assuming there’s a problem with ALTER in the first place.

Also, for asynchronous online schema changes the situation is much more complex, since we need to place more locks.
There’s an alluring thought. We bite, and illustrate what it would take to run an online schema change on each of the large children, concurrently to, and coordinated with, an online schema change on the parent.

We want the children to point their FK to country_ghost. So we must kick off the migration on each child after the parent’s migration creates the ghost table, and certainly before cut-over.

Initially, the parent’s ghost table is empty, or barely populated. Isn’t that a problem? Pointing to a parent table which is not even populated? Fortunately for us, we again remember we can disable foreign key checks as our OSC tool populates the child table. Sure, everything is broken at first, but we promise the server and the user that we will figure it all out at cut-over time.
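A sketch of that promise, assuming a child ghost table whose foreign key points at country_ghost:

```sql
-- In the migration's session only; other sessions keep full enforcement.
SET SESSION FOREIGN_KEY_CHECKS=0;

-- This now succeeds even if country_ghost has no row with id=99 yet.
INSERT INTO _person_ghost (id, country_id, name)
VALUES (1234, 99, 'example person');
```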
So far, looks like we have a plan. We need to catch the notification that the country_ghost table is created, and kick off an online migration on person and on company.
We absolutely can’t cut-over country before person and company are complete. That’s why we embarked on altering the children in the first place. We must have the children’s foreign keys point to country_ghost before cutting it over.
But now we need to also consider: when is it safe to cut-over person and company? It is only safe to cut-over when referential integrity is guaranteed. We remember that throughout the parent’s migration there’s no such guarantee, surely not while the table gets populated. And for asynchronous-based migrations, not even after that, because the ghost table always “lags” a bit behind the original table.
The only way to provide a referential integrity guarantee for asynchronous-based migrations is to place a write lock on the parent table (country). We bite. We lock the table for writes, and sync up country_ghost until we’re satisfied both are in complete sync. Now is logically a safe time to cut-over the children.
But notice: this is a single, unique point in time, at which we must cut-over all children, or none. And it gets worse.
In the best case scenario, we place a lock on country, sync up country_ghost, hold the lock, then iterate over all children and cut-over each. All children’s operations succeed. We cut-over the parent.
But this best case scenario depends on getting the best case scenario on each of the children, individually. Remember, an ALTER on a child table means we have to DROP the child’s old table. Recall the impact that has in production. Now multiply by n children. The ALTER on country, while holding a write lock, will need to survive a DROP of both person_OLD and company_OLD. This is the best case.
We have no room for problems. Suppose person cuts over, and we DROP person_OLD. But then company fails to cut-over: there’s a DDL timeout.
We can’t roll back. person is now committed to country_ghost. We can try cutting over company again, and again, and again; but it may keep failing. During these recurring attempts we must keep the lock on country. And try company again. Did it succeed? Phew. We can cut-over country and finally remove the lock.
But what if something really fails? Pro tip: it most certainly happens.
If person made it and company does not (its migration breaks, fails, panics, gets killed, goes into seemingly infinite deadlocks, is unable to cut-over, whichever), we’re left in an inconsistent and impossible scenario. person is committed to country_ghost, but company is still committed to country. We have to keep that lock on country and run a new migration on company! And again, and again. Meanwhile, country is locked. Oh yes, and meanwhile person is also locked. You can’t write to person because you can’t verify that related rows exist in country, because country has a WRITE lock.
I can’t stress this enough: the lock must not be released until all children tables are migrated. So, for our next turtle, what happens on a failover? We get referential integrity corruption, because locks don’t work across servers.
Remember that an OSC works by creating a ghost table and populating it until it is in sync with the original table. This effectively means requiring extra disk space at roughly the same volume as the original table.
In a perfect world, we’d have all the disk space we ever needed. In my experience we’re far from living in a perfect world. I’ve had migrations where we weren’t sure we had the disk space for a single table change.
If we are to ALTER a parent, and as a by-product ALTER all of its children at the same time, we’d need enough free disk space for the combined volume of all affected tables.
In fact, running out of disk space is one of the common reasons for failing an online schema change operation. Consider how low the tolerance is for parent-side schema migration errors. Consider that running out of disk space isn’t something that just gets solved by retrying the cut-over again, and again, … the disk space is not there.
Three migrations running concurrently will not run faster than three migrations running sequentially; that’s my experience, backed by production experiments. In my experience they actually end up taking longer, because they’re all fighting for the same resources, context switching takes its toll, and back-off intervals pile up. Maybe there’s some scenario where they could run slightly faster?
Altering our 200-row country table ends up taking hours and hours due to the large person and company tables. The time for a migration is roughly the sum of the times of all dependent migrations!
Hmmm. Maybe on country we should just run a straight ALTER. I think so; that wins! But it only wins in our particular scenario, as we see next.
The operational complexity of Online Schema Changes for parent-side foreign keys is IMO not feasible. We need to assume all child-side operations are feasible first (I’m looking at you, DROP TABLE), and we have almost zero tolerance for things going wrong. Coordinating multiple migrations is complex, and a failover at the wrong time may cause corruption.
Truly, everything discussed thus far was a simplified situation. We introduce more turtles to our story. Let’s add this table:
CREATE TABLE person_company (
  id INT NOT NULL AUTO_INCREMENT,
  person_id INT NOT NULL,
  company_id INT NOT NULL,
  start_at TIMESTAMP NOT NULL,
  end_at TIMESTAMP NULL,
  PRIMARY KEY (id),
  KEY person_idx (person_id),
  KEY company_idx (company_id),
  CONSTRAINT person_company_person_fk FOREIGN KEY (person_id) REFERENCES person (id) ON DELETE NO ACTION,
  CONSTRAINT person_company_company_fk FOREIGN KEY (company_id) REFERENCES company (id) ON DELETE NO ACTION
);
person_company is a child of person and of company. It’s actually enough that it’s a child of one of them. What’s important is that now person is both a child table and a parent table. So is company. This is a pretty common scenario in schema designs.
ALTER a table that is both a parent and a child?

We introduce no new logic here; we “just” have to combine the logic for both. Given person_company exists, if we wanted to ALTER TABLE person we’d need to:

- Handle person as a child table (implies the DROP issue and outage)
- Handle person as a parent (implies altering person_company and synchronizing the cut-over)

So how do we alter country now?

To ALTER TABLE country, we’d need to:

- Initiate the country OSC; wait till country_ghost is created
- Initiate the person OSC; wait till person_ghost is created, and
- Initiate the company OSC; wait till company_ghost is created
- Initiate the person_company OSC
- Place a lock on country. While this lock is in place:
  - Complete the person migration. Place a lock on person, and
  - Complete the company migration. Place a lock on company.
  - Cut-over person_company. DROP person_company_OLD! Unlock person_company!
  - Cut-over company. DROP company_OLD! Unlock company!
  - Cut-over person. DROP person_OLD! Unlock person!
  - Cut-over country!

And we have near zero tolerance to any failure in the above, and we can’t afford a failover during that time…
It would all be better if we could just run ALTER TABLE in MySQL and have it be truly online, with throttling, and on replicas, too. This doesn’t exist, and our alternative is mostly Online Schema Change tools, where, IMO, handling foreign key constraints on large tables is not feasible.
There’s an alternative to Online Schema Change, which is to run the ALTER on replicas. That comes with its own set of problems, and for this blog post I’ve just run out of fumes. For another time!
This was a no-slides, all-command-line walkthrough of some of orchestrator’s capabilities, highlighting refactoring, topology analysis, takeovers and failovers, and discussing a bit of scripting and HTTP API tips.
The recording is available on YouTube (also embedded on https://dbama.now.sh/#history).
To present orchestrator, I used the new shiny docker CI environment; it’s a single docker image running orchestrator, a 4-node MySQL replication topology (courtesy of dbdeployer), heartbeat injection, Consul, consul-template and HAProxy. You can run it, too! Just clone the orchestrator repo, then run:
./script/dock system
From there, you may follow the same playbook I used in the presentation, available as orchestrator-demo-playbook.sh.
Hope you find the presentation and the playbook to be useful resources.
In the past four years orchestrator was developed at GitHub, using GitHub’s environments for testing. This is very useful for testing orchestrator’s behavior within GitHub, interacting with its internal infrastructure, and validating failover behavior in a production environment. These tests and their results are not visible to the public, though.
Now that orchestrator is developed outside GitHub (that is, outside GitHub the company, not GitHub the platform), I wanted to improve the testing framework, making it visible, accessible and contributable to the community. Thankfully, the GitHub platform has much to offer on that front, and orchestrator now uses GitHub Actions more heavily for testing.
GitHub Actions provide a way to run code in a container in the context of the repository. The most common use case is to run CI tests on receiving a Pull Request. Indeed, when GitHub Actions became available, we switched from Travis CI to Actions for orchestrator’s CI.
Today, orchestrator runs three different tests:
To highlight what each does:
Based on the original CI (and possibly to be split into distinct tests), this CI Action compiles the code, runs unit tests, and runs the suite of integration tests (spinning up both MySQL and SQLite databases and running a series of tests on each backend). This CI job is the “basic” test to see that the contributed code even makes sense.
What’s new in this test is that it now produces an artifact: an orchestrator binary for Linux/amd64. This is again a feature of GitHub Actions; the artifact is kept for a couple of months or so, per the Actions retention policy. Here’s an example; by the time you read this, the binary artifact may or may not still be there.
This means you don’t actually need a development environment on your laptop to be able to build an orchestrator binary. More on this later.
Until recently this was not formalized; I’d test upgrades by deploying them internally at GitHub onto a staging environment. Now upgrades are tested per Pull Request: we spin up a container, deploy orchestrator from the master branch using both MySQL and SQLite backends, then check out the PR branch and redeploy orchestrator using the existing backends; this verifies that, at least backend-database wise, there are no upgrade errors.
At this time the test only validates that the database changes are applicable; in the future this may expand into more elaborate tests.
I’m most excited about this one. Taking ideas from our approach to testing gh-ost with dbdeployer, I created https://github.com/openark/orchestrator-ci-env, which offers a full-blown testing environment for orchestrator, including a MySQL replication topology (courtesy of dbdeployer), Consul, HAProxy and more.
This CI testing environment can also serve as a playground in your local docker setup, see shortly.
The system tests suite offers full blown cluster-wide operations such as graceful takeovers, master failovers, errant GTID transaction analysis and recovery and more. The suite utilizes the CI testing environment, breaks it, rebuilds it, validates it… Expects specific output, expects specific failure messages, specific analysis, specific outcomes.
As example, with the system tests suite, we can test the behavior of a master failover in a multi-DC, multi-region (obviously simulated) environment, where a server marked as “candidate” is lagging behind all others, with strict rules for cross-site/cross-region failovers, and still we wish to see that particular replica get promoted as master. We can test not only the topology aspect of the failover, but also the failover hooks, Consul integration and its effects, etc.
There are now multiple options for developers/contributors to build or just try out orchestrator.
As mentioned earlier, you don’t actually need a development environment. You can use orchestrator’s CI to build and generate a Linux/amd64 orchestrator binary, which you can download and deploy as you see fit.
I’ve signed up for the GitHub Codespaces beta program, and hope to make that available for orchestrator, as well.
orchestrator offers various Docker build/run environments, accessible via the script/dock script:

- alpine linux

This is the orchestrator amusement park. Run script/dock system to spawn the aforementioned CI environment used in system tests, and, on top of that, an orchestrator setup fully integrated with that system.
So that’s an orchestrator-MySQL topology-Consul-HAProxy setup, where orchestrator already has the credentials, pre-loads the MySQL topology, is pre-configured to update Consul upon failover, and has its HAProxy config populated by consul-template, with heartbeat injection, and more. It resembles the HA setup at GitHub, and in the future I expect to provide alternate setups (on top).
Once in that docker environment, one can try running relocations and failovers, test orchestrator’s behavior, etc.
GitHub recently announced GitHub Discussions: think of a Stack Overflow-like place within one’s repo to ask questions, discuss, and vote on answers. It’s expected to be available this summer. When it is, I’ll encourage the community to use it instead of today’s orchestrator-mysql Google Group and, of course, the many questions posted as Issues.
There’s been a bunch of PRs merged recently, with more to come later on. I’m grateful for all contributions. Please understand if I’m still slow to respond.
planet.mysql.com (formerly planetmysql.com) serves as a blog aggregator, collecting news and blog posts on MySQL and its ecosystem. It aggregates some vendor and team blogs, as well as “indie” blogs such as this one.
It has traditionally been the go-to place to catch up on the latest developments, or to read insightful posts. This blog itself has been aggregated in Planet MySQL for some eleven years.
Planet MySQL used to be owned by the MySQL community team. This recently changed with unwelcoming implications for the community.
I recently noticed how a blog post of mine, The state of Orchestrator, 2020 (spoiler: healthy), did not get aggregated in Planet MySQL. After a quick discussion and investigation, it was determined (and confirmed) it was filtered out because it contained the word “MariaDB”. It has later been confirmed that Planet MySQL now filters out posts indicating its competitors, such as MariaDB, PostgreSQL, MongoDB.
Planet MySQL is owned by Oracle and it is their decision to make. Yes, logic implies they would not want to publish a promotional post for a competitor. However, I wish to explain how this blind filtering negatively affects the community.
But, before that, I’d like to share that I first attempted to reach out to whoever is in charge of Planet MySQL at this time (my understanding is that this is a marketing team). Sadly, two attempts at reaching out to them individually, and another attempt at reaching out on behalf of a small group of individual contributors, yielded no response. The owners would not grant me an audience, and would not hear me out. I find it disappointing and will let others draw the morals.
We recognize that planet.mysql.com is an important information feed. It is responsible for a large share of the traffic on my blog, and no doubt on many others. Indie blog posts, or small-team blog posts, practically depend on planet.mysql.com for visibility.
And this is particularly important if you’re an open source developer who is trying to promote an open source project in the MySQL ecosystem. Without this aggregation, you will get significantly less visibility.
But, open source projects in the MySQL ecosystem do not live in MySQL vacuum, and typically target/support MySQL, Percona Server and MariaDB. As examples:
- skeema needs to recognize MariaDB features not present in MySQL
- ProxySQL needs to support MariaDB Galera queries
- orchestrator needs to support MariaDB’s GTID flavor
Consider that a blog post of the form “Project version 1.2.3 now released!” is likely to mention things like “fixed MariaDB GTID setup” or “MariaDB 10.x now supported” etc. Consider just pointing out that “PROJECT X supports MySQL, MariaDB and Percona Server”.
Consider that merely mentioning “MariaDB” gets your blog post filtered out on planet.mysql.com. This has an actual impact on open source development in the MySQL ecosystem. We will lose audience and lose adoption.
I believe the MySQL ecosystem as a whole will be negatively affected as a result, and this will circle back to MySQL itself. I believe this goes against the very interests of Oracle/MySQL.
I’ve been around the MySQL community for some 12 years now. From my observation, there is no doubt that MySQL would not thrive as it does today, without the tooling, blogs, presentations and general advice by the community.
This is more than an estimation. I happen to know that, internally at MySQL, they have used or are using open source projects from the community, projects whose blog posts get filtered out today because they mention “MariaDB”. I find that disappointing.
I have personally witnessed how open source developments broke existing barriers, enabling companies to use MySQL at greater scale, with greater velocity, and with greater stability. I was part of such companies and I’ve personally authored such tools. I’m disappointed that planet.mysql.com filters out my blog posts for those tools without granting me an audience, and I extend my disappointment on behalf of all open source project maintainers.
At this time I consider planet.mysql.com to be a marketing blog, not a community feed, and do not want to participate in its biased aggregation.
Thank you to Tom Krouper, who applied his operational engineering expertise to content publishing problems.
orchestrator. First, a quick historical review:
- At Outbrain, I created orchestrator, as https://github.com/outbrain/orchestrator. I authored several open source projects while working for Outbrain, and created orchestrator to solve discovery, visualization and simple refactoring needs. Outbrain was happy to have the project developed as a public, open source repo from day 1, and it was released under the Apache 2 license. Interestingly, the idea to develop orchestrator came after I attended Percona Live Santa Clara 2014 and watched “ChatOps: How GitHub Manages MySQL” by one Sam Lambert.
- At Booking.com, I continued developing orchestrator, pursuing better failure detection and recovery processes. Booking.com was an incredible playground and testbed for orchestrator: a massive deployment of multiple MySQL/MariaDB flavors and configurations.
- At GitHub, I developed orchestrator under GitHub’s own org, at https://github.com/github/orchestrator. It became a core component in github.com’s high availability design, running failure detection and recoveries across sites and geographical regions, with more to come. These 4+ years have been critical to orchestrator’s development and saw its widespread use. At this time I’m aware of multiple large-scale organizations using orchestrator for high availability and failovers. Some of these are GitHub, Booking.com, Shopify, Slack, Wix, Outbrain, and more. orchestrator is the underlying failover mechanism for Vitess, and is also included in Percona’s PMM. These years saw a significant increase in community adoption and contributions, in published content such as Pythian and Percona technical blog posts, and, not surprisingly, an increase in issues and feature requests.
repo under my own https://github.com/openark org. This means all issues, pull requests, releases, forks, stars and watchers have automatically transferred to the new location: https://github.com/openark/orchestrator. The old links do a “follow me” and implicitly direct to the new location. All external links to code and docs still work. I’m grateful to GitHub for supporting this transfer.
I’d like to thank all the above companies for their support of orchestrator
and of open source in general. Being able to work on the same product throughout three different companies is mind blowing and an incredible opportunity. orchestrator
of course remains open source and licensed with Apache 2. Existing Copyrights are unchanged.
As for what’s next: some personal time off; please understand if there are delays to reviews/answers. My intention is to continue developing orchestrator. Naturally, the shape of future development depends on how orchestrator meets my future work. Nothing changes in that respect: my focus on orchestrator has always been first and foremost the pressing business needs, and then community support as possible. There are some interesting ideas by prominent orchestrator users and adopters, and I’ll share more thoughts in due time.
A GTID set can be a single range, e.g. 0041e600-f1be-11e9-9759-a0369f9435dc:1-3772242, or multiple ranges, e.g. 24a83cd3-e30c-11e9-b43d-121b89fcdde6:1-103775793, 2efbcca6-7ee1-11e8-b2d2-0270c2ed2e5a:1-356487160, 46346470-6561-11e9-9ab7-12aaa4484802:1-26301153, 757fdf0d-740e-11e8-b3f2-0a474bcf1734:1-192371670, d2f5e585-62f5-11e9-82a5-a0369f0ed504:1-10047.
One of the common problems in asynchronous replication is the issue of consistent reads. I’ve just written to the master. Is the data available on a replica yet? We have iterated on this: from reading on the master, to heuristically finding up-to-date replicas based on heartbeats (see presentation and slides) via freno, and we have now settled, in some parts of our apps, on using GTIDs.
GTIDs are reliable, as any replica can give you a definitive answer to the question: have you applied a given transaction or not? Given a GTID entry, say f7b781a9-cbbd-11e9-affb-008cfa542442:12345, one may query the following on a replica:
mysql> select gtid_subset('f7b781a9-cbbd-11e9-affb-008cfa542442:12345', @@global.gtid_executed) as transaction_found;
+-------------------+
| transaction_found |
+-------------------+
| 1 |
+-------------------+
mysql> select gtid_subset('f7b781a9-cbbd-11e9-affb-008cfa542442:123450000', @@global.gtid_executed) as transaction_found;
+-------------------+
| transaction_found |
+-------------------+
| 0 |
+-------------------+
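Not shown in the post, but closely related: MySQL (5.7.5 and later) also offers a blocking variant, WAIT_FOR_EXECUTED_GTID_SET, which waits until the replica has applied a given GTID set:

```sql
-- Wait up to 5 seconds for the replica to apply the given transaction;
-- returns 0 if the set was applied within the timeout, 1 on timeout.
SELECT WAIT_FOR_EXECUTED_GTID_SET('f7b781a9-cbbd-11e9-affb-008cfa542442:12345', 5);
```
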
This is all well, but, given some INSERT or UPDATE on the master, how can I tell what’s the GTID associated with that transaction? There’s good news and bad news.
The good news: you can SET SESSION session_track_gtids = OWN_GTID. This makes the MySQL protocol return the GTID generated by your transaction. The bad news: your driver needs to support extracting that information. At GitHub we author our own Ruby driver, and have implemented the functionality to extract OWN_GTID, much like you’d extract LAST_INSERT_ID. But how does one solve that without modifying the drivers? Here’s a poor person’s solution which gives you inexact, but good enough, info. Following a write (insert, delete, create, …), run:
select gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), @@global.gtid_executed)) as master_generated_gtid;
The idea is to “clean” the executed GTID set of irrelevant entries, by filtering out all ranges that do not belong to the server you’ve just written to (the master). The number 1000000000000000 stands for “a high enough value that will never be reached in practice”. Set your own preferred value, but this value should take you beyond 300 years assuming 100,000 transactions per second.
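A quick sanity check of that arithmetic (a throwaway sketch):

```python
# How long would it take to exhaust the 1000000000000000 ceiling used in
# the query above, at a sustained 100,000 transactions per second?
ceiling = 1_000_000_000_000_000
tps = 100_000
years = (ceiling / tps) / (365 * 24 * 3600)
print(round(years))  # 317
```
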
The value you get is the range on the master itself, e.g.:
mysql> select gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), gtid_subtract(concat(@@server_uuid, ':1-1000000000000000'), @@global.gtid_executed)) as master_generated_gtid;
+-------------------------------------------------+
| master_generated_gtid |
+-------------------------------------------------+
| dc103953-1598-11ea-82a7-008cfa5440e4:1-35807176 |
+-------------------------------------------------+
You may further parse the above to extract dc103953-1598-11ea-82a7-008cfa5440e4:35807176, if you want to hold on to the latest GTID entry. Now, this entry isn’t necessarily your own. Between the time of your write and the time of your GTID query, other writes will have taken place. But the entry you get is either your own or a later one. If you can find that entry on a replica, that means your write is included on the replica.
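If you parse that value client-side, a minimal sketch might look like this (pure string handling; the helper name is mine, and it assumes the well-formed single-UUID result produced by the query above):

```python
def latest_gtid_entry(master_gtid_set: str) -> str:
    """Turn 'uuid:1-35807176' into 'uuid:35807176' (the latest entry)."""
    uuid, ranges = master_gtid_set.strip().split(":", 1)
    # the last range is either 'N' or 'M-N'; keep its upper bound
    upper = ranges.split(":")[-1].split("-")[-1]
    return f"{uuid}:{upper}"

print(latest_gtid_entry("dc103953-1598-11ea-82a7-008cfa5440e4:1-35807176"))
# dc103953-1598-11ea-82a7-008cfa5440e4:35807176
```
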
One may wonder: why do we need to extract the value at all? Why not just select @@global.gtid_executed? Why filter on the master’s UUID only? Logically, the answer is the same if you do that. But in practice, your query may be unfortunate enough to return something like:
select @@global.gtid_executed \G
e71f0cdb-b8ef-11e9-9361-008cfa542442:1-83331,
e742d87f-dea7-11e9-be6d-008cfa542c9e:1-18485,
e7880c0e-ac54-11e9-865a-008cfa544064:1-7331973,
e82043c6-c7d9-11e9-9413-008cfa5440e4:1-61692,
e902678b-b046-11e9-a281-008cfa542c9e:1-83108,
e90d7ff9-e35e-11e9-a9a0-008cfa544064:1-18468,
e929a635-bb40-11e9-9c0d-008cfa5440e4:1-139348,
e9351610-ef1b-11e9-9db4-008cfa5440e4:1-33460918,
e938578d-dc41-11e9-9696-008cfa542442:1-18232,
e947f165-cd53-11e9-b7a1-008cfa5440e4:1-18480,
e9733f37-d537-11e9-8604-008cfa5440e4:1-18396,
e97a0659-e423-11e9-8433-008cfa542442:1-18237,
e98dc1f7-e0f8-11e9-9bbd-008cfa542c9e:1-18482,
ea16027a-d20e-11e9-9845-008cfa542442:1-18098,
ea1e1aa6-e74a-11e9-a7f2-008cfa544064:1-18450,
ea8bc1bd-dd06-11e9-a10c-008cfa542442:1-18203,
eae8c750-aaca-11e9-b17c-008cfa544064:1-85990,
eb1e41e9-af81-11e9-9ceb-008cfa544064:1-86220,
eb3c9b3b-b698-11e9-b67a-008cfa544064:1-18687,
ec6daf7e-b297-11e9-a8a0-008cfa542c9e:1-80652,
eca4af92-c965-11e9-a1f3-008cfa542c9e:1-18333,
ecd110b9-9647-11e9-a48f-008cfa544064:1-24213,
ed26890e-b10b-11e9-a79d-008cfa542c9e:1-83450,
ed92b3bf-c8a0-11e9-8612-008cfa542442:1-18223,
eeb60c82-9a3d-11e9-9ea5-008cfa544064:1-1943152,
eee43e06-c25d-11e9-ba23-008cfa542442:1-105102,
eef4a7fb-b438-11e9-8d4b-008cfa5440e4:1-74717,
eefdbd3b-95b3-11e9-833d-008cfa544064:1-39415,
ef087062-ba7b-11e9-92de-008cfa5440e4:1-9726172,
ef507ff0-98b3-11e9-8b15-008cfa5440e4:1-928030,
ef662471-9a3b-11e9-bd2e-008cfa542c9e:1-954800,
f002e9f7-97ee-11e9-bed0-008cfa542c9e:1-5180743,
f0233228-e9a1-11e9-a142-008cfa542c9e:1-18583,
f04780c4-a864-11e9-9f28-008cfa542c9e:1-83609,
f048acd9-b1d2-11e9-a0b6-008cfa544064:1-70663,
f0573d8c-9978-11e9-9f73-008cfa542c9e:1-85642135,
f0b0a37c-c89c-11e9-804c-008cfa5440e4:1-18488,
f0cfe1ac-e5af-11e9-bc09-008cfa542c9e:1-18552,
f0e4997c-cbc9-11e9-9179-008cfa542442:1-1655552,
f24e481c-b5c4-11e9-aff0-008cfa5440e4:1-83015,
f4578c4b-be6d-11e9-982e-008cfa5440e4:1-132701,
f48bce80-e99f-11e9-94f4-a0369f9432f4:1-18460,
f491adf1-9b04-11e9-bc71-008cfa542c9e:1-962823,
f5d3db74-a929-11e9-90e8-008cfa5440e4:1-75379,
f6696ba7-b750-11e9-b458-008cfa542c9e:1-83096,
f714cb4c-dab7-11e9-adb9-008cfa544064:1-18413,
f7b781a9-cbbd-11e9-affb-008cfa542442:1-18169,
f81f7729-b10d-11e9-b29b-008cfa542442:1-86820,
f88a3298-e903-11e9-88d0-a0369f9432f4:1-18548,
f9467b29-d78c-11e9-b1a2-008cfa5440e4:1-18492,
f9c08f5c-e4ea-11e9-a76c-008cfa544064:1-1667611,
fa633abf-cee3-11e9-9346-008cfa542442:1-18361,
fa8b0e64-bb42-11e9-9913-008cfa542442:1-140089,
fa92234c-cc90-11e9-b337-008cfa544064:1-18324,
fa9755eb-e425-11e9-907d-008cfa542c9e:1-1668270,
fb7843d5-eb38-11e9-a1ff-a0369f9432f4:1-1668957,
fb8ceae5-dd08-11e9-9ed3-008cfa5440e4:1-18526,
fbf9970e-bc07-11e9-9e4f-008cfa5440e4:1-136157,
fc0ffaee-98b1-11e9-8574-008cfa542c9e:1-940999,
fc9bf1e4-ee54-11e9-9ce9-008cfa542c9e:1-18189,
fca4672f-ac56-11e9-8a83-008cfa542442:1-82014,
fcebaa05-dab5-11e9-8356-008cfa542c9e:1-18490,
fd0c88b1-ad1b-11e9-bf3a-008cfa5440e4:1-75167,
fd394feb-e4e4-11e9-bd09-008cfa5440e4:1-18574,
fd687577-b048-11e9-b429-008cfa542442:1-83479,
fdb18995-a79f-11e9-a28d-008cfa542442:1-82351,
fdc72b7f-b696-11e9-ade9-008cfa544064:1-57674,
ff1f3b6b-c967-11e9-ae04-008cfa544064:1-18503,
ff6fe7dc-c186-11e9-9bb4-008cfa5440e4:1-103192,
fff9dd94-ed95-11e9-90b7-008cfa544064:1-911039
This can happen when you fail over to a new master, multiple times; it happens when you don’t recycle UUIDs, when you provision new hosts and let MySQL pick their UUIDs. Returning this amount of data per query is an excessive overhead, which is why we extract the master’s UUID only, whose entry is guaranteed to be limited in size.
I recently had the pleasure of presenting gh-mysql-rewind at FOSDEM. Video and slides are available. Consider following along with the video.
Consider a split brain scenario: a “standard” MySQL replication topology suffered network isolation, and one of the replicas was promoted as new master. Meanwhile, the old master was still receiving writes from co-located apps.
Once the network isolation is over, we have a new master and an old master, and a split-brain situation: some writes only took place on one master; others only took place on the other. What if we wanted to converge the two? What paths do we have to, say, restore the old, demoted master, as a replica of the newly promoted master?
The old master is unlikely to agree to replicate from the new master. Changes have been made. AUTO_INCREMENT values have been taken. UNIQUE constraints will fail.
A few months ago, we at GitHub had exactly this scenario. An entire data center went network isolated. Automation failed over to a 2nd DC. Masters in the isolated DC meanwhile kept receiving writes. At the end of the failover we ended up with a split brain scenario – which we expected. However, an additional, unexpected constraint forced us to fail back to the original DC.
We had to make a choice: we’ve already operated for a long time in the 2nd DC and took many writes, that we were unwilling to lose. We were OK to lose (after auditing) the few seconds of writes on the isolated DC. But, how do we converge the data?
Backups are the trivial way out, but they incur long recovery time. Shipping backup data over the network for dozens of servers takes time. Restore time, catching up with changes since backup took place, warming up the servers so that they can handle production traffic, all take time.
Could we have reduced the recovery time?
There are multiple ways to do that: local backups, local delayed replicas, snapshots… We have embarked on several. In this post I wish to outline gh-mysql-rewind, which programmatically identifies the rogue (aka “bad”) transactions on the network isolated master, rewinds/reverts them, applies some bookkeeping and restores the demoted master as a healthy replica under the newly promoted master, thereby prepared to be promoted if needed.
gh-mysql-rewind is a shell script. It utilizes multiple technologies, some of which do not speak to each other, to be able to do its magic. It assumes and utilizes the following:

- MySQL GTIDs
- Row based replication (binlog_format=ROW)
- binlog_row_image=FULL
- MariaDB’s mysqlbinlog with flashback support

Some breakdown follows.
MySQL GTIDs keep track of all transactions executed on a given server. GTIDs indicate which server (UUID) originated a write, and ranges of transaction sequences. In a clean state, only one writer will generate GTIDs, and on all the replicas we would see the same GTID set, originated with the writer’s UUID.
In a split brain scenario, we would see divergence. It is possible to use GTID_SUBTRACT(old_master-GTIDs, new-master-GTIDs) to identify the exact set of transactions executed on the old, demoted master, right after the failover. This is the essence of the split brain.
For example, assume that just before the network partition, the GTID set on the master was 00020192-1111-1111-1111-111111111111:1-5000. Assume that after the network partition the new master has a UUID of 00020193-2222-2222-2222-222222222222. It began to take writes, and after some time its GTID set showed 00020192-1111-1111-1111-111111111111:1-5000,00020193-2222-2222-2222-222222222222:1-200.

On the demoted master, other writes took place, leading to the GTID set 00020192-1111-1111-1111-111111111111:1-5042.
We will run…
SELECT GTID_SUBTRACT(
'00020192-1111-1111-1111-111111111111:1-5042',
'00020192-1111-1111-1111-111111111111:1-5000,00020193-2222-2222-2222-222222222222:1-200'
);
> '00020192-1111-1111-1111-111111111111:5001-5042'
…to identify the exact set of “bad transactions” on the demoted master.
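For intuition only, the subtraction for this single-UUID example can be sketched in code (GTID_SUBTRACT itself handles arbitrary multi-UUID, multi-range sets; this toy helper does not):

```python
def subtract_prefix(minuend: str, already_seen_upper: int) -> str:
    """Sketch: 'uuid:1-M' minus 'uuid:1-N' (N < M) yields 'uuid:N+1-M'."""
    uuid, rng = minuend.split(":")
    lo, hi = (int(x) for x in rng.split("-"))
    assert lo == 1 and already_seen_upper < hi
    return f"{uuid}:{already_seen_upper + 1}-{hi}"

print(subtract_prefix("00020192-1111-1111-1111-111111111111:1-5042", 5000))
# 00020192-1111-1111-1111-111111111111:5001-5042
```
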
With row based replication, and with FULL image format, each DML (INSERT, UPDATE, DELETE) writes to the binary log the complete row data before and after the operation. This means the binary log has enough information for us to revert the operation.
Developed by Alibaba, flashback has been incorporated into MariaDB. MariaDB’s mysqlbinlog utility supports a --flashback flag, which interprets the binary log in a special way: instead of printing out the events in the binary log in order, it prints the inverted operations in reverse order.
To illustrate, let’s assume this pseudo-code sequence of events in the binary log:
insert(1, 'a')
insert(2, 'b')
insert(3, 'c')
update(2, 'b')->(2, 'second')
update(3, 'c')->(3, 'third')
insert(4, 'd')
delete(1, 'a')
A --flashback
of this binary log would produce:
insert(1, 'a')
delete(4, 'd')
update(3, 'third')->(3, 'c')
update(2, 'second')->(2, 'b')
delete(3, 'c')
delete(2, 'b')
delete(1, 'a')
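The transformation flashback applies can be mimicked in a few lines: invert each event, then emit them in reverse order (a sketch over the pseudo-events above, not MariaDB’s actual implementation):

```python
def invert(event):
    """Invert a single row event: insert<->delete; update swaps images."""
    op, args = event
    if op == "insert":
        return ("delete", args)
    if op == "delete":
        return ("insert", args)
    before, after = args  # update: swap before/after row images
    return ("update", (after, before))

binlog = [
    ("insert", (1, "a")),
    ("insert", (2, "b")),
    ("insert", (3, "c")),
    ("update", ((2, "b"), (2, "second"))),
    ("update", ((3, "c"), (3, "third"))),
    ("insert", (4, "d")),
    ("delete", (1, "a")),
]

# flashback: inverted operations, emitted in reverse order
flashback = [invert(e) for e in reversed(binlog)]
for op, args in flashback:
    print(op, args)
```
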
Alas, MariaDB and flashback do not speak the MySQL GTID language. GTIDs are one of the major points where MySQL and MariaDB have diverged beyond compatibility. The output of MariaDB’s mysqlbinlog --flashback has no mention of GTIDs, nor does the tool take notice of GTIDs in the binary logs in the first place.
This is where we step in. GTIDs provide the information about what went wrong. flashback has the mechanism to generate the reverse sequence of statements. gh-mysql-rewind ties the two together, using MariaDB's mysqlbinlog --flashback to generate the reverse of those binary logs.
This last part is worth elaborating. We have created a time machine. We have the mechanics to make it work. But as any sci-fi fan knows, one of the most important parts of time travel is knowing ahead of time where (or rather, when) you are going to land. Are you back in the Renaissance? Or are you about to appear in the midst of the French Revolution? Better dress accordingly.
In our scenario it is not enough to move MySQL back in time to some consistent state. We want to know at what time we landed, so that we can instruct the rewound server to rejoin the replication chain as a healthy replica. In MySQL terms, we need to make MySQL "forget" everything that ever happened after the split brain: not only in terms of data (which we already did), but also in terms of GTID history.
gh-mysql-rewind will do the math to project, ahead of time, at what "time" (i.e. GTID set) our time machine will arrive. It will issue a RESET MASTER; SET GLOBAL gtid_purged='gtid-of-the-landing-time' to make our rewound MySQL consistent not only with some past dataset, but also with its own perception of the point in time where that dataset existed.
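Projecting the landing time is GTID set arithmetic: the landing set is the demoted master's history minus the errant transactions, i.e. the shared history both servers agree on. A minimal sketch using the example values from this post (the variable names are hypothetical, not the tool's code):

```python
# Illustrative sketch: compute the "landing" GTID set and the statement to
# apply it. Uses the example values from this post; names are hypothetical.

uuid = "00020192-1111-1111-1111-111111111111"
demoted_end = 5042    # demoted master executed 1-5042 on the shared UUID
errant_start = 5001   # errant transactions identified earlier: 5001-5042

# Landing set: everything up to (but excluding) the first errant transaction.
landing_set = f"{uuid}:1-{errant_start - 1}"
statement = f"RESET MASTER; SET GLOBAL gtid_purged='{landing_set}'"
print(statement)
# -> RESET MASTER; SET GLOBAL gtid_purged='00020192-1111-1111-1111-111111111111:1-5000'
```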
Some limitations are due to MariaDB's incompatibility with MySQL, some are due to the nature of MySQL DDL, and some are due to the fact that gh-mysql-rewind is a shell script. Among them: the JSON and POINT data types are not supported, and the tool requires both MySQL's mysqlbinlog as well as MariaDB's mysqlbinlog.
There are a lot of moving parts to this mechanism. A mixture of technologies that don't normally speak to each other, injection of data, prediction of ETA… How reliable is all this?
We run continuous gh-mysql-rewind testing in production to consistently prove that it works as expected. Our testing uses a non-production, dedicated, functional replica. It contaminates the data on the replica, lets gh-mysql-rewind automatically move it back in time, and joins the replica back into the healthy chain.
That's not enough. We actually create a scenario where we can predict, ahead of testing, what the time-of-arrival will be. We checksum the data on that replica at that time. After contaminating and effectively breaking replication, we expect gh-mysql-rewind to revert the changes back to our predicted point in time. We checksum the data again. We expect a 100% match.
See the video or slides for more detail on our testing setup.
At this time the tool is one of several solutions we hope to never need to employ. It is stable and tested. We are looking forward to a promising MySQL development that would provide GTID-revert capabilities using standard commands, such as SELECT undo_transaction('00020192-1111-1111-1111-111111111111:5042').
We have released gh-mysql-rewind as open source, under the MIT license. The public release is a stripped-down version of our own script, which has some GitHub-specific integration. We have general ideas about incorporating this functionality into higher-level tools.
gh-mysql-rewind is developed by the database-infrastructure team at GitHub.