Discussing online schema migrations with Oracle's MySQL engineering managers

November 23, 2016

Last week I had the pleasant opportunity of introducing and discussing the operation of online schema migrations with MySQL's engineering managers, as part of their annual meeting in London.

Together with Simon J. Mudd of Booking.com, we discussed our perception of what it takes to run online schema migrations on a live, busy system.

While the Oracle/MySQL engineers develop new features or optimize behavior in MySQL, we of the industry have the operational expertise and understanding of the flow of working with MySQL. In all topics, and in schema migration in particular, there is a gap between what's perceived to be the use case and what the use case actually is. It is the community's task to provide feedback to Oracle so as to align development with operational needs where possible.

Our meeting included the following:

Need for schema migrations

We presented, based on our experience at current and past companies and on the experience of our friends in the community, the case for online schema migrations. At GitHub, at Booking.com and in many other companies I'm familiar with, we continuously deploy to production, and this implies continuous schema migrations to our production databases. We have migrations running daily; sometimes multiple per day, sometimes none.

Three wishes for a new year

September 28, 2016

(Almost) another new year by the Jewish calendar. What do I wish for the following year?

  1. World peace
  2. Good health to all
  3. Relaxed GTID constraints

I'm still not using GTID, and still see operational issues in working with GTID. As a recent example, our new schema migration solution, gh-ost, allows us to test migrations in production, on replicas. The GTID catch? gh-ost has to write something to the binary log. Thus, it "corrupts" the replica with a bogus GTID entry that will never be met in another server, making said replica unsafe to promote. We can work around this, but...

I understand the idea and need for the Executed GTID Set. It will certainly come in handy with multi-writer InnoDB Cluster. However for most use cases GTID poses a burden. The reason is that our topologies are imperfect, and we as humans are imperfect, and operations are most certainly imperfect. We may wish to operate on a replica: test something, by intention or mistake. We may wish to use a subchain as the seed for a new cluster split. We may wish to be able to write to downstream replicas. We may use a 3rd party tool that issues a flush tables with read lock without disabling sql_log_bin. Things just happen.

For that, I would like to suggest GTID control levels, such as:

  1. Strict: same as Oracle's existing implementation. Executed sets, purged sets, whatnot.
  2. Last executed: a mode where the only thing that counts is the last executed GTID value. If I repoint a replica, all it needs to check is "hey, this is my last executed GTID entry, give me the coordinates of yours. And, no, I don't care about comparing executed and purged sets, I will trust you and keep running from that point on".
  3. Declarative: GTIDs are generated, are visible in each and every binary log entry, but are completely ignored.
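
To make the difference between the first two levels concrete, here is a minimal Python sketch. It is only an illustration of the wish above, not of any existing MySQL behavior: GTIDs are simplified to plain strings, and the function names `can_replicate_strict` and `resume_index_last_executed` are hypothetical. The third, declarative level would need no check at all.

```python
from typing import List, Optional, Set

Gtid = str  # simplified stand-in for a real GTID, e.g. "a:7"


def can_replicate_strict(replica_executed: Set[Gtid],
                         master_executed: Set[Gtid],
                         master_purged: Set[Gtid]) -> bool:
    """Strict level: compare whole sets before letting the replica connect."""
    # The replica must not hold transactions the master has never seen.
    no_errant = replica_executed <= master_executed
    # The replica must not need anything the master has already purged.
    nothing_missing = master_purged <= replica_executed
    return no_errant and nothing_missing


def resume_index_last_executed(master_binlog: List[Gtid],
                               replica_last: Gtid) -> Optional[int]:
    """Last-executed level: find the replica's last GTID in the master's
    binary log and resume right after it, ignoring everything else."""
    try:
        return master_binlog.index(replica_last) + 1
    except ValueError:
        return None  # the master does not know this entry
```

With a replica that carries one bogus entry from a test migration, the strict check refuses the repoint, while the last-executed lookup only asks where the last replicated entry sits in the master's log.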

I realize Oracle MySQL GTID has been out for over 3 years now, but I'm sorry - I still have reservations and see use cases where I fear it will not serve me right.

How about my previous years' wishes? World peace and good health never came through, however:

  • My 2015 wish for "decent, operations friendly built in online table refactoring" was unmet, however gh-ost is a thing now and exceeds my expectations. No, really. Please come see Tom & myself present gh-ost and how it changed our migration paradigm.
  • My 2012 wish for "decent, long waited for, implementation of Window Functions (aka Analytic Functions) for MySQL" was met by MariaDB's window functions.
    Not strictly Window Functions, but Oracle MySQL 8.0 will support CTE (hierarchical/recursive), worth a mention.

See you in Amsterdam!

gh-ost 1.0.17: Hooks, Sub-second lag control, Amazon RDS and more

September 6, 2016

gh-ost version 1.0.17 is now released, with various additions and fixes. Here are some notes of interest:


Hooks

gh-ost now supports hooks. These are your own executables that gh-ost will invoke at particular points of interest (validation pass, about to cut-over, success, failure, status, etc.)

gh-ost will set various environment variables for your executables to pick up, passing along such information as migrated/ghost table name, elapsed time, processed rows, migrated host etc.
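
As an illustration, a hook can be as small as the following Python script. This is a sketch under assumptions: the `GH_OST_*` variable names are written as I recall them from the gh-ost hooks documentation and should be verified there, and `describe_migration` is a hypothetical helper, not part of gh-ost.

```python
#!/usr/bin/env python3
import os


def describe_migration(env):
    """Build a one-line summary from the variables gh-ost is expected to export.

    The variable names are assumptions to be checked against the hooks docs.
    """
    database = env.get("GH_OST_DATABASE_NAME", "?")
    table = env.get("GH_OST_TABLE_NAME", "?")
    ghost = env.get("GH_OST_GHOST_TABLE_NAME", "?")
    host = env.get("GH_OST_MIGRATED_HOST", "?")
    rows = env.get("GH_OST_COPIED_ROWS", "?")
    elapsed = env.get("GH_OST_ELAPSED_SECONDS", "?")
    return (f"{database}.{table} -> {ghost} on {host}: "
            f"{rows} rows copied in {elapsed}s")


if __name__ == "__main__":
    # A real hook would typically notify chat, write an audit record, etc.
    print(describe_migration(os.environ))
```

Saved as an executable file under the hooks directory, with a name starting with the hook it serves (gh-ost-on-success, for example), such a script would get its input purely from the environment.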

Sub-second lag control

At GitHub we're very strict about replication lag. We keep it well under 1 second at most times. gh-ost can now identify sub-second lag on replicas (well, you need to supply it with the right query). Our current production migrations are set by default with --max-lag-millis=500 or less, and our most intensive migrations keep replication lag well below 1 second or even below 500ms.
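
The idea behind sub-second lag control can be sketched in a few lines of Python. This is a simplified model, not gh-ost's code: it assumes a heartbeat timestamp with sub-second precision is written on the master and read back on the replica, and the names `lag_millis` and `should_throttle` are hypothetical.

```python
from datetime import datetime, timedelta


def lag_millis(heartbeat_written_at: datetime, observed_at: datetime) -> int:
    """Replication lag in whole milliseconds, based on a heartbeat timestamp
    that was written on the master and read on the replica."""
    delta = observed_at - heartbeat_written_at
    return max(0, delta // timedelta(milliseconds=1))


def should_throttle(lag_ms: int, max_lag_millis: int = 500) -> bool:
    """Pause row copy while the measured lag exceeds the threshold."""
    return lag_ms > max_lag_millis
```

With a threshold of 500, a heartbeat observed 350ms after it was written lets the migration proceed, while one observed 720ms late makes it back off until replicas catch up.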


No SUPER

The SUPER privilege is required to set global binlog_format='ROW' and for STOP SLAVE; START SLAVE;

If you know your replica has RBR, you can pass --assume-rbr and gh-ost skips those steps.


RDS

Hooks + No Super = RDS, as seems to be the case. For --test-on-replica you will need to supply your own gh-ost-on-stop-replication hook, to stop your RDS replica at cut-over phase. See this tracking issue

MySQL vs. PostgreSQL, gh-ost perspective

August 11, 2016

Last week we released gh-ost, GitHub's online schema migration tool for MySQL. As with other open source releases in the MySQL ecosystem, this release was echoed by several "Why not PostgreSQL?" comments. Having been active in open source for many years now, I'm familiar with these responses, and I find this is a good time to share my thoughts. Why? XKCD knows the answer:

XKCD: Duty Calls

I picked one post I wish to address (latest commit: 3dfbd2cd3f5468f035ec86442d2c670a510118d8). The author invested some time writing it. It nicely summarizes claims I've heard over the years, as well as some prejudice. Through responding to this post I will be generalizing thoughts and impressions to address the common reactions. Dear @brandur, let's grab a beer some day; I fundamentally disagree with your post and with its claims.

EDIT: linked post has been updated following this writing; I'd like to thank the author for his consideration. Also see his followup post. The version I've responded to in this post is this commit.

Introducing gh-ost: triggerless online schema migrations

August 1, 2016

I'm thoroughly happy to introduce gh-ost: a triggerless, controllable, auditable, testable, trusted online schema change tool released today by GitHub.

gh-ost now powers our production schema migrations. We hit some serious limitations using pt-online-schema-change on our large volume, high traffic tables, to the effect of driving our database to a near grinding halt or even to the extent of causing outages. With gh-ost, we are now able to migrate our busiest tables at any time, peak hours and heavy workloads included, without causing impact to our service.

gh-ost supports testing in production. It goes a long way to build trust, both in integrity and in control. Are your databases just too busy and you cannot run existing online-schema-change tools? Have you suffered outages due to migrations? Are you tired of babysitting migrations that run up to 3:00am? Tired of being the only one tailing logs? Please, take a look at gh-ost. I believe it changes the online migration paradigm.

For a more thorough overview, please read the announcement on the GitHub Engineering Blog, and proceed to the documentation.

gh-ost is open sourced under the MIT license.

Solving the non-atomic table swap, Take III: making it atomic

July 7, 2016

With the unintended impression of becoming live blogging, we now follow up on Solving the non-atomic table swap, Take II and Solving the Facebook-OSC non-atomic table swap problem with a safe, blocking, atomic solution

Why yet another iteration?

The solution presented in Solving the non-atomic table swap, Take II was good, in that it was safe. No data corruption. Optimistic: if no connection is killed throughout the process, then completely blocking.

Two outstanding issues remained:

  • If something did go wrong, the solution reverted to a table-outage
  • On replicas, the table swap is non-atomic and non-blocking. There's a table-outage scenario on the replica.

As it turns out, there's a simpler solution which overcomes both the above. As with math and physics, the simpler solution is often the preferred one. But it took those previous iterations to gather a few ideas together. So, anyway:

Safe, locking, atomic, asynchronous table swap

Do read the aforementioned previous posts; the quick-quick recap is: we want to be able to LOCK a table tbl, then do some stuff, then swap it out and put some ghost table in its place. MySQL does not allow us to rename tbl to tbl_old, ghost to tbl if we have locks on tbl in that session.

The solution we offer is now based on two connections only (as opposed to three, in the optimistic approach). "Our" connections will be C10, C20. The "normal" app connections are C1..C9, C11..C19, C21..C29.

  • Connections C1..C9 operate on tbl with normal DML: INSERT, UPDATE, DELETE
  • Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
  • Connection C10: LOCK TABLES tbl WRITE, tbl_old WRITE
  • Connections C11..C19, newly incoming, issue queries on tbl but are blocked due to the LOCK
  • Connection C20: RENAME TABLE tbl TO tbl_old, ghost TO tbl
    This is blocked due to the LOCK, but gets prioritized over connections C11..C19 and over C1..C9 or any other connection that attempts DML on tbl
  • Connections C21..C29, newly incoming, issue queries on tbl but are blocked due to the LOCK and due to the RENAME, waiting in queue
  • Connection C10: checks that C20's RENAME has been issued and is waiting (looks for the blocked RENAME in processlist)
  • Connection C10: DROP TABLE tbl_old
    Nothing happens yet; tbl is still locked. All other connections still blocked.
  • Connection C10: UNLOCK TABLES
    The RENAME is first to execute, ghost table is swapped in place of tbl, then C1..C9, C11..C19, C21..C29 all get to operate on the new and shiny tbl
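
The sequence above can be walked through with a toy Python model. It is a sketch only: tables are dictionary entries, there is no real locking or concurrency, and the model simply assumes, as described above, that the queued RENAME runs ahead of queued DML. The function name `simulate_cut_over` and the `c10_dies_before_drop` switch are hypothetical. The switch shows why the sentry table matters: if C10's connection dies before the DROP, its locks are released, the RENAME finds tbl_old still there and fails, and tbl stays as it was.

```python
def simulate_cut_over(c10_dies_before_drop: bool = False):
    """Toy model of the two-connection cut-over; returns (tables, served)."""
    tables = {"tbl": "original rows", "ghost": "migrated rows"}

    # C10: CREATE TABLE tbl_old (the sentry), then LOCK TABLES tbl, tbl_old.
    tables["tbl_old"] = "sentry"

    # Statements arriving while the lock is held simply queue up.
    blocked = [("C11", "dml"), ("C20", "rename"), ("C21", "dml")]

    # C10: confirm the RENAME is waiting, then DROP TABLE tbl_old.
    assert ("C20", "rename") in blocked
    if not c10_dies_before_drop:
        del tables["tbl_old"]  # still under lock; nothing else moves yet

    # C10: UNLOCK TABLES (a dying connection releases its locks as well).
    # Model assumption: the RENAME is served before any queued DML.
    blocked.sort(key=lambda stmt: stmt[1] != "rename")  # stable sort

    served = []
    for conn, kind in blocked:
        if kind == "rename":
            if "tbl_old" in tables:
                served.append((conn, "rename failed: tbl_old exists"))
                continue
            tables["tbl_old"] = tables.pop("tbl")
            tables["tbl"] = tables.pop("ghost")
            served.append((conn, "renamed"))
        else:
            served.append((conn, tables["tbl"]))  # DML sees whatever tbl is now
    return tables, served
```

In the normal run every queued DML lands on the migrated table; in the failure run the queued DML lands on the original table and the ghost table is left untouched for another attempt.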

Some notes

Solving the non-atomic table swap, Take II

June 20, 2016

Following up and improving on Solving the Facebook-OSC non-atomic table swap problem, we present a better, safe solution.

Quick, quickest recap:

We are working on a triggerless online schema migration solution. It is based on an asynchronous approach, similarly to the FB osc and as opposed to the synchronous solution as used by pt-online-schema-change.

We asynchronously synchronize (is that even a valid statement?) between some table tbl and a ghost table ghost, and at some time we want to cut-over: swap the two; kick out tbl and put ghost in its place and under its name.

However, we cannot use the single statement rename tbl to tbl_old, ghost to tbl, because we use the asynchronous approach, where at the time we lock tbl for writes, we still have some events we need to process and apply onto ghost before swapping the two.

And MySQL does not allow a lock tables tbl write; ... ; rename tbl to tbl_old, ghost to tbl.

In Solving the Facebook-OSC non-atomic table swap problem we suggested a way that works, except when it doesn't. Read the caveat at the end of the post. Premature death of a connection that participates in the algorithm causes a chain reaction that leads to the premature execution of the rename statement, potentially before we've applied those remaining events. This leads to data inconsistency between the old table and the new table, and is unacceptable.

To that effect, we were more inclined to go with the Facebook solution, which makes a two-step: lock tables tbl write; alter table tbl rename to tbl_old; ... ; alter table ghost rename to tbl;

This two-step solution is guaranteed not to have data inconsistency. Alas, it also implies an outage. There's a brief moment, in between the two renames, and during that time where we apply those last changes, where the table tbl is simply not there.

Not all applications will fail gracefully on such a scenario.

Solving the Facebook-OSC non-atomic table swap problem

May 3, 2016

We present a way to use an atomic, blocking table swap in the Facebook Online-Schema-Change solution, as well as in a rumored, other Online-Schema-rumored-Change solution. Update: also a caveat.

Quick recap (really quick)

pt-online-schema-change and facebook-osc are two popular online-schema-change solutions for MySQL. They both use triggers, but in different ways. While the Percona tool uses synchronous table updates, such that any INSERT|UPDATE|DELETE on the modified table causes an INSERT|UPDATE|DELETE on a ghost table, in the Facebook tool all cause an INSERT on a changelog table, which is then iterated, read, having entries applied on the ghost table.

The TL;DR is that DMLs on the table propagate synchronously, within same transaction in the Percona tool, and asynchronously, with lag, in the Facebook tool.
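
A minimal Python sketch of the two propagation styles follows. It models only the trigger behavior described above, with dictionaries standing in for tables; the copy of pre-existing rows is left out, and the class and function names are hypothetical, taken from neither tool.

```python
def apply_change(table, op, key, value=None):
    """Apply one row change to a dict standing in for a table."""
    if op == "DELETE":
        table.pop(key, None)
    else:  # INSERT or UPDATE
        table[key] = value


class SynchronousCopy:
    """Percona style: the trigger changes the ghost table in the same transaction."""

    def __init__(self):
        self.tbl, self.ghost = {}, {}

    def write(self, op, key, value=None):
        apply_change(self.tbl, op, key, value)
        apply_change(self.ghost, op, key, value)


class AsynchronousCopy:
    """Facebook style: the trigger only INSERTs into a changelog, applied later."""

    def __init__(self):
        self.tbl, self.ghost, self.changelog = {}, {}, []

    def write(self, op, key, value=None):
        apply_change(self.tbl, op, key, value)
        self.changelog.append((op, key, value))

    def drain_changelog(self):
        # Replay changelog entries onto the ghost table, oldest first.
        while self.changelog:
            apply_change(self.ghost, *self.changelog.pop(0))
```

In the synchronous model the ghost table matches the original after every write. In the asynchronous model it matches only once the changelog has been drained, which is why writes must stop and the changelog must be fully applied before any swap.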

What's the problem with the table swap?

In the Percona tool, once the logic is satisfied the copy is complete, we issue this query:

RENAME TABLE tbl TO tbl_old, tbl_new TO tbl;

This is an atomic, two table RENAME operation.

However with the asynchronous nature of the Facebook tool, such a RENAME would be a mistake. We must first block writes to the modified table, then make sure we have iterated the changelog table to the point of lock, apply those changes onto the ghost table, and only then do the swap.

The problem is: you cannot RENAME TABLES while one of them is LOCKed.

This is silly, and inconsistent, because:

> LOCK TABLES tbl WRITE;
Query OK, 0 rows affected (0.00 sec)

> RENAME TABLE tbl TO tbl_old, tbl_new TO tbl;
ERROR 1192 (HY000): Can't execute the given command because you have active locked tables or an active transaction

> ALTER TABLE tbl RENAME TO tbl_old;
Query OK, 0 rows affected (0.00 sec)

Why would the RENAME fail where the ALTER works?

Small thing, but critical to the operation of the online-schema-change. From the Facebook OSC documentation:

Since alter table causes an implicit commit in innodb, innodb locks get released after the first alter table. So any transaction that sneaks in after the first alter table and before the second alter table gets a 'table not found' error. The second alter table is expected to be very fast though because copytable is not visible to other transactions and so there is no need to wait.

What the FB solution means

It means for a very brief duration, the table is not there. Your app will get errors.

Of course, we should be able to handle errors anytime, anywhere. But the honest truth is: we (as in the world) do not. Many apps will fail ungracefully should they get a table not found error.

An atomic swap, as compared, would make for briefly blocking operations, making the app ignorant of the swap.


Rumor has it that we at GitHub are developing a new, triggerless, Online Schema Change tool. It is rumored to be based off binary logs and is rumored to have lots of interesting rumored implications.

Such rumored implementation would have to be asynchronous by nature, or so rumors say. And as such, it would fall for the same non-atomic table swap problem.


Once we heard it was rumored we were working on a triggerless online schema change tool, we realized we would have to solve the non-atomic swap problem. What we did was to gossip about it among ourselves, which led to three different rumors of a solution, eventually manifested as three different working solutions. All three solutions make for blocking queries on the app's side. I will present one of these solutions here, based on voluntary locks.

MySQL Community Awards 2016: the Winners

April 22, 2016

The MySQL Community Awards initiative is an effort to acknowledge and thank individuals and corporates for their contributions to the MySQL ecosystem. It is a from-the-community, by-the-community and for-the-community effort. The committee is composed of an independent group of community members of different orientation and opinion, themselves past winners or known contributors to the community.

The 2016 community awards were presented on April 21st, 2016, during the keynotes at the Percona Live conference. The winners are:

MySQL Community Awards: Community Contributor of the year 2016

  • Bill Karwin
    Bill has been working with the community for years, helping them understand SQL. Bill is the author of the great book "SQL Antipatterns". He has given a large amount of help on sites such as StackOverflow, Quora, and of course many conference talks. Bill has provided a huge amount of help to the community directly.
  • Domas Mituzas
    Domas Mituzas started in the MySQL ecosystem as a MySQL Support Engineer at MySQL AB. Since he had some spare time, he did a lot of work to scale MySQL at Wikipedia. He is now a small data engineer at Facebook, mostly working with user-facing data systems. He continues to write very interesting blog posts and bug reports. Domas is responsible for giving us MyDumper, PoorMansProfiler, and the infamous Query Cache tuner!
  • Yoshinori Matsunobu
    Yoshinori Matsunobu is currently leading the MyRocks effort to get the RocksDB storage engine for MySQL into production at Facebook. Previously (amongst his other accomplishments) he created HandlerSocket, and implemented MHA to support failover automation for MySQL – both of which have been used at many companies. He is a frequent speaker at community events, and his tutorials and slide decks do a lot to increase expertise in the community. He is a frequent bug reporter with a focus on replication (RBR, semi-sync).

MySQL Community Awards: Application of the year 2016


MySQL Community Awards 2016: Call for Nominations!

February 5, 2016

The 2016 MySQL Community Awards event will take place, as usual, in Santa Clara, during the Percona Live Data Performance Conference, April 2016.

The MySQL Community Awards is a community-based initiative. The idea is to publicly recognize contributors to the MySQL ecosystem. The entire process of discussing, voting and awarding is controlled by an independent group of community members, typically composed of past winners or their representatives, as well as known contributors.

It is a self-appointed, self-declared, self-making-up-the-rules-as-it-goes committee. It is also very aware of the importance of the community; a no-nonsense, non-political, adhering to tradition, self criticizing committee.

The Call for Nominations is open. We are seeking the community’s assistance in nominating candidates in the following categories:

MySQL Community Awards: Community Contributor of the year 2016

This is a personal award; a winner would be a person who has made a contribution to the MySQL ecosystem. This could be via development, advocating, blogging, speaking, supporting, etc. All things go.

MySQL Community Awards: Application of the year 2016

An application, project, product etc. which supports the MySQL ecosystem by either contributing code, complementing its behaviour, supporting its use, etc. This could range from a one man open source project to a large scale social service.

MySQL Community Awards: Corporate Contributor of the year 2016

A company that made a contribution to the MySQL ecosystem. This might be a corporate which released major open source code; one that advocates for MySQL; one that helps out community members by... anything.

For a list of previous winners, please see MySQL Hall of Fame.
