orchestrator Puppet module now available

February 1, 2017

We have just open sourced and published an orchestrator puppet module. This module is authored by Tom Krouper of GitHub's database infrastructure team, and is what we use internally at GitHub for deploying orchestrator.

The module manages the orchestrator service, the config file (inherit to override values), etc (pun intended). Check it out!



Some observations on MySQL to sqlite migration & compatibility

January 30, 2017

I'm experimenting with sqlite as backend database for orchestrator. While orchestrator manages MySQL replication topologies, it also uses MySQL as backend. For some deployments, and I'm looking into such one, having MySQL as backend is a considerable overhead.

This sent me to the route of looking into a self contained orchestrator binary + backend DB. I would have orchestrator spawn up its own backend database instead of connecting to an external one.

Why even relational?

Can't orchestrator just use a key-value backend?

Maybe it could. But frankly I enjoy the power of relational databases, and the versatility they offer has proven itself multiple times with orchestrator, being able to answer interesting, new, complex questions about one's topology by crafting SQL queries.

Moreover, orchestrator is already heavily invested in the relational model. At this time, replacing all SQL queries with key-value reads seems to me as a significant investment in time and risk. So I was looking for a relational, SQL accessible embeddable database for orchestrator.

Why sqlite?

I am in particular looking at two options: sqlite (via the go-sqlite3 binding) and TiDB. sqlite does not need much introduction, and I'll just say it's embeddable within the golang-built binary. Continue Reading »

Discussing online schema migrations with Oracle's MySQL engineering managers

November 23, 2016

Last week I had the pleasant opportunity of introducing and discussing the operation of online schema migrations to MySQL's engineering managers, as part of their annual meeting, in London.

Together with Simon J. Mudd of Booking.com, we discussed our perception of what it takes to run online schema migrations on a live, busy system.

While the Oracle/MySQL engineers develop new features or optimize behavior in the MySQL, we of the industry have the operational expertise and understanding of the flow of working with MySQL. In all topics, and in schema migration in particular, there is a gap between what's perceived to be the use case and what the use case actually is. It is the community's task to provide feedback back to Oracle so as to align development to match operations need where possible.

Our meeting included the following:

Need for schema migrations

We presented, based on our experience in current and past companies, and based on our friends of the community's experience, the case for online schema migrations. At GitHub, at Booking.com and in many other companies I'm familiar with, we continuously deploy to production, and this implies continuous schema migrations to our production databases. We have migrations running daily; sometimes multiple per day, some time none. Continue Reading »

Three wishes for a new year

September 28, 2016

(Almost) another new year by Jewish calendar. What do I wish for the following year?

  1. World peace
  2. Good health to all
  3. Relaxed GTID constraints

I'm still not using GTID, and still see operational issues with working with GTID. As a latest example, our new schema migration solution, gh-ost, allows us to test migrations in production, on replicas. The GTID catch? gh-ost has to write something to the binary log. Thus, it "corrupts" the replica with a bogus GTID entry that will never be met in another server, thus making said replica unsafe to promote. We can work around this, but...

I understand the idea and need for the Executed GTID Set. It will certainly come in handy with multi-writer InnoDB Cluster. However for most use cases GTID poses a burden. The reason is that our topologies are imperfect, and we as humans are imperfect, and operations are most certainly imperfect. We may wish to operate on a replica: test something, by intention or mistake. We may wish to use a subchain as the seed for a new cluster split. We may wish to be able to write to downstream replicas. We may use a 3rd party tool that issues a flush tables with read lock without disabling sql_log_bin. Things just happen.

For that, I would like to suggest GTID control levels, such as:

  1. Strict: same as Oracle's existing implementation. Executed sets, purged sets, whatnot.
  2. Last executed: a mode where the only thing that counts is the last executed GTID value. If I repoint replica, all it needs to check is "hey this is my last executed GTID entry, give me the coordinates of yours. And, no, I don't care about comparing executed and purged sets, I will trust you and keep running from that point on"
  3. Declarative: GTIDs are generated, are visible in each and every binary log entry, but are completely ignored.

I realize Oracle MySQL GTID is out for some over 3 years now, but I'm sorry - I still have reservations and see use cases where I fear it will not serve me right.

How about my previous years wishes? World peace and good health never came through, however:

  • My 2015 wish for "decent, operations friendly built in online table refactoring" was unmet, however gh-ost is a thing now and exceeds my expectations. No, really. Please come see Tom & myself present gh-ost and how it changed our migration paradigm.
  • My 2012 wish for "decent, long waited for, implementation of Window Functions (aka Analytic Functions) for MySQL" was met by MariaDB's window functions.
    Not strictly Window Functions, but Oracle MySQL 8.0 will support CTE (hierarchial/recursive), worth a mention.

See you in Amsterdam!

gh-ost 1.0.17: Hooks, Sub-second lag control, Amazon RDS and more

September 6, 2016

gh-ost version 1.0.17 is now released, with various additions and fixes. Here are some notes of interest:


gh-ost now supports hooks. These are your own executables that gh-ost will invoke at particular points of interest (validation pass, about to cut-over, success, failure, status, etc.)

gh-ost will set various environment variables for your executables to pick up, passing along such information as migrated/ghost table name, elapsed time, processed rows, migrated host etc.

Sub-second lag control

At GitHub we're very strict about replication lag. We keep it well under 1 second at most times. gh-ost can now identify sub-second lag on replicas (well, you need to supply with the right query). Our current production migrations are set by default with --max-lag-millis=500 or less, and our most intensive migrations keep replication lag well below 1sec or even below 500ms


The SUPER privilege is required to set global binlog_format='ROW' and for STOP SLAVE; START SLAVE;

If you know your replica has RBR, you can pass --assume-rbr and skips those steps.


Hooks + No Super = RDS, as seems to be the case. For --test-on-replica you will need to supply your own gh-ost-on-stop-replication hook, to stop your RDS replica at cut-over phase. See this tracking issue Continue Reading »

MySQL vs. PostgreSQL, gh-ost perspective

August 11, 2016

Last week we released gh-ost, GitHub's online schema migration tool for MySQL. As with other open source releases in the MySQL ecosystem, this release was echoed by several "Why not PostgreSQL?" comments. Having been active in open source since many years now, I'm familiar with these responses, and I find this is a good time to share my thoughts. Why? XKCD knows the answer:

XKCD: Duty Calls

I picked one post I wish to address (latest commit: 3dfbd2cd3f5468f035ec86442d2c670a510118d8). The author invested some time writing it. It nicely summarizes claims I've heard over the years, as well as some prejudice. Through responding to this post I will be generalizing thoughts and impressions to address the common reactions. Dear @brandur, let's grab a beer some day; I fundamentally disagree with your post and with its claims.

EDIT: linked post has been updated following this writing; I'd like to thank the author for his consideration. Also see his followup post. The version I've responded to in this post is this commit. Continue Reading »

Introducing gh-ost: triggerless online schema migrations

August 1, 2016

I'm thoroughly happy to introduce gh-ost: triggerless, controllable, auditable, testable, trusted online schema change tool released today by GitHub.

gh-ost now powers our production schema migrations. We hit some serious limitations using pt-online-schema-change on our large volume, high traffic tables, to the effect of driving our database to a near grinding halt or even to the extent of causing outages. With gh-ost, we are now able to migrate our busiest tables at any time, peak hours and heavy workloads included, without causing impact to our service.

gh-ost supports testing in production. It goes a long way to build trust, both in integrity and in control. Are your databases just too busy and you cannot run existing online-schema-change tools? Have you suffered outages due to migrations? Are you tired of babysitting migrations that run up to 3:00am? Tired of being the only one tailing logs? Please, take a look at gh-ost. I believe it changes online migration paradigm.

For a more thorough overview, please read the announcement on the GitHub Engineering Blog, and proceed to the documentation.

gh-ost is open sourced under the MIT license.

Solving the non-atomic table swap, Take III: making it atomic

July 7, 2016

With the unintended impression of becoming live blogging, we now follow up on Solving the non-atomic table swap, Take II and Solving the Facebook-OSC non-atomic table swap problem with a safe, blocking, atomic solution

Why yet another iteration?

The solution presented in Solving the non-atomic table swap, Take II was good, in that it was safe. No data corruption. Optimistic: if no connection is killed throughout the process, then completely blocking.

Two outstanding issues remained:

  • If something did go wrong, the solution reverted to a table-outage
  • On replicas, the table swap is non atomic, non blocking. There's table-outage scenario on replica.

As it turns out, there's a simpler solution which overcomes both the above. As with math and physics, the simpler solution is often the preferred one. But it took those previous iterations to gather a few ideas together. So, anyway:

Safe, locking, atomic, asynchronous table swap

Do read the aforementioned previous posts; the quick-quick recap is: we want to be able to LOCK a table tbl, then do some stuff, then swap it out and put some ghost table in its place. MySQL does not allow us to rename tbl to tbl_old, ghost to tbl if we have locks on tbl in that session.

The solution we offer is now based on two connections only (as opposed to three, in the optimistic approach). "Our" connections will be C10, C20. The "normal" app connections are C1..C9, C11..C19, C21..C29.

  • Connections C1..C9 operate on tbl with normal DML: INSERT, UPDATE, DELETE
  • Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
  • Connection C10: LOCK TABLES tbl WRITE, tbl_old WRITE
  • Connections C11..C19, newly incoming, issue queries on tbl but are blocked due to the LOCK
  • Connection C20: RENAME TABLE tbl TO tbl_old, ghost TO tbl
    This is blocked due to the LOCK, but gets prioritized on top connections C11..C19 and on top C1..C9 or any other connection that attempts DML on tbl
  • Connections C21..C29, newly incoming, issue queries on tbl but are blocked due to the LOCK and due to the RENAME, waiting in queue
  • Connection C10: checks that C20's RENAME is applied (looks for the blocked RENAME in processlist)
  • Connection 10: DROP TABLE tbl_old
    Nothing happens yet; tbl is still locked. All other connections still blocked.
  • Connection 10: UNLOCK TABLES
    The RENAME is first to execute, ghost table is swapped in place of tbl, then C1..C9, C11..C19, C21..C29 all get to operate on the new and shiny tbl

Some notes Continue Reading »

Solving the non-atomic table swap, Take II

June 20, 2016

Following up and improving on Solving the Facebook-OSC non-atomic table swap problem, we present a better, safe solution.

Quick, quickest recap:

We are working on a triggerless online schema migration solution. It is based on an asynchronous approach, similarly to the FB osc and as opposed to the synchronous solution as used by pt-online-schema-change.

We asynchronously synchronize (is that even a valid statement?) between some table tbl and a ghost table ghost, and at some time we want to cut-over: swap the two; kick out tbl and put ghost in its place and under its name.

However, we cannot use the single statement rename tbl to tbl_old, ghost to tbl, because we use the asynchronous approach, where at the time we lock tbl for writes, we still have some events we need to process and apply onto ghost before swapping the two.

And MySQL does not allow a lock tables tbl write; ... ; rename tbl to tbl_old, ghost to tbl.

In Solving the Facebook-OSC non-atomic table swap problem we suggested a way that works, unless when it doesn't work. Read the caveat at the end of the post. Premature death of a connection that participates in the algorithm causes a chain reaction that leads to the premature execution of the rename statement, potentially before we've applied those remaining events. This leads to data inconsistency between the old table and the new table, and is unacceptable.

To that effect, we were more inclined to go with the Facebook solution, which makes a two-step: lock tables tbl write; alter table tbl rename to tbl_old; ... ; alter table ghost rename to tbl;

This two-step solution is guaranteed not to have data inconsistency. Alas, it also implies an outage. There's a brief moment, in between the two renames, and during that time where we apply those last changes, where the table tbl is simply not there.

Not all applications will fail gracefully on such a scenario. Continue Reading »

Solving the Facebook-OSC non-atomic table swap problem

May 3, 2016

We present a way to use an atomic, blocking table swap in the Facebook Online-Schema-Change solution, as well as in a rumored, other Online-Schema-rumored-Change solution. Update: also a caveat.

Quick recap (really quick)

pt-online-schema-change and facebook-osc are two popular online-schema-change solutions for MySQL. They both use triggers, but in different ways. While the Percona tool uses synchronous table updates, such that any INSERT|UPDATE|DELETE on the modified table causes an INSERT|UPDATE|DELETE on a ghost table, in the Facebook tool all cause an INSERT on a changelog table, which is then iterated, read, having entries applied on the ghost table.

The TL;DR is that DMLs on the table propagate synchronously, within same transaction in the Percona tool, and asynchronously, with lag, in the Facebook tool.

What's the problem with the table swap?

In the Percona tool, once the logic is satisfied the copy is complete, we issue this query:

RENAME TABLE tbl TO tbl_old, tbl_new TO tbl;

This is an atomic, two table RENAME operation.

However with the asynchronous nature of the Facebook tool, such a RENAME would be a mistake. We must first block writes to the modified table, then make sure we have iterated the changelog table to the point of lock, apply those changes onto the ghost table, and only then do the swap.

The problem is: you cannot RENAME TABLES while one of them is LOCKed.

This is silly, and inconsistent, because:

Query OK, 0 rows affected (0.00 sec)

> RENAME TABLE tbl TO tbl_old, tbl_new TO tbl;
ERROR 1192 (HY000): Can't execute the given command because you have active locked tables or an active transaction

> ALTER TABLE tbl RENAME TO tbl_old;
Query OK, 0 rows affected (0.00 sec)

Why would the RENAME fail where the ALTER works?

Small thing, but critical to the operation of the online-schema-change. From the Facebook OSC documentation:

Since alter table causes an implicit commit in innodb, innodb locks get released after the first alter table. So any transaction that sneaks in after the first alter table and before the second alter table gets a 'table not found' error. The second alter table is expected to be very fast though because copytable is not visible to other transactions and so there is no need to wait.

What the FB solution means

It means for a very brief duration, the table is not there. Your app will get errors.

Of course, we should be able to handle errors anytime, aywhere. But the honest truth is: we (as in the world) do not. Many apps will fail ungracefully should they get a table not found error.

An atomic swap, as compared, would make for briefly blocking operations, making the app ignorant of the swap.


Rumor has it that we at GitHub are developing a new, triggerless, Online Schema Change tool. It is rumored to be based off binary logs and is rumored to have lots of interesting rumored implications.

Such rumored implementation would have to be asynchronous by nature, or so rumors say. And as such, it would fall for the same non-atomic table swap problem.


Once we heard it was rumored we were working on a triggerless online schema change tool, we realized we would have to solve the non-atomic swap problem. What we did was to gossip about it in between ourselves, which led to three different rumors of a solution, eventually manifested as three different working solutions. All three solutions make for blocking queries on the app's side. I will present one of these solution here, based on voluntary locks. Continue Reading »

Powered by Wordpress and MySQL. Theme by openark.org