Solving the Facebook-OSC non-atomic table swap problem

May 3, 2016

We present a way to use an atomic, blocking table swap in the Facebook Online-Schema-Change solution, as well as in a rumored, other Online-Schema-rumored-Change solution. Update: also a caveat.

Quick recap (really quick)

pt-online-schema-change and facebook-osc are two popular online-schema-change solutions for MySQL. They both use triggers, but in different ways. While the Percona tool uses synchronous table updates, such that any INSERT|UPDATE|DELETE on the modified table causes an INSERT|UPDATE|DELETE on a ghost table, in the Facebook tool all cause an INSERT on a changelog table, which is then iterated, read, having entries applied on the ghost table.

The TL;DR is that DMLs on the table propagate synchronously, within same transaction in the Percona tool, and asynchronously, with lag, in the Facebook tool.

What's the problem with the table swap?

In the Percona tool, once the logic is satisfied the copy is complete, we issue this query:

RENAME TABLE tbl TO tbl_old, tbl_new TO tbl;

This is an atomic, two table RENAME operation.

However with the asynchronous nature of the Facebook tool, such a RENAME would be a mistake. We must first block writes to the modified table, then make sure we have iterated the changelog table to the point of lock, apply those changes onto the ghost table, and only then do the swap.

The problem is: you cannot RENAME TABLES while one of them is LOCKed.

This is silly, and inconsistent, because:

> LOCK TABLES tbl WRITE;
Query OK, 0 rows affected (0.00 sec)

> RENAME TABLE tbl TO tbl_old, tbl_new TO tbl;
ERROR 1192 (HY000): Can't execute the given command because you have active locked tables or an active transaction

> ALTER TABLE tbl RENAME TO tbl_old;
Query OK, 0 rows affected (0.00 sec)

Why would the RENAME fail where the ALTER works?

Small thing, but critical to the operation of the online-schema-change. From the Facebook OSC documentation:

Since alter table causes an implicit commit in innodb, innodb locks get released after the first alter table. So any transaction that sneaks in after the first alter table and before the second alter table gets a 'table not found' error. The second alter table is expected to be very fast though because copytable is not visible to other transactions and so there is no need to wait.

What the FB solution means

It means for a very brief duration, the table is not there. Your app will get errors.

Of course, we should be able to handle errors anytime, aywhere. But the honest truth is: we (as in the world) do not. Many apps will fail ungracefully should they get a table not found error.

An atomic swap, as compared, would make for briefly blocking operations, making the app ignorant of the swap.

Rumor

Rumor has it that we at GitHub are developing a new, triggerless, Online Schema Change tool. It is rumored to be based off binary logs and is rumored to have lots of interesting rumored implications.

Such rumored implementation would have to be asynchronous by nature, or so rumors say. And as such, it would fall for the same non-atomic table swap problem.

Solution

Once we heard it was rumored we were working on a triggerless online schema change tool, we realized we would have to solve the non-atomic swap problem. What we did was to gossip about it in between ourselves, which led to three different rumors of a solution, eventually manifested as three different working solutions. All three solutions make for blocking queries on the app's side. I will present one of these solution here, based on voluntary locks.

The idea is to make a table locked without actually issuing a LOCK TABLE statement, such that we would be able to run a RENAME TABLE operation, that would wait until we say it's good to complete.

Let's assume:

  • Our table is tbl
  • Ghost table (table onto which we've actually made the changes) is tbl_new
  • Our app continuously writes to tbl
  • We have 3 connections on our tables, aptly named #1, #2, #3

We issue the following, in this order:

  1. #1:
    SELECT GET_LOCK('ding', 0);

    Lock acquired, no problems

  2. #2:
    SELECT RELEASE_LOCK('ding') FROM tbl WHERE GET_LOCK('ding', 999999)>=0 LIMIT 1;

    Ignore the RELEASE_LOCK for now, this is merely cleanup. The query attempts to read one row from tbl where GET_LOCK('ding')>=0. But 'ding' is locked, hence the entire query blocks.
    Otherwise, other queries on tbl (both reads and writes) are running fine.

  3. #3:
    RENAME TABLE tbl TO tbl_old, tbl_new TO tbl;

    Now the magic begins. The RENAME operation cannot proceed while queries are executing on tbl. In particular, it waits on #2 to complete. But #2 is blocked on #1, so it does not complete. Our RENAME is also blocked!
    There are further two consequences that work to our advantage:

    • Any further incoming INSERT|UPDATE|DELETE on tbl is now unable to proceed; such queries will now wait for the RENAME to complete. So no further updated on tbl are being applied. App is blocked
    • tbl_new is nonblocked! And this is because how RENAME works internally. Since it couldn't satisfy the first clause, it doesn't even check for the second, and does not place a LOCK on tbl_new.
  4. OSC:
    Now that no further writes are possible on tbl, we satisfy ourselves that we've iterated to the last of the changelog entries and have applied changes to tbl_new. Exactly how we satisfy ourselves is a matter of implementation. Rumor is that we use a rumor that the last entry was handled in our rumored solution. That last part is actually not a pun.
    We are now content that all changes have been applied to tbl_new.
  5. #1:
    SELECT RELEASE_LOCK('ding');

    Ding! Connection #2 gets released, reads some row from tbl (but no one is actually interested in the result set) and completes. The #3 RENAME is not blocking on anything anymore. It executes. The tables are swapped. Once they are swapped, any INSERT|UPDATE|DELETEs that were pending on tbl are released and App is unblocked.
    The atomic swap is complete.

Implementation

Agony. This workaround is agonizing. Is agonization a word? By rumor written in Go, our OSC has this implemented via goroutines, and the code is one of those atrocities you are ashamed to look at. Well, it's OK under the circumstances. But really, implementing this is painful, and actually more complicated than the above description. Why is that?

In the above we make fro two blocking operations: #2 and #3. We must not proceed to #3 before #2 is applied, and we must not proceed to OSC completion before #3 is applied. But how does our code know that it's being blocked? If it's being blocked, it can't tell me anything, because it's blocked. If it's not blocked yet, it can tell me it's not blocked yet, but I'm really interested in knowing the time it gets blocked.

But preferably the exact time, or near exact, because one we start blocking, App suffers. Connections accumulate. We really want to make the swap as quick as possible (and by rumor we have a rollback & retry mechanism for this operation if it exceeds X seconds).

Unfortunately the solution involves polling. That is, Once we issue #2 (asynchronously, right? It's blocking), we aggressively poll SHOW PROCESSLIST and look for that blocked query. And the same for #3. Polling is a form of necessary ugliness in this flow.

Other solutions

The other two solutions do not use a voluntary lock. Instead:

  1. Use a LOCK on some yet another table and a query involving that table JOINed with tbl
  2. A SELECT ... FOR UPDATE on yet another table followed by a SELECT on the locked row on that table JOINed with tbl.

We leave the implementation as an exercise for the reader.

Can't we just make the RENAME work under LOCK?

Yeah. That's what the Facebook people said. "Hey, we can just fix this".

Update: caveat

Should connection #1 or connection #2 die unexpectedly before we are satisfied the events have all been applied, the `RENAME` gets unblocked due to the collapse of locks, and we end up with a premature swap of the tables, potentially before we have applied the latest entries from the changelog table. This was noted by my colleague Gillian Gunson, and we keep looking into this.

 
Powered by Wordpress and MySQL. Theme by openark.org