Solving the non-atomic table swap, Take II

Following up and improving on Solving the Facebook-OSC non-atomic table swap problem, we present a better, safe solution.

Quick, quickest recap:

We are working on a triggerless online schema migration solution. It is based on an asynchronous approach, similarly to the FB osc and as opposed to the synchronous solution as used by pt-online-schema-change.

We asynchronously synchronize (is that even a valid statement?) between some table tbl and a ghost table ghost, and at some time we want to cut-over: swap the two; kick out tbl and put ghost in its place and under its name.

However, we cannot use the single statement rename tbl to tbl_old, ghost to tbl, because we use the asynchronous approach, where at the time we lock tbl for writes, we still have some events we need to process and apply onto ghost before swapping the two.

And MySQL does not allow a lock tables tbl write; … ; rename tbl to tbl_old, ghost to tbl.

In Solving the Facebook-OSC non-atomic table swap problem we suggested a way that works, unless when it doesn’t work. Read the caveat at the end of the post. Premature death of a connection that participates in the algorithm causes a chain reaction that leads to the premature execution of the rename statement, potentially before we’ve applied those remaining events. This leads to data inconsistency between the old table and the new table, and is unacceptable.

To that effect, we were more inclined to go with the Facebook solution, which makes a two-step: lock tables tbl write; alter table tbl rename to tbl_old; … ; alter table ghost rename to tbl;

This two-step solution is guaranteed not to have data inconsistency. Alas, it also implies an outage. There’s a brief moment, in between the two renames, and during that time where we apply those last changes, where the table tbl is simply not there.

Not all applications will fail gracefully on such a scenario.

UDF

We looked at a solution based on UDFs, where we would create global wait conditions, that are not connection based.

We don’t like UDFs. You need to compile them for every new version. Puppetizing their setup is not fun. We wouldn’t like maintaining this. We wouldn’t like doing the operations for this. Neither would the community.

We want to make this a community solution. Can we do without UDF?

Rewriting MySQL

We wish to avoid forking our own version of MySQL. It’s not what we do and it’s a pain.

A pure MySQL solution?

We found a solution to embrace; it is optimistic, and safe. hat optimistic means is explained further on, but let’s discuss safe:

The previous solution we came up with as unsafe because breakage of a single component in the algorithm would lead to inconsistent data. The algorithm itself was fine, as long as no one would break it from the outside. This is the concern: what if some crazy cronjob that cleans up connections (kills idle connections, kills long running transactions) or some unfortunate user command kills one of the connections involved in the cut-over phase? This is not something that would happen every day, but can we protect against it? Our priority is to keep our data intact.

The solution allows breakage. Even in the face of death of connections, data is not lost/corrupted, and at worst — causes a FB-like, recoverable outage scenario.

A step towards the solution, a flawed one

I wish to illustrate something that looks like it would work, but in fact has a hidden flaw. We will later improve on that solution.

Let’s assume we have tbl, ghost tables. We execute the following by multiple connections; we call them C1, C2, C3, …:

C1: lock tables tbl write;
C2, C3, …, C17: normal app connections, issuing insert, delete, update on tbl. Because of the lock, they are naturally blocked.
We apply those last event we need to apply onto ghost. No new events are coming our way because tbl is blocked.
C18: rename table tbl to tbl_old, ghost to tbl; (blocked as well)
C1: unlock tables; (everything gets released)

Let’s consider the above, and see why it is flawed. But first, why it would typically work in the first place.

Connections C2, …, C17 came first, and C18 came later. Nevertheless MySQL prioritizes C18 and moves it up the queue of waiting queries on tbl. When we unlock, C18 is the first to execute.
We only issue the rename once we’re satisfied we’ve applied those changes. We only unlock once we’re satisfied that the rename has been executed.
If for some reason C1 disconnects before we issue the rename – no problem, we just retry from scratch.

What’s the flaw?

We rename when C1 holds the lock. We check with C1 that it is alive and kicking. Yep, it’s connected and holding the lock. Are you sure? Yep, I’m good! Really really sure? Yep! OK then, let’s rename!

“Oh darn”, says C1, “now that you went ahead to rename, but just before you actually sent the request, I decided to take time off and terminate”. Or, more realistically, some job would kill C1.

What happens now? The rename is not there yet. All those queries get released, and are immediately applied onto tbl, and then the rename applies, kicks all those changes into oblivion, and puts ghost in place, where it immediately receives further writes.

Those blocking queries were committed but never to be seen again.

So here’s another way to look at the problem: the rename made it through even though the connection C1 died just prior to that, whereas we would have loved the rename to abort upon such case.

Is there a way in MySQL to cause an operation to fail or block when another connection dies? It’s the other way around! Connections hold locks, and those get released when they die!

But there’s a way…

Three step, safe, optimistic solution

Here are the steps to a safe solution:

C1: lock tables tbl write;
C2, C3, …, C17: normal app connections, issuing insert, delete, update on tbl. Because of the lock, they are naturally blocked.
We apply those last event we need to apply onto ghost. No new events are coming our way because tbl is blocked.
C18: checking that C1 is still alive, then rename table tbl to tbl_old
C19: checking to see that C18’s rename is in place (via show processlist), and that C1 is still alive; then issues: rename table ghost to tbl
(meanwhile more queries approach tbl, it doesn’t matter, they all get deprioritized, same as C2…C17)
C1: unlock tables

What just happened? Let’s first explain some stuff:

C18’s rename gets prioritized over the DMLs, even though it came later. That is how MySQL prioritizes queries on metadata-locked tables.
C18 checks C1 is still alive, but as before, there’s always the chance C1 will die just at the wrong time — we’re going to address that.
C19 is interested to see that C18 began execution, but potentially C18 will crash by the time C19 actually issues its own rename — we’re going to address that
C19’s query sounds weird. At that time tbl still exists. You’d expect it to fail immediately — but it does not. It’s valid. This is because tbl‘s metadata lock is in use.
C19 gets prioritized over all the DMLs, but is known to be behind C18. The two stay in same order of arrival. So, C18 is known to execute before C19.
When C1 unlocks, C18 executes first.
Metadata lock is still in place on tbl even though it doesn’t actually exist, because of C19.
C19 operates next.
Finally all the DMLs execute.

What happens on failures?

If C1 dies just as C18 is about to issue the rename, we get an outage: tbl is renamed to tbl_old, and the queries get released and complain the table is just not there.
- C19 will not initiate because it is executed after C18 and checks that C1 is alive — which turns to be untrue.
- So we know we have outage, and we quickly rename tbl_old to tbl; and go drink coffee, then begin it all again.
- The outage is unfortunate, but does not put our data in danger.
If C1 happens to die just as C19 is about to issue its rename, there’s no data integrity: at this point we’ve already asserted the tables are in sync. As C1 dies, C18 will immediately rename tbl to tbl_old. An outage will occur, but not for long, because C19 will next issue rename ghost to tbl, and close the gap. We suffered a minor outage, but no rollback. We roll forward.
If C18 happens to die just as C19 is about to issue its rename, nothing bad happens: C19 is still blocking for as long as C1 is running. We find out C18 died, and release C1. C19 attempts to rename ghost onto tbl, but tbl exists and the query fails. The metadata lock is released and all the queries resume operation on the original tbl. The operation failed but without error. We will need to try the entire cycle again.
If both C1 and C18 fail at the time C19 is about to begin its rename, same as above.
If C18 fails as C19 is already in place, same as above.
If C1 fails as C19 is already in place, it’s as good as having it issue the unlock tables. We’re happy.
If C19 fails at any given point, we suffer outage. We revert by rename tbl_old to tbl

This solution relies on the notion that if a previous connection failed, we would not be able to rename ghost to tbl because the table would still be there. That’s what we were looking for; but instead of looking at locks, which get released when a connection terminates, we used a persistent entity: a table.

Conclusion

The algorithm above is optimistic: if no connections get weirdly killed, it’s a valid locking solution, and queries & app are unaware that anything happened (granted, app will notice write latency). If connections do get weirdly killed, we get table-outage at worst case — an outage that is already considered to be a valid solution anyhow. The algorithm will not allow data corruption.

4 thoughts on “Solving the non-atomic table swap, Take II”

shlomi says:

June 20, 2016 at 8:44 pm

It’s worth noting that on replicas this is a non atomic two-step swap.
On replica, there will be no writes on the table in between the two RENAMEs, but there may potentially be other writes from other tables in between those RENAMEs.
Justin Swanhart says:

June 20, 2016 at 10:46 pm

How about using an EVENT that does a ‘SELECT * FROM TABLE FOR UPDATE’ then sits waiting for a GET_LOCK lock to show up and releases the SELECT lock when it does?

This way you can:
event: select * from X for update
event: loop (with short sleep()) until IS_USED_LOCK(‘release_lock_X’);
rollback;

asynch process:
finish apply
rename table a to b; –blocks on SELECT FOR UPDATE
in second connection GET_LOCK(‘release_lock_X’)
rename completes
Justin Swanhart says:

June 20, 2016 at 10:55 pm

You can of course do lock table write instead of select count(*) for update, but I am not sure if innodb always honors lock table write… I seem to recall an option that prevents that, but I can’t remember it off the top of my head.
shlomi says:

June 21, 2016 at 10:22 am

@Justin,

– The `select * for update` may take long minutes to potentially a couple hours to merely grab the locks; meanwhile it will pollute the master’s buffer pool (though I agree an ongoing migration also pollutes it, just not all at the same time)

– I thought about relying on events, but that means I must have `event_scheduler` enabled. Seems fine, but I want our tool to go public and be used by everyone, so maybe some people turn off their scheduler, and if the tool happens to turn it on, bad things happen.

– And still some `pt-kill` might kill the event_scheduler-exeuted connection. What happens if my `rename table a to b` executes 1ms _after_ said connection was killed? Events have been applied to the original table, and we just shoved them aside…

This site uses Akismet to reduce spam. Learn how your comment data is processed.