{"id":7567,"date":"2016-06-20T11:26:47","date_gmt":"2016-06-20T09:26:47","guid":{"rendered":"http:\/\/code.openark.org\/blog\/?p=7567"},"modified":"2016-06-20T14:29:33","modified_gmt":"2016-06-20T12:29:33","slug":"solving-the-non-atomic-table-swap-take-ii","status":"publish","type":"post","link":"https:\/\/code.openark.org\/blog\/mysql\/solving-the-non-atomic-table-swap-take-ii","title":{"rendered":"Solving the non-atomic table swap, Take II"},"content":{"rendered":"<p>Following up and improving on\u00a0<a title=\"Link to Solving the Facebook-OSC non-atomic table swap problem\" href=\"http:\/\/code.openark.org\/blog\/mysql\/solving-the-facebook-osc-non-atomic-table-swap-problem\" rel=\"bookmark\">Solving the Facebook-OSC non-atomic table swap problem<\/a>, we present a better, safe solution.<\/p>\n<h3>Quick, quickest recap:<\/h3>\n<p>We are working on a triggerless online schema migration solution. It is based on an asynchronous approach, similarly to the <a href=\"https:\/\/www.facebook.com\/notes\/mysql-at-facebook\/online-schema-change-for-mysql\/430801045932\/\">FB osc<\/a> and as opposed to the synchronous solution as used by\u00a0<a href=\"https:\/\/www.percona.com\/doc\/percona-toolkit\/2.2\/pt-online-schema-change.html\">pt-online-schema-change<\/a>.<\/p>\n<p>We asynchronously synchronize (is that even a valid statement?) between some table <strong>tbl<\/strong> and a ghost table\u00a0<strong>ghost<\/strong>, and at some time we want to cut-over: swap the two; kick out\u00a0<strong>tbl<\/strong> and put\u00a0<strong>ghost<\/strong> in its place and under its name.<\/p>\n<p>However, we cannot use the single statement <strong>rename tbl to tbl_old, ghost to tbl<\/strong>, because we use the asynchronous approach, where at the time we lock\u00a0<strong>tbl<\/strong> for writes, we still have some events we need to process and apply onto\u00a0<strong>ghost<\/strong> before swapping the two.<\/p>\n<p>And MySQL does not allow a <strong>lock tables tbl write; &#8230; ;\u00a0<\/strong><strong>rename tbl to tbl_old, ghost to tbl<\/strong>.<\/p>\n<p>In\u00a0<a title=\"Link to Solving the Facebook-OSC non-atomic table swap problem\" href=\"http:\/\/code.openark.org\/blog\/mysql\/solving-the-facebook-osc-non-atomic-table-swap-problem\" rel=\"bookmark\">Solving the Facebook-OSC non-atomic table swap problem<\/a>\u00a0we suggested a way that works, unless when it doesn&#8217;t work. Read the caveat at the end of the post. Premature death of a connection that participates in the algorithm causes a chain reaction that leads to the premature execution of the\u00a0<strong>rename<\/strong> statement, potentially before we&#8217;ve applied those remaining events. This leads to data inconsistency between the old table and the new table, and is unacceptable.<\/p>\n<p>To that effect, we were more inclined to go with the Facebook solution, which makes a two-step:\u00a0<strong>lock tables tbl write; alter table tbl rename to tbl_old; &#8230; ; alter table ghost rename to tbl;<\/strong><\/p>\n<p>This two-step solution is guaranteed not to have data inconsistency. Alas, it also implies an outage. 
There&#8217;s a brief moment, in between the two\u00a0<strong>rename<\/strong>s, and during that time where we apply those last changes, where the table\u00a0<strong>tbl<\/strong> is simply not there.<\/p>\n<p>Not all applications will fail gracefully on such a scenario.<!--more--><\/p>\n<h3>UDF<\/h3>\n<p>We looked at a solution based on UDFs, where we would create global wait conditions, that are not connection based.<\/p>\n<p>We don&#8217;t like UDFs. You need to compile them for every new version. Puppetizing their setup is not fun. We wouldn&#8217;t like maintaining this. We wouldn&#8217;t like doing the operations for this. Neither would the community.<\/p>\n<p>We want to make this a community solution. Can we do without UDF?<\/p>\n<h3>Rewriting MySQL<\/h3>\n<p>We wish to avoid forking our own version of MySQL. It&#8217;s not what we do and it&#8217;s a pain.<\/p>\n<h3>A pure MySQL solution?<\/h3>\n<p>We found a solution to embrace; it is <em>optimistic<\/em>, and <em>safe<\/em>. hat <em>optimistic<\/em> means is explained further on, but let&#8217;s discuss <em>safe<\/em>:<\/p>\n<p>The previous solution we came up with as <em>unsafe<\/em> because breakage of a single component in the algorithm would lead to inconsistent data. The algorithm itself was fine, as long as no one would break it from the outside. This is the concern: what if some crazy cronjob that cleans up connections (kills idle connections, kills long running transactions) or some unfortunate user command kills one of the connections involved in the cut-over phase? This is not something that would happen every day, but can we protect against it? Our priority is to keep our data intact.<\/p>\n<p>The solution allows breakage. Even in the face of death of connections, data is not lost\/corrupted, and at worst &#8212; causes a FB-like, recoverable outage scenario.<\/p>\n<h3>A step towards the solution, a\u00a0flawed one<\/h3>\n<p>I wish to illustrate something that looks like it would work, but in fact has a hidden flaw. We will later improve on that solution.<\/p>\n<p>Let&#8217;s assume we have\u00a0<strong>tbl<\/strong>,\u00a0<strong>ghost<\/strong> tables. We execute the following by multiple connections; we call them C1, C2, C3, &#8230;:<\/p>\n<ul>\n<li>C1:\u00a0<strong>lock tables tbl write;<\/strong><\/li>\n<li>C2, C3, &#8230;, C17: normal app connections, issuing\u00a0<strong>insert, delete, update<\/strong> on\u00a0<strong>tbl<\/strong>. Because of the lock, they are naturally blocked.<\/li>\n<li>We apply those last event we need to apply onto\u00a0<strong>ghost<\/strong>. No new events are coming our way because\u00a0<strong>tbl<\/strong> is blocked.<\/li>\n<li>C18:\u00a0<strong>rename table tbl to tbl_old, ghost to tbl; <\/strong>(blocked as well)<\/li>\n<li>C1:\u00a0<strong>unlock tables<\/strong><strong>;\u00a0<\/strong>(everything gets released)<\/li>\n<\/ul>\n<p>Let&#8217;s consider the above, and see why it is flawed. But first, why it would typically work in the first place.<\/p>\n<ul>\n<li>Connections C2, &#8230;, C17 came first, and C18 came later. Nevertheless MySQL prioritizes C18 and moves it up the queue of waiting queries on\u00a0<strong>tbl<\/strong>. When we\u00a0<strong>unlock<\/strong>, C18 is the first to execute.<\/li>\n<li>We only issue the <strong>rename<\/strong> once we&#8217;re satisfied we&#8217;ve applied those changes. 
<p>Let's consider the above and see why it is flawed. But first, why it would typically work in the first place:</p>

<ul>
<li>Connections C2, ..., C17 came first, and C18 came later. Nevertheless MySQL prioritizes C18 and moves it up the queue of waiting queries on <strong>tbl</strong>. When we <strong>unlock</strong>, C18 is the first to execute.</li>
<li>We only issue the <strong>rename</strong> once we're satisfied we've applied those changes. We only <strong>unlock</strong> once we're satisfied that the <strong>rename</strong> has been executed.</li>
<li>If for some reason C1 disconnects before we issue the <strong>rename</strong>, no problem: we just retry from scratch.</li>
</ul>

<h4>What's the flaw?</h4>

<p>We <strong>rename</strong> while C1 holds the <strong>lock</strong>. We check with C1 that it is alive and kicking. Yep, it's connected and holding the lock. Are you sure? Yep, I'm good! Really really sure? Yep! OK then, let's <strong>rename</strong>!</p>

<p>"Oh darn", says C1, "now that you went ahead to <strong>rename</strong>, but just before you actually sent the request, I decided to take time off and terminate". Or, more realistically, some job would kill C1.</p>

<p>What happens now? The <strong>rename</strong> is not in place yet. All those blocked queries get released and are immediately applied onto <strong>tbl</strong>; only <em>then</em> does the <strong>rename</strong> apply, kicking all those changes into oblivion (they land in <strong>tbl_old</strong>, never to be migrated) and putting <strong>ghost</strong> in place, where it immediately receives further writes.</p>

<p>Those blocked queries were committed, but their changes are never to be seen again.</p>

<p>So here's another way to look at the problem: the <strong>rename</strong> made it through even though the connection C1 died just prior to it, whereas we would have loved the <strong>rename</strong> to abort in such a case.</p>

<p>Is there a way in MySQL to cause an operation to <strong>fail or block</strong> when another connection dies? It's the other way around! Connections hold locks, and those get released when they die!</p>

<p>But there's a way...</p>

<h3>Three-step, safe, optimistic solution</h3>

<p>Here are the steps to a safe solution (a per-connection sketch follows the list):</p>

<ul>
<li>C1: <strong>lock tables tbl write;</strong></li>
<li>C2, C3, ..., C17: normal app connections, issuing <strong>insert, delete, update</strong> on <strong>tbl</strong>. Because of the lock, they are naturally blocked.</li>
<li>We apply those last events we need to apply onto <strong>ghost</strong>. No new events are coming our way because <strong>tbl</strong> is blocked.</li>
<li>C18: after checking that C1 is still alive, issues <strong>rename table tbl to tbl_old</strong></li>
<li>C19: after checking that C18's <strong>rename</strong> is in place (via <strong>show processlist</strong>) <strong>and</strong> that C1 is still alive, issues <strong>rename table ghost to tbl</strong></li>
<li>(Meanwhile more queries approach <strong>tbl</strong>; it doesn't matter, they all get deprioritized, same as C2...C17.)</li>
<li>C1: <strong>unlock tables</strong></li>
</ul>
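<p>As a per-connection sketch (same names as above; the liveness checks are shown as comments, since how exactly the orchestrating process performs them, for example by looking up connection ids in <strong>show processlist</strong>, is an implementation detail):</p>

<pre><code>-- connection C1: block writes to `tbl`
lock tables tbl write;

-- (migration process: apply the last backlog of events onto `ghost`)

-- connection C18: first verify C1 is still connected (e.g. its id still shows
-- in SHOW PROCESSLIST), then issue the first half of the swap; it blocks on
-- `tbl`, prioritized ahead of the waiting DML:
rename table tbl to tbl_old;

-- connection C19: verify C18's rename shows up in SHOW PROCESSLIST and that C1
-- is still alive, then issue the second half; it queues behind C18:
rename table ghost to tbl;

-- connection C1: release; C18 executes, then C19, then the blocked DML
unlock tables;
</code></pre>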
<p>What just happened? Let's first explain some things:</p>

<ul>
<li>C18's <strong>rename</strong> gets prioritized over the DMLs, even though it came later. That is how MySQL prioritizes queries waiting on metadata-locked tables.</li>
<li>C18 checks that C1 is still alive, but as before, there's always the chance C1 will die just at the wrong time; we're going to address that.</li>
<li>C19 is interested to see that C18 began execution, but potentially C18 will crash by the time C19 actually issues its own <strong>rename</strong>; we're going to address that, too.</li>
<li>C19's query sounds weird. At that time <strong>tbl</strong> still exists. You'd expect it to fail immediately, but it does not. It's valid. This is because <strong>tbl</strong>'s metadata lock is in use.</li>
<li>C19 gets prioritized over all the DMLs, but is known to be behind C18. The two stay in their order of arrival. So C18 is known to execute before C19.</li>
<li>When C1 unlocks, C18 executes first.</li>
<li>The metadata lock is still in place on <strong>tbl</strong>, even though the table doesn't actually exist anymore, because of C19.</li>
<li>C19 operates next.</li>
<li>Finally, all the DMLs execute.</li>
</ul>

<p>What happens on failures?</p>

<ul>
<li>If C1 dies just as C18 is about to issue the <strong>rename</strong>, we get an outage: <strong>tbl</strong> is renamed to <strong>tbl_old</strong>, and the queries get released and complain that the table is just not there.
<ul>
<li>C19 will not initiate, because it runs <strong>after</strong> C18 and checks that C1 is alive, which turns out to be untrue.</li>
<li>So we <strong>know</strong> we have an outage, and we quickly <strong>rename tbl_old to tbl;</strong> and go drink coffee, then begin it all again.</li>
<li>The outage is unfortunate, but does not put our data in danger.</li>
</ul>
</li>
<li>If C1 happens to die just as C19 is about to issue its <strong>rename</strong>, there is no data integrity problem: at this point we've already asserted the tables are in sync. As C1 dies, C18 will immediately rename <strong>tbl</strong> to <strong>tbl_old</strong>. An outage will occur, but not for long, because C19's <strong>rename table ghost to tbl</strong> will execute next and close the gap. We suffer a minor outage, but no rollback. We roll forward.</li>
<li>If C18 happens to die just as C19 is about to issue its <strong>rename</strong>, nothing bad happens: C19 keeps blocking for as long as C1 is running. We find out C18 died, and release C1. C19 attempts to rename <strong>ghost</strong> onto <strong>tbl</strong>, but <strong>tbl</strong> still exists and the query fails. The metadata lock is released and all the queries resume operation on the original <strong>tbl</strong>. The cut-over attempt failed, but caused no harm. We will need to try the entire cycle again.</li>
<li>If both C1 and C18 fail at the time C19 is about to begin its <strong>rename</strong>: same as above.</li>
<li>If C18 fails while C19's <strong>rename</strong> is already in place: same as above.</li>
<li>If C1 fails while C19's <strong>rename</strong> is already in place, it's as good as having C1 issue the <strong>unlock tables</strong>. We're happy.</li>
<li>If C19 fails at any given point, we suffer an outage. We revert by <code>rename tbl_old to tbl</code>.</li>
</ul>

<p>This solution relies on the notion that if a previous connection failed, we would not be able to <strong>rename ghost to tbl</strong>, because the table would still be there. That's what we were looking for; but instead of relying on locks, which get released when a connection terminates, we rely on a persistent entity: a table.</p>
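<p>That property is easy to see in isolation. A hypothetical failure-path sketch follows (the exact error wording varies by MySQL version):</p>

<pre><code>-- connection C19, failure path: C18 (and/or C1) died, so `tbl` was never renamed away.
-- The swap is then simply refused, and no data is touched:
rename table ghost to tbl;
-- => fails with a "Table 'tbl' already exists" kind of error

-- operator / orchestrator, outage path: `tbl` was renamed to `tbl_old` but the swap
-- never completed; put the original table back, then retry the whole cycle later:
rename table tbl_old to tbl;
</code></pre>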
That&#8217;s what we were looking for; but instead of looking at locks, which get released when a connection terminates, we used a persistent entity: a table.<\/p>\n<h3>Conclusion<\/h3>\n<p>The algorithm above is <strong>optimistic<\/strong>: if no connections get weirdly killed, it&#8217;s a valid locking solution, and queries &amp; app are unaware that anything happened (granted, app will notice write latency). If connections do get weirdly killed, we get table-outage at worst case &#8212; an outage that is already considered to be a valid solution anyhow.\u00a0The algorithm\u00a0will not allow data corruption.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Following up and improving on\u00a0Solving the Facebook-OSC non-atomic table swap problem, we present a better, safe solution. Quick, quickest recap: We are working on a triggerless online schema migration solution. It is based on an asynchronous approach, similarly to the FB osc and as opposed to the synchronous solution as used by\u00a0pt-online-schema-change. We asynchronously synchronize [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"enabled":false},"version":2}},"categories":[5],"tags":[57,121],"class_list":["post-7567","post","type-post","status-publish","format-standard","hentry","category-mysql","tag-open-source","tag-operations"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p2bZZp-1Y3","_links":{"self":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/7567","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/comments?post=7567"}],"version-history":[{"count":14,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/7567\/revisions"}],"predecessor-version":[{"id":7582,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/7567\/revisions\/7582"}],"wp:attachment":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/media?parent=7567"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/categories?post=7567"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/tags?post=7567"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}