{"id":7046,"date":"2014-10-23T12:37:17","date_gmt":"2014-10-23T10:37:17","guid":{"rendered":"http:\/\/code.openark.org\/blog\/?p=7046"},"modified":"2014-10-23T12:37:17","modified_gmt":"2014-10-23T10:37:17","slug":"refactoring-replication-topology-with-pseudo-gtid","status":"publish","type":"post","link":"https:\/\/code.openark.org\/blog\/mysql\/refactoring-replication-topology-with-pseudo-gtid","title":{"rendered":"Refactoring replication topology with Pseudo GTID"},"content":{"rendered":"<p>This post describes in detail the method of using\u00a0Pseudo GTID to achieve unplanned replication topology changes, i.e. connecting two arbitrary slaves, or recovering from a master failure even as all its slaves are hanging in different positions.<\/p>\n<p>Please read <a href=\"http:\/\/code.openark.org\/blog\/mysql\/pseudo-gtid\">Pseudo GTID<\/a> and <a href=\"http:\/\/code.openark.org\/blog\/mysql\/pseudo-gtid-row-based-replication\">Pseudo GTID, RBR<\/a> as introduction.<\/p>\n<p>Consider the following case: the master dies unexpectedly, and its three slaves are all hanging, not necessarily at the same binary log file\/position (the network broke down while some slaves managed to salvage more entries into their relay logs than others).<\/p>\n<blockquote><p><a href=\"http:\/\/code.openark.org\/blog\/wp-content\/uploads\/2014\/10\/orchestrator-failed-master.jpeg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-7059\" src=\"http:\/\/code.openark.org\/blog\/wp-content\/uploads\/2014\/10\/orchestrator-failed-master.jpeg\" alt=\"orchestrator-failed-master\" width=\"801\" height=\"365\" srcset=\"https:\/\/code.openark.org\/blog\/wp-content\/uploads\/2014\/10\/orchestrator-failed-master.jpeg 801w, https:\/\/code.openark.org\/blog\/wp-content\/uploads\/2014\/10\/orchestrator-failed-master-300x136.jpeg 300w\" sizes=\"auto, (max-width: 801px) 100vw, 801px\" \/><\/a><\/p><\/blockquote>\n<p>(Did you notice the <strong>&#8220;Candidate for master&#8221;<\/strong> 
message? To be discussed shortly)<\/p>\n<h4>GTID<\/h4>\n<p>With GTID each transaction (and entry in the binary log) is associated with a unique mark &#8212; the Global Transaction ID. Just pick the slave with the most advanced GTID to be the next master, issue <strong>CHANGE MASTER TO MASTER_HOST=&#8217;chosen_slave&#8217;<\/strong> on the other slaves, and everything magically works. A slave knows which GTID it has already processed, and can look up that entry in its master&#8217;s binary logs, resuming replication at the one that follows.<\/p>\n<p>How does that work? The master&#8217;s binary logs are searched for that GTID entry. I&#8217;m not sure how brute-force this search is: I&#8217;m aware of a subtlety that requires a brute-force scan of all binary logs, but I don&#8217;t know whether that is always the case.<\/p>\n<h4>Pseudo GTID<\/h4>\n<p>We can mimic the above, but our solution can&#8217;t be as fine grained. With the injection of Pseudo GTID we mark the binary log with unique entries. But instead of having a unique identifier for every entry, we have a unique identifier for every second, 10 seconds, or what have you, with otherwise normal, non-unique entries in between our Pseudo GTID entries.<\/p>\n<h4>Recognizing which slave is more up to date<\/h4>\n<p>Given two slaves, which is more up to date?<\/p>\n<ul>\n<li>If both replicate(d) from the same master, a <strong>SHOW SLAVE STATUS<\/strong> comparison answers the question (safe method: wait till the SQL thread catches up with the broken IO thread, then compare <strong>relay_master_log_file<\/strong>, <strong>exec_master_log_pos<\/strong> on both machines). 
This is the method by which the above &#8220;Candidate for master&#8221; message is produced.<\/li>\n<li>If one is\/was a descendant of the other, then obviously it is less advanced than, or as advanced as, its ancestor.<\/li>\n<li>Otherwise we&#8217;re unsure &#8211; still solvable via bi-directional trial &amp; error, as explained later on.<\/li>\n<\/ul>\n<p>For now, let&#8217;s assume we know which slave is more up to date (has received and executed more relay logs). Let&#8217;s call it <strong>S1<\/strong>, whereas the less up-to-date will be <strong>S2<\/strong>. This will make our discussion simpler.<!--more--><\/p>\n<h4>Prerequisites<\/h4>\n<ul>\n<li>We require Pseudo GTID in place: a periodic injection of a known-to-be-unique query, one we know how to intercept<\/li>\n<li>We require <strong>log_slave_updates<\/strong> on all slaves. We will need to parse the binary logs on the slaves. I have little trust in the availability of relay logs: these are flushed, rotated and auto-erased all too quickly. The proposed solution does not require any daemon running on the MySQL servers themselves, so there will be nothing to back up the relay logs, and I can&#8217;t trust these to exist. Binary logs, on the other hand, have an expiry period measured in days, and so I can trust them to exist for the duration of a few minutes.<\/li>\n<li>Normal replication. Not multi-threaded. Not multi-source.<\/li>\n<\/ul>\n<h4>The process of rematching slaves<\/h4>\n<p><strong>S1<\/strong> is more up to date, hence we want to make <strong>S2<\/strong> a slave of <strong>S1<\/strong>. We expect the statements\/entries found in <strong>S2<\/strong>&#8216;s binary logs to exist in <strong>S1<\/strong>, in the same order, but somewhere back in the past, padded by additional entries (zero or more) that are not found in <strong>S2<\/strong>. Steps are:<\/p>\n<ul>\n<li>Find the latest Pseudo-GTID entry in <strong>S2<\/strong>&#8216;s logs. 
This can be done by iterating <strong>S2<\/strong>&#8216;s binary logs newest to oldest. The first (time DESC) binary log where such an entry is found is then searched for its last (latest) such entry. Keep a record of the binlog <strong>file2:pos2<\/strong> coordinates.<\/li>\n<li>Take note of the exact entry made in the above. This is the unique value.<\/li>\n<li>Search for said unique value in <strong>S1<\/strong>&#8216;s binary logs. Since it is unique, your method of search is arbitrary; you just need to find it. Brute-force-wise, you start looking at the newest binary log, moving back in time. Not found? Unlikely, since this would mean the lag difference between <strong>S1<\/strong> and <strong>S2<\/strong> is as long as the binlog expiry. We will be handling failures with immediate action; we can expect slave lags in the seconds or in the minutes &#8211; we don&#8217;t even consider the possibility that the entry is not found.<\/li>\n<li>Take note of the coordinates <strong>file1:pos1<\/strong> in <strong>S1<\/strong> where we found the unique value.<\/li>\n<li>We now iterate <strong>S2<\/strong>&#8216;s binary logs starting with the Pseudo GTID at <strong>file2:pos2<\/strong>. We expect to find each entry in <strong>S1<\/strong>&#8216;s binary logs, successively, starting at <strong>file1:pos1<\/strong>. We verify the entries on both servers are identical. Exceptions to the above could be:\n<ul>\n<li>Meta-entries (start-of-log, end-of-log, shutdown), in which case we skip to the next entry (this is done on both <strong>S1<\/strong> and <strong>S2<\/strong>)<\/li>\n<li>Local statements executed directly on either <strong>S1<\/strong> or <strong>S2<\/strong>, such as <strong>ANALYZE TABLE<\/strong> or whatever, which make no impact on data &#8212; we may skip these<\/li>\n<li>Local, evil statements executed directly on the slaves, which do impact data (<strong>INSERT<\/strong>, <strong>DELETE<\/strong>, &#8230;). 
We choose to fail the operation in such a case<\/li>\n<\/ul>\n<\/li>\n<li>After all entries in <strong>S2<\/strong> (matched by entries in <strong>S1<\/strong>) are iterated, our <strong>S1<\/strong> &#8220;cursor&#8221; now looks at the first statement that never made it to <strong>S2<\/strong>. This is <strong>file_win:pos_win<\/strong>, into which we will point <strong>S2<\/strong>.<\/li>\n<li>Or, we might find that upon iterating all entries in <strong>S2<\/strong> we have exactly reached the end of binlog entries for <strong>S1<\/strong>: this means both <strong>S1<\/strong> and <strong>S2<\/strong> are actually in an identical state. We point <strong>S2<\/strong> into <strong>S1<\/strong>&#8216;s next-binlog-position.\n<ul>\n<li>This is in fact no different than the previous case, but of particular interest.<\/li>\n<\/ul>\n<\/li>\n<li>Or, we might run out of entries in <strong>S1<\/strong>. No, we can&#8217;t, because our assumption was that <strong>S1<\/strong> is more advanced than (or equally advanced as) <strong>S2<\/strong>. But this answers the question: &#8220;<strong>what if we didn&#8217;t know in advance who&#8217;s more advanced?<\/strong>&#8221; (no pun intended). In such a case we conclude <strong>S2<\/strong> is actually more advanced than <strong>S1<\/strong>, and we can try the other way around.<\/li>\n<\/ul>\n<p>That last bullet is of importance: if you have two slaves whose &#8220;family connection&#8221; is complex, you can still match one below the other; you may try one way and fail, then try the other way around and succeed.<\/p>\n<p>Comparison of the events following the Pseudo-GTID is a good way of sanity checking (some meta-stuff should be ignored, like transaction IDs and table IDs, which can vary across servers), and builds up confidence in the correctness of the operation.<\/p>\n<p>The codebase is actually complete and pushed; I&#8217;ll release a BETA version of <a href=\"https:\/\/github.com\/outbrain\/orchestrator\">orchestrator<\/a> next week that supports Pseudo GTID. 
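<\/p>\n<p>The rematching walk described above can be sketched as a small, self-contained simulation. To be clear, this is not orchestrator&#8217;s actual code: binary logs are modeled as flat in-memory lists of entry strings, and the Pseudo-GTID marker shape and helper names are made up for illustration.<\/p>\n

```python
# Toy simulation of the Pseudo-GTID rematching walk. Binlogs are flat lists
# of entry strings; the marker prefix below is hypothetical.
PSEUDO_GTID_PREFIX = 'drop view if exists _pseudo_gtid_'

def last_pseudo_gtid(binlog):
    # Scan newest-to-oldest for the latest Pseudo-GTID entry; return its index.
    for pos in range(len(binlog) - 1, -1, -1):
        if binlog[pos].startswith(PSEUDO_GTID_PREFIX):
            return pos
    return None

def rematch_position(s1_binlog, s2_binlog):
    # Return the index in s1_binlog of the first entry that never made it to
    # S2 (the file_win:pos_win analogue), or None when S2 turns out to be more
    # advanced than S1 (the caller may then retry with the arguments swapped).
    pos2 = last_pseudo_gtid(s2_binlog)  # file2:pos2 analogue
    if pos2 is None:
        raise ValueError('no Pseudo-GTID entry found in S2')
    unique_value = s2_binlog[pos2]
    if unique_value not in s1_binlog:
        raise ValueError('Pseudo-GTID entry not found in S1')
    pos1 = s1_binlog.index(unique_value)  # file1:pos1 analogue; value is unique
    # Walk S2's entries past the Pseudo-GTID, matching them against S1's.
    for offset in range(1, len(s2_binlog) - pos2):
        if pos1 + offset >= len(s1_binlog):
            return None  # ran out of S1 entries: S2 is more advanced
        if s1_binlog[pos1 + offset] != s2_binlog[pos2 + offset]:
            raise ValueError('entries diverge; local writes on a slave?')
    return pos1 + len(s2_binlog) - pos2  # final position of the S1 cursor
```

<p>In practice the entries would come from parsing the slaves&#8217; binary logs (hence the <strong>log_slave_updates<\/strong> requirement), and meta-entries would be skipped rather than compared.<\/p>\n<p>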
Let me tell you, doing this kind of crazy stuff with visual feedback (of course command line is available) is very very cool.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post describes in detail the method of using\u00a0Pseudo GTID to achieve unplanned replication topology changes, i.e. connecting two arbitrary slaves, or recovering from a master failure even as all its slaves are hanging in different positions. Please read Pseudo GTID and Pseudo GTID, RBR as introduction. Consider the following case: the master dies unexpectedly, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"enabled":false},"version":2}},"categories":[5],"tags":[113,108,115,8],"class_list":["post-7046","post","type-post","status-publish","format-standard","hentry","category-mysql","tag-gtid","tag-orchestrator","tag-pseudo-gtid","tag-replication"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p2bZZp-1PE","_links":{"self":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/7046","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2
\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/comments?post=7046"}],"version-history":[{"count":15,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/7046\/revisions"}],"predecessor-version":[{"id":7113,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/7046\/revisions\/7113"}],"wp:attachment":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/media?parent=7046"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/categories?post=7046"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/tags?post=7046"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}