{"id":7833,"date":"2018-01-29T11:40:05","date_gmt":"2018-01-29T09:40:05","guid":{"rendered":"http:\/\/code.openark.org\/blog\/?p=7833"},"modified":"2018-01-29T23:02:20","modified_gmt":"2018-01-29T21:02:20","slug":"orchestrator-3-0-6-faster-crash-detection-recoveries-auto-pseudo-gtid-semi-sync-and-more","status":"publish","type":"post","link":"https:\/\/code.openark.org\/blog\/mysql\/orchestrator-3-0-6-faster-crash-detection-recoveries-auto-pseudo-gtid-semi-sync-and-more","title":{"rendered":"orchestrator 3.0.6: faster crash detection &#038; recoveries, auto Pseudo-GTID, semi-sync and more"},"content":{"rendered":"<p><code>orchestrator<\/code> <a href=\"https:\/\/github.com\/github\/orchestrator\/releases\/tag\/v3.0.6\"><strong>3.0.6<\/strong> is released<\/a> and includes some exciting improvements and features. It quickly follows up on <a href=\"https:\/\/github.com\/github\/orchestrator\/releases\/tag\/v3.0.5\"><strong>3.0.5<\/strong><\/a> released recently, and this post gives a breakdown of some notable changes:<\/p>\n<h3>Faster failure detection<\/h3>\n<p>Recall that <code>orchestrator<\/code> uses a holistic approach for <a href=\"https:\/\/github.com\/github\/orchestrator\/blob\/master\/docs\/failure-detection.md#failure-detection\">failure detection<\/a>: it reads state not only from the failed server (e.g. master) but also from its replicas. <code>orchestrator<\/code> now detects failure faster than before:<\/p>\n<ul>\n<li>A detection cycle has been eliminated, leading to quicker resolution of a failure. On our setup, where we poll servers every <code>5sec<\/code>, failure detection time dropped from <code>7-10sec<\/code> to <code>3-5sec<\/code>, <em>keeping reliability<\/em>. 
The reduction in time does not lead to increased false positives.<br \/>\nSide note: you may see more not-quite-failure analysis such as &#8220;I can&#8217;t see the master&#8221; (<code>UnreachableMaster<\/code>).<\/li>\n<li>Better handling of network scenarios where packets are dropped. Instead of hanging until the TCP timeout, <code>orchestrator<\/code> now observes server discovery asynchronously. We have <a href=\"https:\/\/githubengineering.com\/mysql-testing-automation-at-github\/#failovers\">specialized failover tests<\/a> that simulate dropped packets. The change reduces detection time by some <code>5sec<\/code>.<\/li>\n<\/ul>\n<h3>Faster master recoveries<\/h3>\n<p>Promoting a new master is a complex task which attempts to promote the best replica out of the pool of replicas. It&#8217;s not always the most up-to-date replica. The choice varies depending on replica configuration, version, and state.<\/p>\n<p>With recent changes, <code>orchestrator<\/code> is able to recognize, early on, that the replica it would like to promote as master is <em>ideal<\/em>. Assuming that is the case, <code>orchestrator<\/code> is able to immediately promote it (i.e. run hooks, set <code>read_only=0<\/code> etc.), and run the rest of the failover logic, i.e. the rewiring of replicas under the newly promoted master, asynchronously.<\/p>\n<p>This allows the promoted server to take writes sooner, even while its replicas are not yet connected. It also means external hooks are executed sooner.<\/p>\n<p>Between faster detection and faster recoveries, we&#8217;re looking at some <code>10sec<\/code> reduction in overall recovery time: from the moment of the crash to the moment a new master accepts writes. We now stand at <code>&lt; 20sec<\/code> in almost all cases, and <code>&lt; 15sec<\/code> in optimal cases. 
Those times are measured on our failover tests.<\/p>\n<p>We are working on reducing failover time unrelated to <code>orchestrator<\/code> and hope to update soon.<\/p>\n<h3>Automated Pseudo-GTID<\/h3>\n<p>As a reminder, Pseudo-GTID is an alternative to GTID, without the kind of commitment you make with GTID. It provides behavior similar to the &#8220;point your replica under any other server&#8221; behavior GTID allows.<!--more--><\/p>\n<p>There are still <em>many<\/em> setups out there where GTID is not (yet?) deployed and enabled. However, Pseudo-GTID is often misunderstood, and though I&#8217;ve blogged about and presented Pseudo-GTID many times in the past, I still find myself explaining to people that the setup is simple and does not involve changes to one&#8217;s topologies.<\/p>\n<p>Well, it just got simpler. <code>orchestrator<\/code> is now able to automatically inject Pseudo-GTID for you.<\/p>\n<p>Say the word: <code>\"AutoPseudoGTID\": true<\/code>, grant <a href=\"https:\/\/github.com\/github\/orchestrator\/blob\/master\/docs\/configuration-discovery-pseudo-gtid.md#automated-pseudo-gtid-injection\">the necessary privilege<\/a>, and your non-GTID topology is suddenly supercharged with magical Pseudo-GTID tokens that provide you with:<\/p>\n<ul>\n<li>Arbitrary relocation of replicas<\/li>\n<li>Automated or manual failovers (masters <em>and<\/em> intermediate masters)<\/li>\n<li>Vendor freedom: runs on Oracle MySQL, Percona Server, MariaDB, or all of the above at the very same time.<\/li>\n<li>Version freedom (still on <code>5.5<\/code>? No problem. Oh, this gets you crash-safe replication as an extra bonus, too)<\/li>\n<\/ul>\n<p>Auto-Pseudo-GTID further simplifies the infrastructure in that you no longer need to take care of injecting Pseudo-GTID onto the master, nor handle master identity changes. No more <code>event_scheduler<\/code> to enable\/disable nor services to <code>start\/stop<\/code>.<\/p>\n<p>More and more setups are moving to GTID. We may, too! 
But I find it peculiar that Pseudo-GTID was suggested <code>4<\/code> years ago, when <code>5.6<\/code> GTID was already released, and still many setups are not yet running GTID. If you&#8217;re not using GTID, please try Pseudo-GTID! <a href=\"https:\/\/github.com\/github\/orchestrator\/blob\/master\/docs\/pseudo-gtid.md\">Read more<\/a>.<\/p>\n<h3>Semi-sync support<\/h3>\n<p>Semi-sync has been internally supported via a specialized patch contributed by Vitess, to flag a server as semi-sync-able and handle enablement of semi-sync upon master failover.<\/p>\n<p><code>orchestrator<\/code> now supports semi-sync more generically. You may use <code>orchestrator<\/code> to enable or disable semi-sync on the master or replica side, via the <code>orchestrator -c enable-semi-sync-master<\/code>, <code>orchestrator -c enable-semi-sync-replica<\/code>, <code>orchestrator -c disable-semi-sync-master<\/code>, <code>orchestrator -c disable-semi-sync-replica<\/code> commands (or API equivalent).<\/p>\n<p>The API will also tell you whether semi-sync is enabled on instances. Note that configured != enabled. A server can be configured with <code>rpl_semi_sync_master_enabled=ON<\/code>, but if no semi-sync replicas are found, the <code>Rpl_semi_sync_master_status<\/code> state is <code>OFF<\/code>.<\/p>\n<h3>More<\/h3>\n<p>UI changes, removal of prepared statements, documentation updates, raft updates&#8230;<\/p>\n<p><a href=\"https:\/\/github.com\/github\/orchestrator\"><code>orchestrator<\/code><\/a> is free and open source and released under the Apache 2 license. It is authored at and used by GitHub.<\/p>\n<p>I&#8217;ll be presenting <code>orchestrator\/raft<\/code> at <a href=\"https:\/\/fosdem.org\/2018\/schedule\/event\/orchestrator_raft\/\">FOSDEM next week<\/a>, in the MySQL and Friends Room.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>orchestrator 3.0.6 is released and includes some exciting improvements and features. 
It quickly follows up on 3.0.5 released recently, and this post gives a breakdown of some notable changes: Faster failure detection Recall that orchestrator uses a holistic approach for failure detection: it reads state not only from the failed server (e.g. master) but also [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"enabled":false},"version":2}},"categories":[5],"tags":[62,57,108,8],"class_list":["post-7833","post","type-post","status-publish","format-standard","hentry","category-mysql","tag-high-availability","tag-open-source","tag-orchestrator","tag-replication"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p2bZZp-22l","_links":{"self":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/7833","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/comments?post=7833"}],"version-history":[{"count":13,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/7833\/revisions"}],"predecessor-version":[{"id":7846,"href":"https:\/\/code
.openark.org\/blog\/wp-json\/wp\/v2\/posts\/7833\/revisions\/7846"}],"wp:attachment":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/media?parent=7833"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/categories?post=7833"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/tags?post=7833"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}