{"id":7740,"date":"2017-08-03T10:41:11","date_gmt":"2017-08-03T08:41:11","guid":{"rendered":"http:\/\/code.openark.org\/blog\/?p=7740"},"modified":"2017-08-03T15:32:59","modified_gmt":"2017-08-03T13:32:59","slug":"orchestratorraft-pre-release-3-0","status":"publish","type":"post","link":"https:\/\/code.openark.org\/blog\/mysql\/orchestratorraft-pre-release-3-0","title":{"rendered":"orchestrator\/raft: Pre-Release 3.0"},"content":{"rendered":"<p><code>orchestrator<\/code> <strong>3.0 Pre-Release<\/strong> is <a href=\"https:\/\/github.com\/github\/orchestrator\/releases\/tag\/v3.0.pre-release\">now available<\/a>. Most notable are <strong>Raft<\/strong> consensus, <strong>SQLite<\/strong> backend support, <strong>orchestrator-client<\/strong> no-binary-required client script.<\/p>\n<h3>TL;DR<\/h3>\n<p>You may now set up high availability for <code>orchestrator<\/code> via <code>raft<\/code> consensus, without need to set up high availability for <code>orchestrator<\/code>&#8216;s backend MySQL servers (such as Galera\/InnoDB Cluster). In fact, you can run a <code>orchestrator\/raft<\/code> setup using embedded <code>SQLite<\/code> backend DB. Read on.<\/p>\n<p><code>orchestrator<\/code> still supports the existing shared backend DB paradigm; nothing dramatic changes if you upgrade to <strong>3.0<\/strong> and do not configure <code>raft<\/code>.<\/p>\n<h3>orchestrator\/raft<\/h3>\n<p><a href=\"https:\/\/raft.github.io\/\">Raft<\/a> is a consensus protocol, supporting leader election and consensus across a distributed system.\u00a0 In an <code>orchestrator\/raft<\/code> setup <code>orchestrator<\/code> nodes talk to each other via raft protocol, form consensus and elect a leader. Each <code>orchestrator<\/code> node has its own <em>dedicated<\/em> backend database. The backend databases do not speak to each other; only the <code>orchestrator<\/code> nodes speak to each other.<\/p>\n<p>No MySQL replication setup needed; the backend DBs act as standalone servers. In fact, the backend server doesn&#8217;t have to be MySQL, and <code>SQLite<\/code> is supported. <code>orchestrator<\/code> now ships with <code>SQLite<\/code> embedded, no external dependency needed.<!--more--><\/p>\n<blockquote><p><a href=\"http:\/\/code.openark.org\/blog\/wp-content\/uploads\/2017\/08\/orchestrator-ha-raft.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-7743\" src=\"http:\/\/code.openark.org\/blog\/wp-content\/uploads\/2017\/08\/orchestrator-ha-raft.png\" alt=\"\" width=\"824\" height=\"326\" srcset=\"https:\/\/code.openark.org\/blog\/wp-content\/uploads\/2017\/08\/orchestrator-ha-raft.png 824w, https:\/\/code.openark.org\/blog\/wp-content\/uploads\/2017\/08\/orchestrator-ha-raft-300x119.png 300w, https:\/\/code.openark.org\/blog\/wp-content\/uploads\/2017\/08\/orchestrator-ha-raft-768x304.png 768w\" sizes=\"auto, (max-width: 824px) 100vw, 824px\" \/><\/a><\/p><\/blockquote>\n<p>In a <code>orchestrator\/raft<\/code> setup, all <code>orchestrator<\/code> nodes talk to each other. One and only one is elected as <em>leader<\/em>. To become a leader a node must be part of a <em>quorum<\/em>. On a <code>3<\/code> node setup, it takes <code>2<\/code> nodes to form a quorum. On a <code>5<\/code> node setup, it takes <code>3<\/code> nodes to form a quorum.<\/p>\n<p>Only the leader will run failovers. This much is similar to the existing shared-backend DB setup. 
<p>Only the leader will run failovers. This much is similar to the existing shared-backend DB setup. However, in an <code>orchestrator/raft</code> setup each node is independent, and each <code>orchestrator</code> node <em>runs discoveries</em>. This means a MySQL server in your topology will be routinely visited and probed not by one <code>orchestrator</code> node, but by all <code>3</code> (or <code>5</code>, or what have you) nodes in your raft cluster.</p>

<p>Any communication to <code>orchestrator</code> must take place through the leader. One may not tamper directly with the backend DBs anymore, since the <code>leader</code> is the one authoritative entity to replicate and announce changes to its peer nodes. See the <strong>orchestrator-client</strong> section below.</p>

<p>For details, please refer to the documentation:</p>
<ul>
<li><a href="https://github.com/github/orchestrator/blob/master/docs/raft.md">orchestrator/raft: overview</a></li>
<li><a href="https://github.com/github/orchestrator/blob/master/docs/raft-vs-sync-repl.md">orchestrator/raft vs. shared backend DB setup: a comparison</a></li>
</ul>

<p>The <code>orchestrator/raft</code> setup solves several issues, the most obvious being high availability for the <code>orchestrator</code> service: in a <code>3</code> node setup, any single <code>orchestrator</code> node can go down and <code>orchestrator</code> will reliably continue probing, detecting failures and recovering from failures.</p>
<ul>
<li>See all <a href="https://github.com/github/orchestrator/blob/master/docs/high-availability.md">orchestrator high availability solutions</a></li>
</ul>

<p>Another issue solved by <code>orchestrator/raft</code> is network isolation, particularly cross-DC, also referred to as <em>fencing</em>. Some visualization will help describe the issue.</p>

<p>Consider this 3 data-center replication setup. The master, along with a few replicas, resides in <strong>DC1</strong>. Two additional DCs have intermediate masters, aka local masters, that relay replication to local replicas.</p>

<blockquote><p><a href="https://code.openark.org/blog/wp-content/uploads/2017/08/raft-3dc-1.png"><img src="https://code.openark.org/blog/wp-content/uploads/2017/08/raft-3dc-1.png" alt="3 data-center replication setup"></a></p></blockquote>

<p>We place <code>3</code> <code>orchestrator</code> nodes in a <code>raft</code> setup, each in a different DC. Note that traffic between <code>orchestrator</code> nodes is very low, and cross-DC latencies still conveniently support the <code>raft</code> communication.</p>
<p>Also note that backend DB writes have nothing to do with cross-DC traffic and are unaffected by those latencies.</p>

<p>Consider what happens if DC1 gets network isolated: no traffic in or out of DC1.</p>

<blockquote><p><a href="https://code.openark.org/blog/wp-content/uploads/2017/08/raft-3dc-dc1-isolated.png"><img src="https://code.openark.org/blog/wp-content/uploads/2017/08/raft-3dc-dc1-isolated.png" alt="DC1 network isolated"></a></p></blockquote>

<p>Each <code>orchestrator</code> node operates independently, and each will see a different state. DC1's <code>orchestrator</code> will see all servers in DC2 and DC3 as dead, but will figure the master itself is fine, along with its local DC1 replicas:</p>

<blockquote><p><a href="https://code.openark.org/blog/wp-content/uploads/2017/08/raft-3dc-dc1-isolated-dc1-view.png"><img src="https://code.openark.org/blog/wp-content/uploads/2017/08/raft-3dc-dc1-isolated-dc1-view.png" alt="DC1's view: DC2 and DC3 appear dead"></a></p></blockquote>

<p>However, the <code>orchestrator</code> nodes in DC2 and DC3 will see a different picture: they will see all of DC1's servers as dead, with the local masters in DC2 and DC3 having broken replication:</p>

<blockquote><p><a href="https://code.openark.org/blog/wp-content/uploads/2017/08/raft-3dc-dc1-isolated-dc2-dc3-view.png"><img src="https://code.openark.org/blog/wp-content/uploads/2017/08/raft-3dc-dc1-isolated-dc2-dc3-view.png" alt="DC2/DC3's view: DC1 appears dead"></a></p></blockquote>

<p>Who gets to choose?</p>

<p>In the <code>orchestrator/raft</code> setup, only the leader runs failovers, and the leader must be part of a quorum. Hence the leader will be an <code>orchestrator</code> node in either DC2 or DC3. DC1's <code>orchestrator</code> will <em>know</em> it is isolated and not part of the quorum, hence will step down from leadership (that's the premise of the <code>raft</code> consensus protocol) and will not run recoveries.</p>

<p>There will be no split brain in this scenario. The <code>orchestrator</code> leader, be it in DC2 or DC3, will act to recover and promote a master from within DC2 or DC3.</p>
<p>A possible outcome would be:</p>

<blockquote><p><a href="https://code.openark.org/blog/wp-content/uploads/2017/08/raft-3dc-dc1-isolated-recovered.png"><img src="https://code.openark.org/blog/wp-content/uploads/2017/08/raft-3dc-dc1-isolated-recovered.png" alt="Recovered topology: new master promoted outside DC1"></a></p></blockquote>

<p>What if you only have 2 data centers?</p>

<p>In that case it is advisable to place two <code>orchestrator</code> nodes, one in each of your DCs, and a <em>third</em> <code>orchestrator</code> node as a mediator, in a 3rd DC or in a different availability zone. A cloud offering should do well:</p>

<blockquote><p><a href="https://code.openark.org/blog/wp-content/uploads/2017/08/raft-2dc-mediator.png"><img src="https://code.openark.org/blog/wp-content/uploads/2017/08/raft-2dc-mediator.png" alt="2 DCs with a third, mediator orchestrator node"></a></p></blockquote>

<p>The <code>orchestrator/raft</code> setup plays nice and allows one to <a href="https://github.com/openark/raft/pull/1">nominate</a> a preferred leader.</p>

<h3>SQLite</h3>

<p>Suggested and requested by many was removing <code>orchestrator</code>'s own dependency on a MySQL backend. <code>orchestrator</code> now supports a SQLite backend.</p>

<p><code>SQLite</code> is a transactional, relational, embedded database, and as of <code>3.0</code> it is embedded within <code>orchestrator</code>; no external dependency is required.</p>

<p><code>SQLite</code> doesn't replicate and doesn't support a client/server protocol. As such, it cannot work as a shared database backend. <code>SQLite</code> is only available on:</p>
<ul>
<li>A single-node setup: good for local dev installations, testing servers, CI servers (indeed, <code>SQLite</code> now runs in <code>orchestrator</code>'s CI)</li>
<li>An <code>orchestrator/raft</code> setup, where, as noted above, backend DBs do not communicate with each other in the first place and are each dedicated to their own <code>orchestrator</code> node.</li>
</ul>

<p>It should be pointed out that <code>SQLite</code> is a great transactional database; however, <code>MySQL</code> is more performant. Load on the backend DB is directly (and mostly linearly) affected by the number of probed servers: whether you have <code>50</code> servers in your topologies or <code>500</code> servers matters. The probing frequency, of course, also dictates the write frequency on your backend DB. I would suggest sticking with <code>MySQL</code> if you have thousands of probed servers; if dozens, <code>SQLite</code> should be good to go. In between is a gray zone; in any case, run your own experiments.</p>
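<p>Enabling the SQLite backend is a configuration change. The sketch below follows the keys documented for <code>orchestrator</code>'s SQLite support; treat it as an illustration and verify the key names against your release. The data file path is a placeholder.</p>

<pre><code># Sketch: switch orchestrator's backend from MySQL to embedded SQLite.
# Merge these keys into /etc/orchestrator.conf.json; the file path is a placeholder.
cat > /tmp/orchestrator-sqlite-example.json <<'EOF'
{
  "BackendDB": "sqlite",
  "SQLite3DataFile": "/var/lib/orchestrator/orchestrator.db"
}
EOF
</code></pre>

<p>With <code>BackendDB</code> set to <code>sqlite</code>, MySQL backend credentials become unnecessary; in a raft setup, each node keeps its own data file.</p>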
<p>At this time <code>SQLite</code> is configured to commit to file. There is a different mode of operation in which <code>SQLite</code> keeps data in memory, which makes execution faster but requires occasional dumps to disk for durability. <code>orchestrator</code> may support this mode in the future.</p>

<h3>orchestrator-client</h3>

<p>You install <code>orchestrator</code> as a service on a few boxes; but then how do you access it from other hosts?</p>
<ul>
<li>Either <code>curl</code> the <code>orchestrator</code> API</li>
<li>Or, as most do, install the <code>orchestrator-cli</code> package, which includes the <code>orchestrator</code> binary, everywhere.</li>
</ul>

<p>The latter implies:</p>
<ul>
<li>Having the <code>orchestrator</code> binary installed everywhere, hence updated everywhere.</li>
<li>Having <code>/etc/orchestrator.conf.json</code> deployed everywhere, along with credentials.</li>
</ul>

<p>The <code>orchestrator/raft</code> setup does not support running <code>orchestrator</code> in command-line mode. Reason: in this mode <code>orchestrator</code> talks directly to the shared backend DB. There is no shared backend DB in the <code>orchestrator/raft</code> setup, and all communication must go through the leader service. This is a change of paradigm.</p>

<p>So, back to <code>curl</code>ing the HTTP API. Enter <a href="https://github.com/github/orchestrator/blob/master/docs/orchestrator-client.md"><strong>orchestrator-client</strong></a>, which mimics the command-line interface while running <code>curl | jq</code> requests against the HTTP API. <code>orchestrator-client</code>, however, is just a shell script.</p>

<p><code>orchestrator-client</code> will work well on either <code>orchestrator/raft</code> or your existing non-raft setups. If you like, you may replace your remote <code>orchestrator</code> installations and your <code>/etc/orchestrator.conf.json</code> deployments with this script. You will need to provide the script with a hint: the <code>$ORCHESTRATOR_API</code> environment variable should be set to point to the <code>orchestrator</code> HTTP API.</p>

<p>Here's the fun part (a usage sketch follows the list):</p>
<ul>
<li>Either you have a proxy on top of your <code>orchestrator</code> service cluster, and you <code>export ORCHESTRATOR_API=http://my.orchestrator.service/api</code></li>
<li>Or you provide <code>orchestrator-client</code> with all <code>orchestrator</code> node identities, as in <code>export ORCHESTRATOR_API="https://orchestrator.host1:3000/api https://orchestrator.host2:3000/api https://orchestrator.host3:3000/api"</code>.<br>
<code>orchestrator-client</code> will <strong>figure out the identity of the leader</strong> and forward requests to it. At least scripting-wise, you will not require a proxy.</li>
</ul>
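<p>Here is a brief usage sketch. The hostnames are the placeholders from the examples above, and the specific commands and flags are illustrative; consult the orchestrator-client documentation for the authoritative list.</p>

<pre><code># Point orchestrator-client at all raft members; it figures out the leader.
export ORCHESTRATOR_API="https://orchestrator.host1:3000/api https://orchestrator.host2:3000/api https://orchestrator.host3:3000/api"

# Illustrative commands, mimicking the orchestrator command-line interface:
orchestrator-client -c clusters                           # list known clusters
orchestrator-client -c which-master -i some.replica:3306  # resolve a replica's master

# Or bypass the script and curl the API directly; behind a proxy you can
# ask a node whether it is the raft leader (illustrative endpoint):
curl -s http://my.orchestrator.service/api/leader-check
</code></pre>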
<h3>Status</h3>

<p><code>orchestrator 3.0</code> is a <strong>Pre-Release</strong>. We are running a mostly-passive <code>orchestrator/raft</code> setup in production. It is mostly passive in that it is not in charge of failovers yet; otherwise it probes and analyzes our topologies, and runs failure detection. We will continue to improve operational aspects of the <code>orchestrator/raft</code> setup (see <a href="https://github.com/github/orchestrator/issues/246">this issue</a>).</p>