{"id":7869,"date":"2018-05-14T10:08:32","date_gmt":"2018-05-14T08:08:32","guid":{"rendered":"http:\/\/code.openark.org\/blog\/?p=7869"},"modified":"2018-05-22T10:45:32","modified_gmt":"2018-05-22T08:45:32","slug":"mysql-master-discovery-methods-part-5-service-discovery-proxy","status":"publish","type":"post","link":"https:\/\/code.openark.org\/blog\/mysql\/mysql-master-discovery-methods-part-5-service-discovery-proxy","title":{"rendered":"MySQL master discovery methods, part 5: Service discovery &#038; Proxy"},"content":{"rendered":"<p>This is the fifth in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.<\/p>\n<p>These posts are not concerned with the manner in which replication failure detection and recovery take place. I will share <code>orchestrator<\/code>-specific configuration\/advice, and point out where a cross-DC <code>orchestrator\/raft<\/code> setup plays a part in discovery itself, but for the most part any recovery tool such as <code>MHA<\/code>, <code>replication-manager<\/code>, <code>severalnines<\/code> or others is applicable.<\/p>\n<p>We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera\/XtraDB Cluster\/InnoDB Cluster).<\/p>\n<h3>Master discovery via Service discovery and Proxy<\/h3>\n<p><a href=\"http:\/\/code.openark.org\/blog\/mysql\/mysql-master-discovery-methods-part-4-proxy-heuristics\">Part 4<\/a> presented an anti-pattern setup, where a proxy would infer the identity of the master by drawing conclusions from backend server checks. This led to split brains and undesired scenarios. 
The problem was the loss of context.<\/p>\n<p>We re-introduce a service discovery component (illustrated in <a href=\"http:\/\/code.openark.org\/blog\/mysql\/mysql-master-discovery-methods-part-3-app-service-discovery\">part 3<\/a>), such that:<\/p>\n<ul>\n<li>The app does not own the discovery, and<\/li>\n<li>The proxy behaves in an expected and consistent way.<\/li>\n<\/ul>\n<p>In a failover\/service discovery\/proxy setup, there is clear ownership of duties:<\/p>\n<ul>\n<li>The failover tool owns the failover itself and the master identity change notification.<\/li>\n<li>The service discovery component is the source of truth as to the identity of the master of a cluster.<\/li>\n<li>The proxy routes traffic but does not make routing decisions.<\/li>\n<li>The app only ever connects to a single target, but should allow for a brief outage while failover takes place.<\/li>\n<\/ul>\n<p>Depending on the technologies used, we can further achieve:<\/p>\n<ul>\n<li>A hard cut of connections to the old, demoted master <code>M<\/code>.<\/li>\n<li>Blackholing or holding off incoming queries for the duration of the failover.<\/li>\n<\/ul>\n<p>We explain the setup using the following assumptions and scenarios:<\/p>\n<ul>\n<li>All clients connect to the master via <code>cluster1-writer.example.net<\/code>, which resolves to a proxy box.<\/li>\n<li>We fail over from master <code>M<\/code> to promoted replica <code>R<\/code>.<\/li>\n<\/ul>\n<p><!--more--><\/p>\n<h3>An unplanned failover illustration #1<\/h3>\n<p>Master <code>M<\/code> has died; the box had a power failure. <code>R<\/code> gets promoted in its place. 
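<\/p>\n<p>When <em>Consul<\/em> backs service discovery, the source-of-truth update performed below can amount to a single K\/V write. A hedged sketch (the hostname and port are illustrative assumptions; the key path follows the <code>KVClusterMasterPrefix<\/code> shown in the sample configuration later in this post):<\/p>\n<pre><code class=\"bash\"># hypothetical: record the promoted master's address for cluster1\nconsul kv put mysql\/master\/cluster1 r.example.net:3306\n<\/code><\/pre>\n<p>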
Our recovery tool:<\/p>\n<ul>\n<li>Updates the service discovery component to indicate that <code>R<\/code> is the new master for <code>cluster1<\/code>.<\/li>\n<\/ul>\n<p>The proxy:<\/p>\n<ul>\n<li>Either actively or passively learns that <code>R<\/code> is the new master, and rewires all writes to go to <code>R<\/code>.<\/li>\n<li>If possible, kills existing connections to <code>M<\/code>.<\/li>\n<\/ul>\n<p>The app:<\/p>\n<ul>\n<li>Needs to know nothing. Its connections to <code>M<\/code> fail; it reconnects and gets through to <code>R<\/code>.<\/li>\n<\/ul>\n<h3>An unplanned failover illustration #2<\/h3>\n<p>Master <code>M<\/code> gets network isolated for <code>10<\/code> seconds, during which time we fail over. <code>R<\/code> gets promoted.<\/p>\n<p>Everything is as before.<\/p>\n<p>If the proxy kills existing connections to <code>M<\/code>, then the fact that <code>M<\/code> is back alive becomes meaningless. No one gets through to <code>M<\/code>. Clients were never aware of its identity anyhow, just as they are unaware of <code>R<\/code>&#8217;s identity.<\/p>\n<h3>Planned failover illustration<\/h3>\n<p>We wish to replace the master for maintenance reasons. We successfully and gracefully promote <code>R<\/code>.<\/p>\n<ul>\n<li>In the process of promotion, <code>M<\/code> was turned read-only.<\/li>\n<li>Immediately following promotion, our failover tool updates service discovery.<\/li>\n<li>The proxy reloads, having seen the change in service discovery.<\/li>\n<li>Our app connects to <code>R<\/code>.<\/li>\n<\/ul>\n<h3>Discussion<\/h3>\n<p>This is a setup we use at GitHub in production. 
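<\/p>\n<p>To make the proxy piece concrete, here is a minimal sketch of what the <em>Consul template<\/em> input could look like (the stanza and file names are illustrative assumptions; the K\/V path matches the <code>KVClusterMasterPrefix<\/code> in the sample configuration below):<\/p>\n<pre><code class=\"cfg\"># haproxy.cfg.ctmpl: consul-template renders this into haproxy.cfg\n# whenever the watched K\/V key changes, then reloads haproxy.\nlisten mysql-cluster1-writer\n  bind *:3306\n  mode tcp\n  # the promoted master's address, as written by the failover tool\n  server master {{ key \"mysql\/master\/cluster1\" }} check\n<\/code><\/pre>\n<p>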
Our components are:<\/p>\n<ul>\n<li><code>orchestrator<\/code> as the failover tool.<\/li>\n<li><em>Consul<\/em> for service discovery.<\/li>\n<li>GLB (HAProxy) as the proxy.<\/li>\n<li><em>Consul template<\/em> running on proxy hosts:\n<ul>\n<li>Listens for changes to Consul&#8217;s KV data<\/li>\n<li>Regenerates the <code>haproxy.cfg<\/code> configuration file<\/li>\n<li>Reloads haproxy<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>As mentioned earlier, the apps need not change anything. They connect to a name that always resolves to proxy boxes. There is never a DNS change.<\/p>\n<p>At the time of failover, the service discovery component must be up and available to catch the change; otherwise, we do not strictly require it to be up at all times.<\/p>\n<p>For high availability we will have multiple proxies, each of which must listen for changes to the K\/V store. Ideally the name (<code>cluster1-writer.example.net<\/code> in our example) resolves to any available proxy box.<\/p>\n<ul>\n<li>This, in itself, is a high availability issue. Thankfully, managing the HA of a proxy layer is simpler than that of a MySQL layer. Proxy servers tend to be stateless and equal to each other.<\/li>\n<li>See GLB as one example of a highly available proxy layer. 
Cloud providers, Kubernetes, two-level proxy layers and Linux Heartbeat are all methods that can similarly achieve HA.<\/li>\n<\/ul>\n<p>See also:<\/p>\n<ul>\n<li><a href=\"https:\/\/blog.pythian.com\/mysql-high-availability-with-haproxy-consul-and-orchestrator\/\">MySQL High Availability With HAProxy, Consul And Orchestrator<\/a><\/li>\n<li><a href=\"https:\/\/www.percona.com\/live\/18\/sessions\/automatic-failovers-with-kubernetes-using-orchestrator-proxysql-and-zookeeper\">Automatic Failovers with Kubernetes using Orchestrator, ProxySQL and Zookeeper<\/a><\/li>\n<li><a href=\"https:\/\/www.percona.com\/live\/e17\/sessions\/orchestrating-proxysql-with-orchestrator-and-consul\">Orchestrating ProxySQL with Orchestrator and Consul<\/a><\/li>\n<\/ul>\n<h3>Sample orchestrator configuration<\/h3>\n<p>An <code>orchestrator<\/code> configuration would look like this:<\/p>\n<pre><code class=\"json\">  \"ApplyMySQLPromotionAfterMasterFailover\": true,\n  \"KVClusterMasterPrefix\": \"mysql\/master\",\n  \"ConsulAddress\": \"127.0.0.1:8500\",\n  \"ZkAddress\": \"srv-a,srv-b:12181,srv-c\",\n  \"PostMasterFailoverProcesses\": [\n    \"\/just\/let\/me\/know about failover on {failureCluster}\"\n  ],\n<\/code><\/pre>\n<p>In the above:<\/p>\n<ul>\n<li>If <code>ConsulAddress<\/code> is specified, <code>orchestrator<\/code> will update the given <em>Consul<\/em> setup with K\/V changes.<\/li>\n<li>As of <code>3.0.10<\/code>, <em>ZooKeeper<\/em> (via <code>ZkAddress<\/code>) is not yet supported by <code>orchestrator<\/code>.<\/li>\n<li><code>PostMasterFailoverProcesses<\/code> is shown here just to point out that hooks are not strictly required for the operation to run.<\/li>\n<\/ul>\n<p>See the <a href=\"https:\/\/github.com\/github\/orchestrator\/blob\/master\/docs\/configuration.md\">orchestrator configuration<\/a> documentation.<\/p>\n<h3>All posts in this series<\/h3>\n<ul>\n<li><a href=\"http:\/\/code.openark.org\/blog\/mysql\/mysql-master-discovery-methods-part-1-dns\">MySQL master 
discovery methods, part 1: DNS<\/a><\/li>\n<li><a href=\"http:\/\/code.openark.org\/blog\/mysql\/mysql-master-discovery-methods-part-2-vip-dns\">MySQL master discovery methods, part 2: VIP &amp; DNS<\/a><\/li>\n<li><a href=\"http:\/\/code.openark.org\/blog\/mysql\/mysql-master-discovery-methods-part-3-app-service-discovery\">MySQL master discovery methods, part 3: app &amp; service discovery<\/a><\/li>\n<li><a href=\"http:\/\/code.openark.org\/blog\/mysql\/mysql-master-discovery-methods-part-4-proxy-heuristics\">MySQL master discovery methods, part 4: Proxy heuristics<\/a><\/li>\n<li><a href=\"http:\/\/code.openark.org\/blog\/mysql\/mysql-master-discovery-methods-part-5-service-discovery-proxy\">MySQL master discovery methods, part 5: Service discovery &amp; Proxy<\/a><\/li>\n<li><a href=\"http:\/\/code.openark.org\/blog\/mysql\/mysql-master-discovery-methods-part-6-other-methods\">MySQL master discovery methods, part 6: other methods<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>This is the fifth in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master. 
These posts are not concerned with the manner by [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"enabled":false},"version":2}},"categories":[5],"tags":[62,108,8],"class_list":["post-7869","post","type-post","status-publish","format-standard","hentry","category-mysql","tag-high-availability","tag-orchestrator","tag-replication"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p2bZZp-22V","_links":{"self":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/7869","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/comments?post=7869"}],"version-history":[{"count":7,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/7869\/revisions"}],"predecessor-version":[{"id":7909,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/7869\/revisions\/7909"}],"wp:attachment":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/media?parent=7869"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/code.openar
k.org\/blog\/wp-json\/wp\/v2\/categories?post=7869"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/tags?post=7869"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}