{"id":3625,"date":"2011-05-16T07:31:32","date_gmt":"2011-05-16T05:31:32","guid":{"rendered":"http:\/\/code.openark.org\/blog\/?p=3625"},"modified":"2011-05-16T09:34:10","modified_gmt":"2011-05-16T07:34:10","slug":"problems-with-mmm-for-mysql","status":"publish","type":"post","link":"https:\/\/code.openark.org\/blog\/mysql\/problems-with-mmm-for-mysql","title":{"rendered":"Problems with MMM for MySQL"},"content":{"rendered":"<p>I recently encountered troubling issues with MMM for MySQL deployments, leading me to the decision to cease using it in production.<\/p>\n<p>On the very same day I started writing about it, Baron published <a href=\"http:\/\/www.xaprb.com\/blog\/2011\/05\/04\/whats-wrong-with-mmm\/\">What&#8217;s wrong with MMM?<\/a>. I wish to present the problems I encountered and the reasons I find MMM is flawed. In a period of two weeks, two different deployments presented me with <strong>4<\/strong> crashes, in <strong>3<\/strong> different scenarios.<\/p>\n<p>In all the following scenarios, there is an Active\/Passive Master-Master deployment, with one VIP (virtual IP) set for the <em>writer<\/em> role and one VIP set for the <em>reader<\/em> role.<\/p>\n<h4>Problem #1: unjustified failover, broken replication<\/h4>\n<p>Unjustified failover must be the common scenario. It&#8217;s also a scenario I can live with. A few seconds of downtime are OK with me once in a couple of months.<\/p>\n<p>But on two different installations, a few days apart, I had two seemingly unjustified failovers followed by a troubling issue: replication got broken.<!--more--><\/p>\n<p>How broken? The previously active master, now turned inactive, suddenly changed its master position to a position roughly <strong>10<\/strong> days old. 
I don&#8217;t keep master logs for <strong>10<\/strong> days, so this led to an immediate replication failure.<\/p>\n<p>Now, I cannot directly point my finger at MMM, but:<\/p>\n<ul>\n<li>There was no power failure<\/li>\n<li>The MySQL daemon did not go down<\/li>\n<li>Replication was just fine both ways up to the failover moment<\/li>\n<li>There was no human intervention on this (myself and one more person had access at that time).<\/li>\n<\/ul>\n<p>I know the above since I&#8217;ve got it all monitored. So I can&#8217;t blame it on \u201creplication not syncing the <strong>master.info<\/strong> file to disk when power went down\u201d. I confess, this is very suspicious; but, <em>twice<\/em>, on two different deployments&#8230;<\/p>\n<p>So much for suspicions. Now for the smoking guns.<\/p>\n<h4>Problem #2: hanging master, no failover<\/h4>\n<p>The active master went down. Either a hardware or a software problem caused it to freeze. It was not executing its chores; it became inaccessible by TCP\/IP.<\/p>\n<p>But not just inaccessible: freezing inaccessible. If you were to attempt an SSH connection, the connection would just hang rather than be refused. The SSH client would not terminate in any reasonable time.<\/p>\n<p>Ahem, time to do failover?<\/p>\n<p>Apparently not. Phones start ringing. Emails are sent. Time for manual intervention. But, what does the MMM monitor have to say? Nothing. It&#8217;s frozen. Now, I didn&#8217;t read the source code; I&#8217;m not even competent with Perl; but it <em>seems<\/em> to me like the monitor daemon works single threaded: it attempts to connect to all hosts on the same thread. But, connecting to the active master makes for a hanging connection, so the entire monitor goes down. It was impossible to stop it gracefully; I had to kill it. I had the choice of reconfiguring it to ignore the active master, but decided to try starting it up again. I had to do it twice before it started acting sanely again and realized it was time for failover.<\/p>\n<p>Why is this bad? 
Assuming my analysis is correct, this is a major design flaw. You must never do single-threaded monitoring on a multiple-machine deployment.<\/p>\n<h4>Problem #3: no servers<\/h4>\n<p>System is down! No access to the database. What does the MMM monitor have to say?<\/p>\n<p>Both machines are HARD_OFFLINE.<\/p>\n<p>Ahem. Both machines are up and running; they are both replicating each other. They are both accessible. MySQL is accessible.<\/p>\n<p>But neither machine has any VIP associated with it.<\/p>\n<p>Does it matter whether both agents fail to realize their associated MySQL servers are actually up and running, or whether the monitor fails to receive that information? It does not. MMM should not have removed all VIPs. OK, suppose it believes the two machines are down. So what? Just throw all VIPs on one of the machines. If it&#8217;s inaccessible, then what&#8217;s wrong with that? (assuming the previous problem never existed, that is).<\/p>\n<p>You should always have <em>some<\/em> machine associated with the VIPs.<\/p>\n<h4>Temper down<\/h4>\n<p>None of the above is meant as an offense to the creators of MMM, whom I greatly respect. This isn&#8217;t an easy problem to solve, and it should be obvious there&#8217;s no 100% guaranteed solution. But, for myself, I will no longer be using MMM as it stands right now.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I recently encountered troubling issues with MMM for MySQL deployments, leading me to the decision to cease using it in production. On the very same day I started writing about it, Baron published What&#8217;s wrong with MMM?. I wish to present the problems I encountered and the reasons I find MMM is flawed. 
In a [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"enabled":false},"version":2}},"categories":[5],"tags":[62,8],"class_list":["post-3625","post","type-post","status-publish","format-standard","hentry","category-mysql","tag-high-availability","tag-replication"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p2bZZp-Wt","_links":{"self":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/3625","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/comments?post=3625"}],"version-history":[{"count":6,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/3625\/revisions"}],"predecessor-version":[{"id":3631,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/3625\/revisions\/3631"}],"wp:attachment":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/media?parent=3625"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/categories?post=3625"},{"taxonomy":"p
ost_tag","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/tags?post=3625"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}