Problems with MMM for MySQL

May 16, 2011

I have recently encountered troubling issues with MMM for MySQL deployments, which have led me to the decision to stop using it in production.

On the very same day I started writing this post, Baron published What's wrong with MMM?. I wish to present the problems I encountered and the reasons I find MMM flawed. Within a period of two weeks, two different deployments presented me with four crashes, in three different scenarios.

In all of the following scenarios, there is an Active/Passive Master-Master deployment, with one VIP (virtual IP) set for the writer role and one VIP set for the reader role.
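For reference, such a deployment is typically described in MMM's mmm_common.conf. The following is a minimal sketch, not taken from any of the deployments in question; host names and IP addresses are made up:

```
<host db1>
    ip      192.168.0.11
    mode    master
    peer    db2
</host>

<host db2>
    ip      192.168.0.12
    mode    master
    peer    db1
</host>

<role writer>
    hosts   db1, db2
    ips     192.168.0.100
    mode    exclusive
</role>

<role reader>
    hosts   db1, db2
    ips     192.168.0.101
    mode    balanced
</role>
```

The `exclusive` mode means the writer VIP lives on exactly one host at a time; failover moves it to the peer.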

Problem #1: unjustified failover, broken replication

Unjustified failover must be the most common scenario. It's also a scenario I can live with. A few seconds of downtime once every couple of months are OK with me.

But on two different installations, a few days apart, I had two seemingly unjustified failovers followed by a troubling issue: replication got broken.

How broken? The previously active master, now turned passive, suddenly changed its master position to one roughly 10 days old. I don't keep the master's binary logs for 10 days, so this led to an immediate replication failure.

Now, I cannot directly point my finger at MMM, but:

  • There was no power failure
  • MySQL daemon did not go down
  • Replication was just fine both ways up to the failover moment
  • There was no human intervention (only myself and one more person had access at that time).

I know all of the above because I have it all monitored. So I can't blame it on "replication not syncing the master.info file to disk when power went down". I confess, this is all very suspicious; but it happened twice, on two different deployments...
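To illustrate the symptom, one can compare the position the slave thinks it should be reading from with the binary logs the master actually retains. The statements below are standard MySQL; the scenario they describe matches the failure above:

```sql
-- On the master: which binary logs still exist?
SHOW BINARY LOGS;

-- On the slave: where does it think it should be reading from?
-- Inspect Master_Log_File / Exec_Master_Log_Pos in the output.
SHOW SLAVE STATUS\G

-- If the slave points at a binary log that was already purged
-- (e.g. one from 10 days ago), replication breaks immediately with
-- error 1236: "Could not find first log file name in binary log index file".
```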

So much for suspicions. Now for the smoking guns.

Problem #2: hanging master, no failover

The active master went down. Either a hardware or a software problem caused it to freeze. It stopped executing its chores and became inaccessible via TCP/IP.

But not just inaccessible: freezing inaccessible. An SSH connection attempt would simply hang, not be refused; the SSH client would not terminate in any reasonable time.

Ahem, time to do failover?

Apparently not. Phones start ringing. Emails are sent. Time for manual intervention. But what does the MMM monitor have to say? Nothing. It's frozen.

Now, I haven't read the source code, and I'm not even competent with Perl, but it seems to me that the monitor daemon works single-threaded: it attempts to connect to all hosts on the same thread. Connecting to the active master produced a hanging connection, so the entire monitor hung along with it. It was impossible to stop it gracefully; I had to kill it. I had the option of reconfiguring it to ignore the active master, but decided to try starting it up again. I had to do so twice before it started acting sanely again and realized it was time for a failover.

Why is this bad? Assuming my analysis is correct, this is a major design flaw. You must never do single-threaded monitoring of a multiple-machine deployment.
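To make the point concrete, here is a minimal Python sketch (invented for illustration, not MMM code; host names and check functions are made up) of a monitor that probes each host on its own thread with a timeout, so a single frozen host cannot hang the entire monitor:

```python
import concurrent.futures
import time

def monitor(hosts, checks, timeout=1.0):
    """Probe every host concurrently; a host whose check does not
    answer within `timeout` seconds is marked HARD_OFFLINE instead
    of stalling the whole monitor loop."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(hosts))
    futures = {host: pool.submit(checks[host]) for host in hosts}
    states = {}
    for host, future in futures.items():
        try:
            ok = future.result(timeout=timeout)
            states[host] = "ONLINE" if ok else "HARD_OFFLINE"
        except concurrent.futures.TimeoutError:
            # The check is stuck (e.g. a hanging TCP connection):
            # give up on this host, but keep monitoring the others.
            states[host] = "HARD_OFFLINE"
    pool.shutdown(wait=False)
    return states

# db1 simulates a frozen master: its check hangs for 5 seconds.
checks = {
    "db1": lambda: time.sleep(5) or True,
    "db2": lambda: True,
}
states = monitor(["db1", "db2"], checks, timeout=1.0)
print(states)  # db2 stays ONLINE even though db1's check hangs
```

With per-host timeouts, the frozen active master is correctly reported as dead while the healthy peer keeps being monitored, instead of the whole monitor freezing as described above.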

Problem #3: no servers

System is down! No access to the database. What does the MMM monitor have to say?

Both machines are HARD_OFFLINE.

Ahem. Both machines are up and running; they are both replicating from each other; both are accessible; MySQL is accessible.

But neither machine has any VIP associated with it.

Does it matter whether both agents failed to realize their associated MySQL servers were actually up and running, or whether the monitor failed to receive that information? It does not. MMM should not have removed all the VIPs. OK, suppose it believes both machines are down. So what? Just put all the VIPs on one of the machines. If that machine really is inaccessible, what harm is done? (Assuming the previous problem never existed, that is.)

You should always have some machine associated with the VIPs.
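The rule fits in a few lines. Below is an illustrative Python sketch (names invented, not MMM code) of a writer-role assignment that never leaves the VIP unassigned:

```python
def assign_writer(states, current=None):
    """Pick the host that should hold the writer VIP.

    The key property: this never returns None. Even when every host
    looks HARD_OFFLINE (which may only mean the monitor lost sight
    of them), pulling the VIP from everybody guarantees an outage,
    while leaving it somewhere costs nothing if that host is truly dead.
    """
    online = sorted(host for host, state in states.items() if state == "ONLINE")
    if current in online:
        return current          # stable: no needless failover
    if online:
        return online[0]        # fail over to a live host
    # Everything looks down: keep the current holder, else pick any host.
    return current if current is not None else sorted(states)[0]

# Both hosts look down: the VIP stays on db1 rather than on nobody.
print(assign_writer({"db1": "HARD_OFFLINE", "db2": "HARD_OFFLINE"}, current="db1"))
# Active master down, peer up: fail over to db2.
print(assign_writer({"db1": "HARD_OFFLINE", "db2": "ONLINE"}, current="db1"))
```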

Temper down

None of the above is meant as an offense to the creators of MMM, whom I greatly respect. This isn't an easy problem to solve, and it should be obvious there's no 100% guaranteed solution. But, for myself, I will not be using MMM as it stands right now.

posted in MySQL by shlomi


11 Comments to "Problems with MMM for MySQL"

  1. shinguz wrote:

    Hi shlomi,

    What do you consider as alternative? Active/passive failover clustering with DRBD/Heartbeat? Manual failover? Do nothing?

    Thanks for letting us know your conclusions and solutions.

    Regards,
    Oli

  2. shlomi wrote:

    @Oli,
    I'll be testing Pacemaker in the next few weeks, with manual failover. I'll let you know my conclusions.

    I've had good customer experience with Heartbeat before.

  3. Log Buffer #220, A Carnival of the Vanities for DBAs | The Pythian Blog wrote:

    [...] Shlomi Noach encountered troubling issues with MMM for MySQL deployments, leading me to the decision to cease using it on production, and hence blog about it. [...]

  4. Cédric wrote:

    My own “french” experience about MMM : http://www.mysqlplus.fr/2011/03/pourquoi-mmm-ne-fait-pas-ce-que-jaimerais-quil-fasse/

  5. Patrick wrote:

    Hi Shlomi,

    Thank you for the information - MMM was on my To-Do list for evaluation.

    I have spent the past week trying to get Pacemaker + Corosync configured to support a VIP in front of a Master/Master replication configuration. It works if one of the hosts goes down, but I can't get the VIP to swing over if the "Active" master MySQL db goes down.

    Good luck with your testing of Pacemaker - and hope you will document your efforts ;-)

    Cheers
    Patrick

  6. Steven Roussey wrote:

    https://silverline.librato.com/blog/main/EC2_Users_Should_be_Cautious_When_Booting_Ubuntu_10_04_AMIs

    I haven't heard of many people using MMM on EC2, but just in case, they should look at the above link about process stalls. A MySQL server should have a different scheduler anyhow...

    On another note, what is your experience with tungsten?

  7. shlomi wrote:

    @Steven,
    I did not realize it was possible to use VIPs on Amazon EC2; I thought it was not supported.

    I haven't used tungsten as yet (*ashamed*).

  8. MySQL高可用性大杀器之MHA | 火丁笔记 wrote:

    [...] Speaking of MySQL high availability, many people think of MySQL Cluster, or else Heartbeat+DRBD, but the complexity of those solutions often scares people off. By contrast, achieving high availability through MySQL replication is much easier; the main options at present are MMM, PRM and MHA. MMM is the most common solution, but unfortunately it has too many problems (What's wrong with MMM, Problems with MMM for MySQL); as for PRM, it is still a new project and not yet recommended for production, though as a Percona product it is worth looking forward to; that leaves MHA, which has fortunately been proven a reliable tool through large-scale production use at DeNA. [...]

  9. MySQL高可用性大杀器之MHA-传播、沟通、分享-一直“有你” wrote:

    [...] Speaking of MySQL high availability, many people think of MySQL Cluster, or else Heartbeat+DRBD, but the complexity of those solutions often scares people off. By contrast, achieving high availability through MySQL replication is much easier; the main options at present are MMM, PRM and MHA. MMM is the most common solution, but unfortunately it has too many problems (What's wrong with MMM, Problems with MMM for MySQL); as for PRM, it is still a new project and not yet recommended for production, though as a Percona product it is worth looking forward to; that leaves MHA, which has fortunately been proven a reliable tool through large-scale production use at DeNA. [...]

  10. The State of High Availability | MySQL Fanboy wrote:

    [...] Noach and Brian Aker have public discussed problems with MMM show just how hard this problem is. ( SH / BA ) Even Alexey Kovyrin said in a reply to Brian “…Every time I try to add HA to my [...]

  11. MySQL高可用性大杀器之MHA | OurMySQL | 我们致力于一个MySQL知识的分享网站 wrote:

    [...] Speaking of MySQL high availability, many people think of MySQL Cluster, or else Heartbeat+DRBD, but the complexity of those solutions often scares people off. By contrast, achieving high availability through MySQL replication is much easier; the main options at present are MMM, PRM and MHA. MMM is the most common solution, but unfortunately it has too many problems (What's wrong with MMM, Problems with MMM for MySQL); as for PRM, it is still a new project and not yet recommended for production, though as a Percona product it is worth looking forward to; that leaves MHA, which has fortunately been proven a reliable tool through large-scale production use at DeNA. [...]
