How NOT to test that mysqld is alive

October 1, 2009

I had a call from a new customer last week. They had issues with their MySQL server. Apparently, it was repeatingly crashing, every few hours. To have their production system kept alive, they used a script to periodically see if MySQL was still alive, and if not - start it.

I was first thinking in directions of old installations; wrong memory allocations (too little memory for large content; to large memory allocation for weak machine). When I reached the customer's premises, I quickly reviewed general settings, while explaining some basic configuration guidelines. There were peculiarities (e.g. query_cache_limit being larger than query_cache_size), but nothing to obviously suggest a reason for crash.

I then observed the error log. Took some time to find it, since the log_error parameter appeared twice in the my.cnf file; first one appearing at the very beginning, the second (overwriting the first) was far into the file.

Sure enough, there were a lot of startup messages. And... shutdown messages. In fact, the log looked something like:

090923 17:38:15  mysqld started
090923 17:38:16  InnoDB: Started; log sequence number 0 3707779031
090923 17:38:16 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.0.77-log'  socket: '/tmp/mysql50.sock'  port: 3306  MySQL Community Server (GPL)
090923 19:53:41 [Note] /usr/local/mysql/bin/mysqld: Normal shutdown
090923 19:53:56  mysqld started
090923 19:53:56  InnoDB: Started; log sequence number 0 5288400927
090923 19:53:56 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.0.77-log'  socket: '/tmp/mysql50.sock'  port: 3306  MySQL Community Server (GPL)
090929 22:38:48 [Note] /usr/local/mysql/bin/mysqld: Normal shutdown
090923 22:38:58  mysqld started
090923 22:38:58  InnoDB: Started; log sequence number 0 7102832776
090923 22:38:58 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.0.77-log'  socket: '/tmp/mysql50.sock'  port: 3306  MySQL Community Server (GPL)
...

(The above is just a sample, not the original file).

Well, the log says there's a lot of normal shutdowns.

Looking at the script

Next it was time to look at the script which was supposed to verify MySQL was up and running - else wise start it. And it went something like this:

while [ 1 ];
  if [`ps aux | grep mysqld | wc -l` -lt 2]
    then /etc/init.d/mysql restart
  ...
  sleep 60

The script was looking for all processes, then grepping for mysqld, counting number of lines in output. It expected 2 lines: one for the mysqld process, one for the grep mysqld process itself.

If you don't know what's wrong with this, a very brief explanation about how pipelines work in unix work it at place:

Pipelined processes do not execute one after another, or one before another. They are all executed at once. So, "ps aux | grep mysqld | wc -l" immediately spawns ps, grep, wc, then sets the standard output of one to the standard input of the other (most simplified description I could think of).

It is likely that grep will outrun ps in the race for starting up. So grep is started, already waiting for input from ps, which then notices both mysqld is alive, but also grep mysqld, lists them both along with other processes, to be filtered by grep, to be counted by wc (returning two lines count).

But this is just because ps is heavier than grep. It doesn't have to be like that.

The less common scenario

Every once and again, ps outruns grep in the race for starting up. It would list the active processes, and no "grep" would appear in the listing. grep would then run, filtering the result of ps, then to be counted by wc -- oh, there is only one line now! The script assumes this means mysqld is down (since it assumed grep is always there), and restarts mysqld.

So, the script which was supposed to take care of MySQL crashes, was actually causing them (though no crash occurred).

Better ways to test that MySQL is alive

Look for the pid file. This is the standard (that's how the mysql service works). Look for the unix socket file. Both require that you parse the my.cnf file to learn where these files are.

If you're reluctant to read the configuration files, other options are at hand. OK, look for the process; but use pgrep mysqld. No need to list all processes, then grep them.

And best way, that will set your mind at ease even if you're worried that "mysql is running but not responding; it is stuck": connect to MySQL and issue SELECT 1, SELECT NOW(), SELECT something. This would be the ultimate test that MySQL is up, listening, responding and valid.

tags: ,
posted in MySQL by shlomi

« | »

Follow comments via the RSS Feed | Leave a comment | Trackback URL

61 Comments to "How NOT to test that mysqld is alive"

  1. Istvan POdor wrote:

    ohh those good old times when were so beginners.. This article makes me remember my own mistakes :)

  2. Tim Jinkerson wrote:

    or
    if [`ps -aux | grep mysqld | grep -v grep | wc -l` -lt 1]

    Remove the grep from your output before you count it.

  3. Mark Leith wrote:

    "Just use mysqld_safe, huh?" :)

  4. shlomi wrote:

    Mark,
    Yes :D

    They had MySQL installed on Gentoo Linux. I'm not familiar with Gentoo at all; apparently the mysql package did not utilize mysqld_safe, but some Gentoo specific script to start/stop services.

    On same note I usually prefer binary distribution to packaged distribution of MySQL; every linux distro wraps it up differently, uses different paths. conventions.
    With binary I set all up manually from grounds up.

  5. kedar wrote:

    How about "mysqld status"?

  6. shlomi wrote:

    @kedar,

    thanks, yes. To elaborate, "mysqld status" will do the following:
    look for the PID ;
    try kill -0
    attempt pidof mysqld

    Really, connecting to MySQL and initiating a query is, in my opinion, the most reliable way.

  7. kedar wrote:

    Correct I just went through the /etc/rc.d/init.d/functions for "status".

  8. shlomi wrote:

    @kedar,
    so did i ;)

  9. Ronald Bradford wrote:

    When I first read your article my initial though was , the number needs to be 2. As @mark wrote, where is mysqld_safe
    It bothers me when distro's think they can do stuff better.

    It is sometimes of benefit to have multiple checks to determine or triangulate issues. Other checks include mysqladmin ping, netstat for 3306, however anything that checks for pid's, ports, sockets etc are bound to having to support the variable nature etc.

    mysqld_safe is designed to undertake exactly the situation your script attempts to describe. Seems the most logical solution is to implement mysqld_safe.

  10. shlomi wrote:

    @Ronald,

    Not my script :)

    I've made a few recommendations for the customer.
    The customer has no expert sys-admin. For some reason they prefer Gentoo to other, more popular distros; I therefore did not wish to complicate stuff by throwing in a binary distribution they didn't know how to handle. Using the Gentoo repository was their familiar method of installing new apps.

    Repeating my #4 comment, I personally prefer binary distribution (tgz download) and manual install over distro's package; more control over which version; no accidental undesired upgrades etc.

  11. Gerry wrote:

    Need to know the socket (or any variable for that matter)? In bash:

    my_print_defaults mysqld | grep socket | cut -d "=" -f 2

    Wrap in $(...) and you can include it in any alias / script.

    Another way to check if it's up:

    mysqladmin -u user -ppass version

    The user could be one w/ only USAGE granted.

    My $.02
    G

  12. shlomi wrote:

    @Gerry,

    Cool $.02

  13. mark wrote:

    I do phone technical support. I never even considered this possibility. I will keep this in mind. Thanks for the knowledge drop.

    BTW, I like Tim Jinkerson's way of doing this. It's probably the easiest way to do it.

  14. cacdn wrote:

    if you really must use grep for finding a process, try

    | grep [m]ysqld

  15. Mina wrote:

    tangent:

    To address the general problem of grepping theough ps's output and grep finding itself, there are a couple of options:

    * Switch to "pgrep"

    * Make use of a clever trick: Grep supports regexes. If you switch the first letter of what you're grepping for with a regex's character class (by surrounding it with [ ]), it'll still match what you want, but not the grep pattern itself: example: "grep [m]ysqld"

  16. Chad Miller wrote:

    The best thing to to is to make a regular expression that does not match itself.

    ps x |grep fo[o]

    The brackets of the character class containing the single letter are metacharacters that contribute to how the grep matches, and while the text that it matches is exactly the same as with no character class, it ensures that the characters that make it up do not match.

  17. Josh wrote:

    You can also use "ps -C mysqld > /dev/null" and test $?.

  18. Simon wrote:

    You could always use:

    if [`ps -aux | grep [mM]ysqld | wc -l` -lt 1]
    then blah blah blah
    fi

  19. DerekP wrote:

    While there are better ways of doing this that have been noted, there are a couple ways they could've made their chosen method work better.

    A workaround I've used in the past is to replace "grep mysqld" with "egrep [m]ysqld", so that grep doesn't find itself in the ps list.

    Also, a lot of headache could have been saved if they made some type of log entry saying that the script was restarting the service instead of just issuing the command. This would also help finding trends.

  20. Daddio wrote:

    I have always used "mysqladmin ping" to test whether the service is alive and check for exit status of 0

    mysqladmin ping
    if [ $? -ne 0 ]
    then
    /etc/init.d/mysqld start
    else
    echo "All is well"
    fi

  21. JohnK wrote:

    It all sounded very technical, your description of piped processes running simultaneously but in my experience earlier ones always feed into later ones. The issue was more likely a logic error somewhere or simply not testing for the right number. A cut and paste of the actual faulty script would have been good.
    On my Linux box a normal mysql install runs 2 processes and the grep process would take the minimum count to 3. So they should perhaps have been testing for less than 3 and not 2.
    However my main point is that I seriously doubt the accuracy of your pipe explanation as it does not tally with my many scripting 'adventures'.
    Incidentally a nice way to discount the grep would be
    [ `ps -ef | grep [m]ysqld | wc -l` -lt 2 ]

  22. Mihai Limbasan wrote:

    Instead of connecting and running a query you can use mysqladmin ping then check the exit status (zero for alive, non-zero for dead.) That's the "official" way to check it, won't have the overhead and the potential problems of running a query, and will avoid false positives from other processes with names that include mysql.

  23. Fred wrote:

    If really using pgrep, at least do it correctly: pgrep "^mysql$" (or whatever the process is named, taking care of quoting any special meaning characters.)

  24. RobA wrote:

    Really the best way is the Query, When you are really old you will realize that even when a process is hung it still shows up in ps listings. Do the query because it confirms the process is handling requests and not overloaded.

  25. sambeau wrote:

    I wrote a little C program & an article on how to do exactly this at my last company..

    http://knowledgehub.zeus.com/articles/2006/02/27/custom_monitors_for_mysql_in_c

    Might be of help to anyone who as read this.

    Cheers.

  26. Lucas wrote:

    Please, everyone, stop using these hacks. At their very best, they are inelegant and have race conditions. Instead, use something like daemontools or runit. They (a) automatically restart a daemon that dies, and (b) provide race-free reporting of PID and uptime without a kludge like "ps | grep".

  27. Chad wrote:

    Never check for pid files. Never ever ever. It is entirely possible for a process to die abruptly (say, after getting a SIGKILL, after a kernel panic, power outage, the list is endless) and not have the opportunity to remove its pidfile. The single best test mechanism is to directly query the daemon with a short, low-intensity test.

    Hell, or just do `ps aux | grep -v grep |grep mysqld | wc -l`

  28. Adam Nelson wrote:

    This is craziness.

    1. Why is a company without a Sys Admin running Gentoo (not your fault - but somebody needs to tell them that the scenarios is worse than less than ideal). Ubuntu really is a far superior solution for them.
    2. MySQL doesn't just 'go down'. I understand hacks are necessary sometimes, but auto-restarts usually cause more problems than they solve. Automatic failover (i.e. master-slave switch) is a better solution if maintaining uptime without intervention is necessary.
    3. An uptime script should use app-specific code like 'SELECT * FROM table1 LIMIT 1' to confirm that the database isn't corrupt or locked while the server is still running. That query would be in addition to something like 'SELECT now()' or mysqladmin ping which checks for connectivity. The setup costs of these connections are trivial.

    I really think the SysAdmin community needs to start taking a harder line on this type of stuff and the company should be told that whoever wrote that script needs to know that it's totally unacceptable that this happened in the first place.

  29. Ammon Lauritzen wrote:

    ps -Cmysqld --no-header

    Will return exactly one line per running 'mysqld' process. No grep required.

  30. Zach wrote:

    This is how I do it.

    sqlsocket=`grep socket /etc/my.cnf | awk -F = '{ print $2 } '`

    count=0;
    echo -n "Checking MySQL"

    while [ ! -S "${sqlsocket}" ]; do
    sleep 1;
    echo -n "."
    count=$((count+1));
    [ $count -gt 5 ] &done

  31. shlomi wrote:

    @JohnK

    I sincerely assure you that I've seen this happening with my own eyes, even while sitting at the customer's office: by typing the script manually a few times, I did in fact stumbled on the rare case where it fails.
    I can't see how you mean "logic error somewhere or simply not testing for the right number"; while I can't put my hands right now on the original script, I assure you it was all on the `ps aux | grep mysqld | wc -l` part.

    Regards

  32. originalgeek@mac.com wrote:

    Your n00bness comes oozing out right through your proposed solutions.

    If you are going to tolerate a system where a SQL daemon is going offline, which you should not in the first place, the first solution is the correct one, with a modification: grep for the full path to mysqld.

    This is a superior technique to any of the other techniques you mentioned. Why? Let me enumerate:

    pgrep will tell you if any mysqld is running on your system, not necessarily the one you want running. same problem with looking for sockets.

    popping in to do a SELECT in mysql, when the daemon is torched, will take quite a few seconds for the connect request to timeout. since this is the case you are looking for, when it is necessary to restart the daemon, why do you want that to take even longer. also, `ps x | grep /full/path/to/mysqld | wc -l` is fairly inexpensive and you can run it once a second rather than once a minute.

    very poor example of open source advocacy, making your client live with klunkiness like this.

  33. shlomi wrote:

    @Adam (#28)
    1. I agree, and I have.
    2. That's why they looked up a consultant; they couldn't explain this behavior (albeit it was created by their own doing)
    3. completely agree

    "...SysAdmin community...whoever wrote that script needs to know that it’s totally unacceptable..."
    The man who wrote the script is not a sys admin expert. Neither am I, for that matter. I am experienced in many fields, yet still make mistakes. When I'm tired, I can make newbie mistakes. It happens. I agree that a sys-admin review of one's system in in place once in a while.

  34. shlomi wrote:

    @Chad (#27)

    Thanks. Yes. But take a look at the mysqld service source code, and see they actually use both methods...

  35. shlomi wrote:

    Thanks everyone for your comments!

  36. Jake wrote:

    Salami,

    I tried your suggestion but got the following error

    mysql> SELECT something;
    ERROR 1054 (42S22): Unknown column 'something' in 'field list'

    Please help!!

  37. shlomi wrote:

    @originalgeek@mac.com (#32)

    In this case, allow me to further point out:
    You are right that more than one instance of mysqld might be runnign. But your proposal of checking for the full binary path is still dangerous: with proper configuration it's possible to run many instances using the same binary.

    "...popping in to do a SELECT in mysql, when the daemon is torched, will take quite a few seconds for the connect request to timeout. ..."
    Not sure what "torched" means (not a native English speaker). But I fail to see you reasoning: if the daemon is up yet non responsive, your "ps | grep" will claim everything is just fine. However, connecting an querying will fail, which is what we want. Furthermore, it's pretty easy to configure socket connections to timeout according to your pre-defined settings (did lots of these with Java, unrelated to MySQL), so you can timeout as soon as you like.
    Even furthermore, if you can't for some reason set the timeout as you please, you can expect the SQL-testing code to return by some finite time, or else you assume something's wrong, which means you can monitor this. This gets more complicated, but the higher the requirements, the more complicated your code will get.
    And -- if the daemon is not up -- timeout is instantaneous. So where's the problem?

    "...If you are going to tolerate a system where a SQL daemon is going offline, which you should not in the first place..."
    I agree that you should not expect your MySQL daemon to go down. I *have* seen a few cases where it did, unexpectedly, repeatedly, on production. Reasons? Internal bugs, mismatching libraries, beta versions, faulty hardware.

    "...making your client live with klunkiness like this..."
    Not to get into the specifics of my work with the customer, you don't wreck havoc and take everything apart to rebuild it better on first meeting.

    To be strictly clear, I'm not offended or anything by the tone of your writing; and obviously there is no competition about "who's got more experience in what" (I gladly accept that you are a top-notch linux sys admin), but I find your reasoning faulty.

    Regards

  38. Sean wrote:

    Ok, obviously this is a silly way to check if a process is running. A better way to do this, although still silly, is just to run a cron job every mnute that does /etc/init.d/mysql start

    If mysql is already running, nothing will happen, But if it's not, it will start, unlses there's an actual problem with mysql.

  39. shlomi wrote:

    @Sean,
    Depending on your distribution, your "mysqld start" may do just the same thing: check of pid file, pidof mysqld etc.

  40. JohnK wrote:

    @schlomi I completely accept that there was something wrong with their script. However in my experience piped processes accept ALL of the stdin. They don't somehow 'lose' input and return varying results. I have seen processes and grep being listed in different order but I have never seen any of the processes or grep NOT being listed whether they are executed simultaneously or not.
    I suggest that they simply tested for -lt 2 and it should have been -lt 3.
    No matter what everyone says about how it was a n00b error, everyone makes mistakes when writing scripts and the real mistake was not testing it and monitoring the logs before putting it into production.

  41. Ali wrote:

    pgrep seems like the best idea, but a very quick way to fix the original is just to use `ps -e` since it won't list process arguments.

  42. shlomi wrote:

    @JohnK (#40)

    John, just now, on my laptop, see the following real output:

    shlomi@******:~$ ps aux | grep mysqld
    root      8708  0.0  0.0   4020   620 pts/0    S    Oct04   0:00 /bin/sh ./bin/mysqld_safe --datadir=/usr/local/mysql/data --pid-file=/usr/local/mysql/data/piglet.pid
    mysql     8738  0.0  4.8 313584 150308 pts/0   Sl   Oct04   1:21 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data --user=mysql --pid-file=/usr/local/mysql/data/******.pid --skip-external-locking --port=3306 --socket=/tmp/mysql50.sock
    shlomi@******:~$ ps aux | grep mysqld
    shlomi    2175  0.0  0.0   7452   876 pts/5    S+   22:28   0:00 grep mysqld
    root      8708  0.0  0.0   4020   620 pts/0    S    Oct04   0:00 /bin/sh ./bin/mysqld_safe --datadir=/usr/local/mysql/data --pid-file=/usr/local/mysql/data/piglet.pid
    mysql     8738  0.0  4.8 313584 150308 pts/0   Sl   Oct04   1:21 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data --user=mysql --pid-file=/usr/local/mysql/data/******.pid --skip-external-locking --port=3306 --socket=/tmp/mysql50.sock
    

    Except for exchanging my host name with ******, all else is copy+paste.

    For your convenicene, try out this script on your machine:

    while [ 1 ] 
    do
      if [ `ps aux | grep mysqld | wc -l` -lt 3 ]; then 
        echo `date`
        break
      fi
    done
    

    Regards

    PS -- you're absolutely right about the problem being that the script was untested.

  43. P wrote:

    How about:

    ps ax | grep mysql[d]

    and do the necessary wc logic

  44. shlomi wrote:

    @Jake (#36),

    My mistake: should read as
    SELECT that_other_thing;

  45. JohnK wrote:

    @shlomi. Ok. I see what you see now. However I only see it in the context of a statement within backticks.
    Your laptop example shows a consistent 3 processes on mine. (Opensuse 11.0) I.e. ps aux | grep mysqld shows 2 mysqld processes and a grep process every time. An if statement with a backtick test statement though, drops the grep statement at random.
    I've never seen that before as I've always filtered the grep process in scripts. (... and never tried to implemented such a stupid restart policy.) I stand corrected.

  46. shlomi wrote:

    @JohnK

    Interesting about only backticks on your distro/machine.

    Cheers

  47. Siddharth Bhattacharya wrote:

    Here is the corrected Startup Script say
    #start.sh
    while [ 1 ];
    if [ `ps aux | grep mysql | grep -v mysql | wc -l` -lt 1 ]
    then /etc/init.d/mysql restart
    ...

    Also add a cronjob using command:
    crontab -e
    * * * * * start.sh

  48. sharks wrote:

    Mihai @ 22: 'mysqladmin ping' wins :-)
    You win a ... uh, something.

    Everyone else discussing ps and various perverse grep misadventures, please don't do it. All it takes is someone running
    'mysqldump' or 'man mysqld' or 'vi /tmp/why_is_mysqld_restarting.txt' to generate a false positive.

    pidfiles are also unreliable if the server has died unexpectedly.

    Use 'mysqladmin ping'. It goes away, parses the mysql config files for you, connects to the mysqld.sock and checks that mysqld is actually working.

  49. shlomi wrote:

    @sharks
    I agree, please re-read the last paragraph in my post.
    Please also note the duality within mysql: "mysqladmin ping" conencts via socket etc., while "mysqld status" checks for pid file and process id.

    Regards

  50. FredV wrote:

    looks like some sysadmin wannabe does not even know how to use awk and regular expressions, any kid should see the holes in that solution

    for procf in /proc/[0-9]*; do
    if [ "$(cat $procf/exename)" = "/usr/bin/mysqld" ]; then
    echo "process is running"
    fi
    done

  51. ajd4096 wrote:

    Install MySQL Enterprise Monitor!

    Note that it runs a separate instance of mysqld, so "ps|grep|wc" tricks are suddenly useless.

    Not that they ever were a good trick, for a number of reasons.
    * The race condition mentioned does occur.
    * "ps -C" is not portable.
    * Processes hang sometimes, functional testing is required.
    * Monitoring "foo" should not break when "bar" is run, if/even when "bar" is "foo2".

  52. Adam Nelson wrote:

    I'm a little disturbed by all the people still posting ps and proc options.

    It's fun that these tricks can be done, but any of that type of custom scripting is not extensible, rarely portable, and highly brittle under a panoply of eventualities.

    I know there are systems that have been up for 10 years on this kind of stuff, but it's not responsible for such band-aids to be allowed to be permanent (or for SysAdmins to think that those are real, proper solutions).

  53. FredV wrote:

    Adam: using /proc (a standard Unix feature) and bash-compatible shell script is not portable and brittle? i disagree. asfor extensible: who on earth would want to build something on top and exapand on this? It's a quick hack anyway, normally you shouldn't have to do stuff like this.

    btw. using /proc, which is *not* as stupid as grepping ps output without even using awk to split into columns. a process can change how it's displayed in ps by changing argv[0], so the /proc solution I gave is actually the only one that will always work and not brittle as you say

  54. ajd4096 wrote:

    @FredV
    /proc is not a standard Unix feature, nor is bash.

    There is more to unix than the linux distro du-jour.

  55. FredV wrote:

    "/proc is not a standard Unix feature, nor is bash."

    that's why every unix has it?

  56. Zach wrote:

    @FredV

    you are totally incorrect, /proc is not in every Unix... /proc has been depreciated in older SysV style OS'es... like ..wait for it... HPUX!

  57. ajd4096 wrote:

    /proc is not deprecated (nor depreciated) on HP-UX, it never had it, AFAIK.

    FreeBSD/OpenBSD/NetBSD have a procfs, but it is not mounted by default.

    (The main reason for avoiding procfs is that reading from /proc in user-space suffers from TOCTOU, whereas a system call does not. This issue was raised yet again on the OpenBSD misc@ mailing list only a few weeks ago.)

    There is a procfs for OSX, it's MacFUSE-based, not from Apple, and not even installed by default.

    Obviously there is a sizeable percentage of unix systems where a script referring to /proc will not work.

    Now, about the ubiquity of that "bash" shell....

  58. Vitlalie Cherpec wrote:

    I'm using Nagios (http://www.nagios.org/) to monitor mysql servers. Reinventing the wheel is too error prone, just stick with stable and tested solutions. If Nagios itself is too complex, Nagios plugins can be used from scripts. :

    Usage help for check_mysql plugin:

    [vitalie@shark ~]$ /usr/lib64/nagios/plugins/check_mysql --help

    check_mysql v2034 (nagios-plugins 1.4.13)
    Copyright (c) 1999-2007 Nagios Plugin Development Team

    This program tests connections to a mysql server

    Usage: check_mysql [-d database] [-H host] [-P port] [-s socket]
    [-u user] [-p password] [-S]

    [...]

  59. Ariel wrote:

    Suppose someone is interested in monitoring (the unlikely event of) whether MySQL is frozen, overloaded, unresponsive, etc., but it hasn't crashed -- i.e., the pid and sock files exist and the mysql process is listed by ps. In this scenario, whatever method determines the database to be frozen will trigger a DRBD or cluster failover script, which this scenario assumes is already in place. (And the reason that, btw, Nagios can't help here.)

    Now, how would you handle that?

    It seems to me that logging in, querying, or doing anything in the database risks complicating things further. What if the query/check doesn't complete in time? I define "in time" here to mean before the next check, since I assume this check will be running on a cron or monit job at a fairly short interval to justify the failover, but you can choose any definition.

    My contention here is that there really is no way to meaningfully check on the internal status of MySQL from the host it's running on when an automated failover event depends on the outcome. Leaving aside the question of why would anyone would want to be in this scenario (believe me, I already asked the question, and the answer is self-evident), I'd like to know of any possible solutions. All the comments I've read here and elsewhere lead me to believe there isn't.

  60. shlomi wrote:

    @Ariel,

    A possible heuristic would be to connect and issue a SHOW PROCESSLIST (this does not make for an overhead). By the number of running threads and the time they are waiting ; or if InnoDB Plugin is enabled, by the number of running transactions / locked transactions, you may deduce that things are not going well.
    You may choose to monitor the above for 5 seconds, and detect that no thread/transaction has completed since previous sample; this may indicate a "stuck mysql".

    This is just off the top of my head; haven't put serious thought yet.

  61. Ariel wrote:

    Thanks, Shlomi. I'm just concerned that if MySQL is truly stuck, the query won't exit or timeout soon enough before the next time it's supposed to run, causing a cascading effect.

    Worse yet, I'm concerned about generating a false positive that would trigger a needless failover. I'm working on a monit-based solution that would hopefully avoid those situations.

    Again, thanks for you suggestion.

Leave Your Comment

 
Powered by Wordpress and MySQL. Theme by openark.org