How NOT to test that mysqld is alive

I had a call from a new customer last week. They had issues with their MySQL server. Apparently, it was repeatingly crashing, every few hours. To have their production system kept alive, they used a script to periodically see if MySQL was still alive, and if not – start it.

I was first thinking in directions of old installations; wrong memory allocations (too little memory for large content; to large memory allocation for weak machine). When I reached the customer’s premises, I quickly reviewed general settings, while explaining some basic configuration guidelines. There were peculiarities (e.g. query_cache_limit being larger than query_cache_size), but nothing to obviously suggest a reason for crash.

I then observed the error log. Took some time to find it, since the log_error parameter appeared twice in the my.cnf file; first one appearing at the very beginning, the second (overwriting the first) was far into the file.

Sure enough, there were a lot of startup messages. And… shutdown messages. In fact, the log looked something like:

090923 17:38:15  mysqld started
090923 17:38:16  InnoDB: Started; log sequence number 0 3707779031
090923 17:38:16 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.0.77-log'  socket: '/tmp/mysql50.sock'  port: 3306  MySQL Community Server (GPL)
090923 19:53:41 [Note] /usr/local/mysql/bin/mysqld: Normal shutdown
090923 19:53:56  mysqld started
090923 19:53:56  InnoDB: Started; log sequence number 0 5288400927
090923 19:53:56 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.0.77-log'  socket: '/tmp/mysql50.sock'  port: 3306  MySQL Community Server (GPL)
090929 22:38:48 [Note] /usr/local/mysql/bin/mysqld: Normal shutdown
090923 22:38:58  mysqld started
090923 22:38:58  InnoDB: Started; log sequence number 0 7102832776
090923 22:38:58 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.0.77-log'  socket: '/tmp/mysql50.sock'  port: 3306  MySQL Community Server (GPL)
...

(The above is just a sample, not the original file).

Well, the log says there’s a lot of normal shutdowns.

Looking at the script

Next it was time to look at the script which was supposed to verify MySQL was up and running – else wise start it. And it went something like this:

while [ 1 ];
  if [`ps aux | grep mysqld | wc -l` -lt 2]
    then /etc/init.d/mysql restart
  ...
  sleep 60

The script was looking for all processes, then grepping for mysqld, counting number of lines in output. It expected 2 lines: one for the mysqld process, one for the grep mysqld process itself.

If you don’t know what’s wrong with this, a very brief explanation about how pipelines work in unix work it at place:

Pipelined processes do not execute one after another, or one before another. They are all executed at once. So, “ps aux | grep mysqld | wc -l” immediately spawns ps, grep, wc, then sets the standard output of one to the standard input of the other (most simplified description I could think of).

It is likely that grep will outrun ps in the race for starting up. So grep is started, already waiting for input from ps, which then notices both mysqld is alive, but also grep mysqld, lists them both along with other processes, to be filtered by grep, to be counted by wc (returning two lines count).

But this is just because ps is heavier than grep. It doesn’t have to be like that.

The less common scenario

Every once and again, ps outruns grep in the race for starting up. It would list the active processes, and no “grep” would appear in the listing. grep would then run, filtering the result of ps, then to be counted by wc — oh, there is only one line now! The script assumes this means mysqld is down (since it assumed grep is always there), and restarts mysqld.

So, the script which was supposed to take care of MySQL crashes, was actually causing them (though no crash occurred).

Better ways to test that MySQL is alive

Look for the pid file. This is the standard (that’s how the mysql service works). Look for the unix socket file. Both require that you parse the my.cnf file to learn where these files are.

If you’re reluctant to read the configuration files, other options are at hand. OK, look for the process; but use pgrep mysqld. No need to list all processes, then grep them.

And best way, that will set your mind at ease even if you’re worried that “mysql is running but not responding; it is stuck”: connect to MySQL and issue SELECT 1, SELECT NOW(), SELECT something. This would be the ultimate test that MySQL is up, listening, responding and valid.

61 thoughts on “How NOT to test that mysqld is alive

  1. @FredV

    you are totally incorrect, /proc is not in every Unix… /proc has been depreciated in older SysV style OS’es… like ..wait for it… HPUX!

  2. /proc is not deprecated (nor depreciated) on HP-UX, it never had it, AFAIK.

    FreeBSD/OpenBSD/NetBSD have a procfs, but it is not mounted by default.

    (The main reason for avoiding procfs is that reading from /proc in user-space suffers from TOCTOU, whereas a system call does not. This issue was raised yet again on the OpenBSD misc@ mailing list only a few weeks ago.)

    There is a procfs for OSX, it’s MacFUSE-based, not from Apple, and not even installed by default.

    Obviously there is a sizeable percentage of unix systems where a script referring to /proc will not work.

    Now, about the ubiquity of that “bash” shell….

  3. I’m using Nagios (http://www.nagios.org/) to monitor mysql servers. Reinventing the wheel is too error prone, just stick with stable and tested solutions. If Nagios itself is too complex, Nagios plugins can be used from scripts. :

    Usage help for check_mysql plugin:

    [vitalie@shark ~]$ /usr/lib64/nagios/plugins/check_mysql –help

    check_mysql v2034 (nagios-plugins 1.4.13)
    Copyright (c) 1999-2007 Nagios Plugin Development Team

    This program tests connections to a mysql server

    Usage: check_mysql [-d database] [-H host] [-P port] [-s socket]
    [-u user] [-p password] [-S]

    […]

  4. Suppose someone is interested in monitoring (the unlikely event of) whether MySQL is frozen, overloaded, unresponsive, etc., but it hasn’t crashed — i.e., the pid and sock files exist and the mysql process is listed by ps. In this scenario, whatever method determines the database to be frozen will trigger a DRBD or cluster failover script, which this scenario assumes is already in place. (And the reason that, btw, Nagios can’t help here.)

    Now, how would you handle that?

    It seems to me that logging in, querying, or doing anything in the database risks complicating things further. What if the query/check doesn’t complete in time? I define “in time” here to mean before the next check, since I assume this check will be running on a cron or monit job at a fairly short interval to justify the failover, but you can choose any definition.

    My contention here is that there really is no way to meaningfully check on the internal status of MySQL from the host it’s running on when an automated failover event depends on the outcome. Leaving aside the question of why would anyone would want to be in this scenario (believe me, I already asked the question, and the answer is self-evident), I’d like to know of any possible solutions. All the comments I’ve read here and elsewhere lead me to believe there isn’t.

  5. @Ariel,

    A possible heuristic would be to connect and issue a SHOW PROCESSLIST (this does not make for an overhead). By the number of running threads and the time they are waiting ; or if InnoDB Plugin is enabled, by the number of running transactions / locked transactions, you may deduce that things are not going well.
    You may choose to monitor the above for 5 seconds, and detect that no thread/transaction has completed since previous sample; this may indicate a “stuck mysql”.

    This is just off the top of my head; haven’t put serious thought yet.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.