I had a call from a new customer last week. They had issues with their MySQL server. Apparently, it was repeatedly crashing, every few hours. To keep their production system alive, they used a script which periodically checked whether MySQL was still alive, and, if not, started it.
My first thoughts went in the direction of old installations and wrong memory allocations (too little memory for large content; too large a memory allocation for a weak machine). When I reached the customer’s premises, I quickly reviewed the general settings, while explaining some basic configuration guidelines. There were peculiarities (e.g. query_cache_limit being larger than query_cache_size), but nothing to obviously suggest a reason for a crash.
I then looked at the error log. It took some time to find, since the log_error parameter appeared twice in the my.cnf file: the first occurrence at the very beginning of the file, the second (overriding the first) far into it.
Sure enough, there were a lot of startup messages. And… shutdown messages. In fact, the log looked something like:
090923 17:38:15  mysqld started
090923 17:38:16  InnoDB: Started; log sequence number 0 3707779031
090923 17:38:16 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.0.77-log'  socket: '/tmp/mysql50.sock'  port: 3306  MySQL Community Server (GPL)
090923 19:53:41 [Note] /usr/local/mysql/bin/mysqld: Normal shutdown
090923 19:53:56  mysqld started
090923 19:53:56  InnoDB: Started; log sequence number 0 5288400927
090923 19:53:56 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.0.77-log'  socket: '/tmp/mysql50.sock'  port: 3306  MySQL Community Server (GPL)
090923 22:38:48 [Note] /usr/local/mysql/bin/mysqld: Normal shutdown
090923 22:38:58  mysqld started
090923 22:38:58  InnoDB: Started; log sequence number 0 7102832776
090923 22:38:58 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.0.77-log'  socket: '/tmp/mysql50.sock'  port: 3306  MySQL Community Server (GPL)
...
(The above is just a sample, not the original file).
Well, the log says there’s a lot of normal shutdowns.
Looking at the script
Next it was time to look at the script which was supposed to verify that MySQL was up and running, and otherwise start it. It went something like this:
while [ 1 ];
do
  if [ `ps aux | grep mysqld | wc -l` -lt 2 ]
  then
    /etc/init.d/mysql restart
  fi
  ...
  sleep 60
done
The script was listing all processes, grepping for mysqld, then counting the number of lines in the output. It expected 2 lines: one for the mysqld process itself, one for the grep mysqld process.
If you don’t know what’s wrong with this, a very brief explanation of how pipelines work in unix is in place:
Pipelined processes do not execute one after another, or one before another. They are all executed at once. So, “ps aux | grep mysqld | wc -l” immediately spawns ps, grep, wc, then sets the standard output of one to the standard input of the other (most simplified description I could think of).
It is likely that grep will outrun ps in the race to start up. So grep starts first, already waiting for input from ps; ps then notices not only that mysqld is alive, but also the grep mysqld process, and lists them both along with all other processes, to be filtered by grep, then counted by wc (returning a count of two lines).
But this is just because ps is heavier than grep. It doesn’t have to be like that.
The less common scenario
Every once in a while, ps outruns grep in the race to start up. It lists the active processes, and no “grep” appears in the listing. grep then runs, filtering the output of ps, which is then counted by wc: oh, there is only one line now! The script assumes this means mysqld is down (since it assumed the grep line would always be there), and restarts mysqld.
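If you want to see the race for yourself, run the same pipeline in a tight loop and tally the results. This is just my own quick demo, not part of the customer’s script, and how often (if at all) you hit the short count depends entirely on your machine and its load:

# Demo only: run the pipeline 100 times and count how often each
# line count appears. Every now and then the count may drop by one,
# even though mysqld never stopped.
for i in `seq 1 100`
do
  ps aux | grep mysqld | wc -l
done | sort | uniq -c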
So the script which was supposed to take care of MySQL crashes was actually the one causing them (though no real crash ever occurred).
Better ways to test that MySQL is alive
Look for the pid file. This is the standard (that’s how the mysql service works). Look for the unix socket file. Both require that you parse the my.cnf file to learn where these files are.
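For illustration, a pid file based check could look something like the following; the path below is only an assumption, and should really be taken from my.cnf:

# Hypothetical pid file location; parse the real path out of my.cnf.
# kill -0 only verifies the process exists and that we may signal it,
# so run this as root or as the user mysqld runs under.
PID_FILE=/var/run/mysqld/mysqld.pid
if [ -f "$PID_FILE" ] && kill -0 `cat "$PID_FILE"` 2>/dev/null
then
  echo "mysqld process exists"
else
  echo "mysqld is not running"
fi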
If you’re reluctant to read the configuration files, other options are at hand. OK, look for the process; but use pgrep mysqld. No need to list all processes, then grep them.
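Something along these lines (a sketch, not a full monitoring solution):

# pgrep never matches itself, so there is no race to compensate for.
if pgrep mysqld > /dev/null
then
  echo "mysqld process found"
else
  echo "no mysqld process"
fi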
And the best way, one that will set your mind at ease even if you’re worried that “mysql is running but not responding; it is stuck”: connect to MySQL and issue SELECT 1, SELECT NOW(), SELECT something. This is the ultimate test that MySQL is up, listening, responding and valid.
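As a minimal sketch (assuming a local server, with credentials available in ~/.my.cnf or passed on the command line), the check can be as simple as:

# Connect, run a trivial query, and treat any failure (no connection,
# no response, bad credentials) as "MySQL is not OK".
if mysql --connect-timeout=5 -e "SELECT 1" > /dev/null 2>&1
then
  echo "MySQL is up and responding"
else
  echo "MySQL did not respond"
fi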
Salami,
I tried your suggestion but got the following error
mysql> SELECT something;
ERROR 1054 (42S22): Unknown column 'something' in 'field list'
Please help!!
@originalgeek@mac.com (#32)
In this case, allow me to further point out:
You are right that more than one instance of mysqld might be running. But your proposal of checking for the full binary path is still dangerous: with proper configuration it’s possible to run many instances using the same binary.
“…popping in to do a SELECT in mysql, when the daemon is torched, will take quite a few seconds for the connect request to timeout. …”
Not sure what “torched” means (not a native English speaker). But I fail to see your reasoning: if the daemon is up yet not responsive, your “ps | grep” will claim everything is just fine. However, connecting and querying will fail, which is what we want. Furthermore, it’s pretty easy to configure socket connections to time out according to your pre-defined settings (I did lots of these with Java, unrelated to MySQL), so you can time out as soon as you like.
Even furthermore, if you can’t for some reason set the timeout as you please, you can expect the SQL-testing code to return within some finite time, or else assume something’s wrong, which means you can monitor for this. It gets more complicated, but the higher the requirements, the more complicated your code will get.
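For example (my own sketch, not something I set up for this customer), you can wrap the whole check with a hard deadline using the coreutils timeout utility:

# Kill the check if it does not complete within 10 seconds,
# and treat that exactly like a refused connection.
if timeout 10 mysql -e "SELECT 1" > /dev/null 2>&1
then
  echo "MySQL answered in time"
else
  echo "MySQL is down or not responding"
fi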
And if the daemon is not up at all, the connection fails immediately; there is no timeout to wait through. So where’s the problem?
“…If you are going to tolerate a system where a SQL daemon is going offline, which you should not in the first place…”
I agree that you should not expect your MySQL daemon to go down. I *have* seen a few cases where it did, unexpectedly, repeatedly, on production. Reasons? Internal bugs, mismatching libraries, beta versions, faulty hardware.
“…making your client live with klunkiness like this…”
Without getting into the specifics of my work with the customer, you don’t wreak havoc and take everything apart to rebuild it better on the first meeting.
To be strictly clear, I’m not offended or anything by the tone of your writing; and obviously there is no competition about “who’s got more experience in what” (I gladly accept that you are a top-notch linux sys admin), but I find your reasoning faulty.
Regards
Ok, obviously this is a silly way to check if a process is running. A better way to do this, although still silly, is just to run a cron job every minute that does /etc/init.d/mysql start
If mysql is already running, nothing will happen. But if it’s not, it will start, unless there’s an actual problem with mysql.
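For reference, a crontab entry for this would look something like (just a sketch; output is discarded so cron doesn’t mail you every minute):

* * * * * /etc/init.d/mysql start > /dev/null 2>&1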
@Sean,
Depending on your distribution, your “mysqld start” may do just the same thing: check the pid file, run pidof mysqld, etc.
@schlomi I completely accept that there was something wrong with their script. However in my experience piped processes accept ALL of the stdin. They don’t somehow ‘lose’ input and return varying results. I have seen processes and grep being listed in different order but I have never seen any of the processes or grep NOT being listed whether they are executed simultaneously or not.
I suggest that they simply tested for -lt 2 and it should have been -lt 3.
No matter what everyone says about how it was a n00b error, everyone makes mistakes when writing scripts and the real mistake was not testing it and monitoring the logs before putting it into production.