{"id":1355,"date":"2009-10-01T09:17:27","date_gmt":"2009-10-01T07:17:27","guid":{"rendered":"http:\/\/code.openark.org\/blog\/?p=1355"},"modified":"2009-10-01T09:17:27","modified_gmt":"2009-10-01T07:17:27","slug":"how-not-to-test-that-mysqld-is-alive","status":"publish","type":"post","link":"https:\/\/code.openark.org\/blog\/mysql\/how-not-to-test-that-mysqld-is-alive","title":{"rendered":"How NOT to test that mysqld is alive"},"content":{"rendered":"<p>I had a call from a new customer last week. They had issues with their MySQL server. Apparently, it was repeatedly crashing, every few hours. To keep their production system alive, they used a script to periodically check whether MySQL was still alive, and if not &#8211; start it.<\/p>\n<p>I was first thinking in the direction of old installations and wrong memory allocation (too little memory for large content; too large an allocation for a weak machine). When I reached the customer&#8217;s premises, I quickly reviewed the general settings, while explaining some basic configuration guidelines. There were peculiarities (e.g. <strong>query_cache_limit<\/strong> being larger than <strong>query_cache_size<\/strong>), but nothing to obviously suggest a reason for a crash.<\/p>\n<p>I then turned to the error log. It took some time to find, since the <strong>log_error<\/strong> parameter appeared twice in the <strong>my.cnf<\/strong> file: the first occurrence at the very beginning, the second (overriding the first) far into the file.<\/p>\n<p>Sure enough, there were a lot of startup messages. And&#8230; shutdown messages. 
In fact, the log looked something like:<!--more--><\/p>\n<blockquote>\n<pre>090923 17:38:15\u00a0 mysqld started\r\n090923 17:38:16\u00a0 InnoDB: Started; log sequence number 0 3707779031\r\n090923 17:38:16 [Note] \/usr\/local\/mysql\/bin\/mysqld: ready for connections.\r\nVersion: '5.0.77-log'\u00a0 socket: '\/tmp\/mysql50.sock'\u00a0 port: 3306\u00a0 MySQL Community Server (GPL)\r\n090923 19:53:41 [Note] \/usr\/local\/mysql\/bin\/mysqld: Normal shutdown\r\n090923 19:53:56\u00a0 mysqld started\r\n090923 19:53:56\u00a0 InnoDB: Started; log sequence number 0 5288400927\r\n090923 19:53:56 [Note] \/usr\/local\/mysql\/bin\/mysqld: ready for connections.\r\nVersion: '5.0.77-log'\u00a0 socket: '\/tmp\/mysql50.sock'\u00a0 port: 3306\u00a0 MySQL Community Server (GPL)\r\n090923 22:38:48 [Note] \/usr\/local\/mysql\/bin\/mysqld: Normal shutdown\r\n090923 22:38:58\u00a0 mysqld started\r\n090923 22:38:58\u00a0 InnoDB: Started; log sequence number 0 7102832776\r\n090923 22:38:58 [Note] \/usr\/local\/mysql\/bin\/mysqld: ready for connections.\r\nVersion: '5.0.77-log'\u00a0 socket: '\/tmp\/mysql50.sock'\u00a0 port: 3306\u00a0 MySQL Community Server (GPL)\r\n...<\/pre>\n<\/blockquote>\n<p>(The above is just a sample, not the original file).<\/p>\n<p>Well, the log says there&#8217;s a lot of <em>normal<\/em> shutdowns.<\/p>\n<h4>Looking at the script<\/h4>\n<p>Next it was time to look at the script which was supposed to verify MySQL was up and running &#8211; or else start it. And it went something like this:<\/p>\n<blockquote>\n<pre><strong>while<\/strong> [ 1 ]; <strong>do<\/strong>\r\n  <strong>if<\/strong> [ `ps aux | grep mysqld | wc -l` <strong>-lt<\/strong> 2 ]\r\n    <strong>then<\/strong> \/etc\/init.d\/mysql restart\r\n  <strong>fi<\/strong>\r\n  ...\r\n  sleep 60\r\n<strong>done<\/strong><\/pre>\n<\/blockquote>\n<p>The script was looking at all processes, <strong>grep<\/strong>ping for <strong>mysqld<\/strong>, and counting the number of lines in the output. 
It expected 2 lines: one for the <strong>mysqld<\/strong> process, and one for the <strong>grep mysqld<\/strong> process itself.<\/p>\n<p>If you don&#8217;t see what&#8217;s wrong with this, a very brief explanation of how pipelines work in unix is in place:<\/p>\n<p>Pipelined processes do not execute one after another, or one before another. They are all executed <em>at once<\/em>. So, &#8220;<strong>ps aux | grep mysqld | wc -l&#8221;<\/strong> immediately spawns <strong>ps<\/strong>, <strong>grep<\/strong> and <strong>wc<\/strong>, then connects the standard output of each to the standard input of the next (the most simplified description I could think of).<\/p>\n<p>It is likely that <strong>grep<\/strong> will outrun <strong>ps<\/strong> in the race for starting up. So <strong>grep<\/strong> is started, already waiting for input from <strong>ps<\/strong>, which then notices not only that <strong>mysqld<\/strong> is alive, but also the <strong>grep mysqld<\/strong> process; it lists both, along with the other processes, to be filtered by <strong>grep<\/strong> and counted by <strong>wc<\/strong> (returning a count of two lines).<\/p>\n<p>But this is just because <strong>ps<\/strong> is heavier than <strong>grep<\/strong>. It doesn&#8217;t <em>have<\/em> to be like that.<\/p>\n<h4>The less common scenario<\/h4>\n<p>Every once in a while, <strong>ps<\/strong> outruns <strong>grep<\/strong> in the race for starting up. It lists the active processes, and no &#8220;grep&#8221; appears in the listing. <strong>grep<\/strong> then runs, filtering the result of <strong>ps<\/strong>, to be counted by <strong>wc<\/strong> &#8212; oh, there is only one line now! 
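This race can be reproduced without a live mysqld at all, by faking the two possible ps listings. A minimal sketch (the `ps_grep_wins`/`ps_wins` helpers and sample process lines are mine, for illustration only); it also shows the classic self-excluding pattern `grep '[m]ysqld'`, which never matches its own command line:

```shell
#!/bin/sh
# What "ps aux" may show while "ps aux | grep mysqld | wc -l" runs,
# depending on which process wins the startup race.
ps_grep_wins() {    # grep started first, so ps lists it too
  printf '%s\n' 'mysql  1234  /usr/local/mysql/bin/mysqld' \
                'root   5678  grep mysqld'
}
ps_wins() {         # ps started first: no grep line yet
  printf '%s\n' 'mysql  1234  /usr/local/mysql/bin/mysqld'
}

# The same healthy server yields different counts, purely by timing:
ps_grep_wins | grep -c mysqld    # 2: the check "passes"
ps_wins      | grep -c mysqld    # 1: the script needlessly restarts mysqld

# Self-excluding pattern: the character class [m]ysqld still matches the
# string "mysqld", but not its own command line ("grep [m]ysqld"),
# so the count no longer depends on the race:
printf '%s\n' 'mysql  1234  /usr/local/mysql/bin/mysqld' \
              'root   5678  grep [m]ysqld' | grep -c '[m]ysqld'   # 1
```

(`pgrep mysqld` sidesteps the problem in a similar way: it matches process names only, and never counts itself.)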
The script assumes this means mysqld is down (since it assumed <strong>grep<\/strong> would always be there), and <em>restarts mysqld<\/em>.<\/p>\n<p>So the script which was supposed to take care of MySQL crashes was actually causing the shutdowns (no crash ever occurred).<\/p>\n<h4>Better ways to test that MySQL is alive<\/h4>\n<p>Look for the <em>pid<\/em> file. This is the standard (it&#8217;s how the mysql service works). Or look for the unix socket file. Both require that you parse the my.cnf file to learn where these files are.<\/p>\n<p>If you&#8217;re reluctant to read the configuration files, other options are at hand. OK, look for the process; but use <strong>pgrep mysqld<\/strong>. There is no need to list all processes, then <strong>grep<\/strong> them.<\/p>\n<p>And the best way, which will set your mind at ease even if you&#8217;re worried that &#8220;mysql is running but not responding; it is stuck&#8221;: connect to MySQL and issue <strong>SELECT 1<\/strong>, <strong>SELECT NOW()<\/strong>, <strong>SELECT <em>something<\/em><\/strong>. This is the ultimate test that MySQL is up, listening, responding and valid.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I had a call from a new customer last week. They had issues with their MySQL server. Apparently, it was repeatedly crashing, every few hours. To keep their production system alive, they used a script to periodically check whether MySQL was still alive, and if not &#8211; start it. 
I was first thinking in [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"enabled":false},"version":2}},"categories":[5],"tags":[25,46],"class_list":["post-1355","post","type-post","status-publish","format-standard","hentry","category-mysql","tag-analysis","tag-monitoring"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p2bZZp-lR","_links":{"self":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/1355","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/comments?post=1355"}],"version-history":[{"count":10,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/1355\/revisions"}],"predecessor-version":[{"id":1365,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/1355\/revisions\/1365"}],"wp:attachment":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/media?parent=1355"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/categories?post=1355"},{"t
axonomy":"post_tag","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/tags?post=1355"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}