{"id":6793,"date":"2014-04-18T13:15:09","date_gmt":"2014-04-18T11:15:09","guid":{"rendered":"http:\/\/code.openark.org\/blog\/?p=6793"},"modified":"2014-04-18T20:02:49","modified_gmt":"2014-04-18T18:02:49","slug":"anemomaster-dml-visibility-your-must-do-for-tomorrow","status":"publish","type":"post","link":"https:\/\/code.openark.org\/blog\/mysql\/anemomaster-dml-visibility-your-must-do-for-tomorrow","title":{"rendered":"&#8220;Anemomaster&#8221;: DML visibility. Your must-do for tomorrow"},"content":{"rendered":"<p>Here&#8217;s our take of master DML query monitoring at <a href=\"http:\/\/www.outbrain.com\/\">Outbrain<\/a> (<a href=\"https:\/\/www.percona.com\/live\/mysql-conference-2014\/sessions\/mysql-devops-outbrain\">presented April 2014<\/a>). It took a half-day to code, implement, automate and deploy, and within the first hour of work we managed to catch multiple ill-doing services and scripts. You might want to try this out for yourself.<\/p>\n<h4>What&#8217;s this about?<\/h4>\n<p>What queries do you monitor on your MySQL servers? Many don&#8217;t monitor queries at all, and only look up slow queries on occasion, using <a href=\"http:\/\/www.percona.com\/doc\/percona-toolkit\/2.2\/pt-query-digest.html\">pt-query-digest<\/a>. Some monitor slow queries, where <a href=\"https:\/\/github.com\/box\/Anemometer\/wiki\">Anemometer<\/a> (relying on pt-query-digest) is a very good tool. To the extreme, some monitor TCP traffic on all MySQL servers &#8212; good for you! In between, there&#8217;s a particular type of queries that are of special interest: DML (<strong>INSERT\/UPDATE\/DELETE<\/strong>) queries issued against the master.<\/p>\n<p>They are of particular interest because they are only issued once against the master, yet propagate through replication topology to execute on all slaves. These queries have a direct impact on your slave lag and on your overall replication capacity. I suggest you should be familiar with your DMLs just as you are with your slow queries.<\/p>\n<p>In particular, we had multiple occasions in the past where all or most slaves started lagging. Frantically we would go to our metrics; yes! We would see a spike in <strong>com_insert<\/strong>. Someone (some service) was obviously generating more <strong>INSERT<\/strong>s than usual, at a high rate that the slaves could not keep up with. But, <em>which<\/em> <strong>INSERT<\/strong> was that? Blindly, we would look at the binary logs. Well, erm, what are we looking for, exactly?<\/p>\n<p>Two such occasions convinced us that there should be a solution, but it took some time till it hit us. We were already using <em>Anemometer<\/em> for monitoring our slow logs. We can do the same for monitoring our binary logs. Thus was born <em>&#8220;Anemomaster&#8221;<\/em>.<\/p>\n<p>Quick recap on how Anemometer works: you issue <em>pt-query-digest<\/em> on your slow logs on all MySQL hosts (we actually first ship the slow logs to a central place where we analyse them; same thing). This is done periodically, and slow logs are then rotated. You throw the output of <em>pt-query-digest<\/em> to a central database (this is built in with <em>pt-query-digest<\/em>; it doesn&#8217;t necessarily produce human readable reports). <em>Anemometer<\/em> would read this central database and visualize the slow queries.<\/p>\n<h4>Analysing DMLs<\/h4>\n<p>But then, <em>pt-query-digest<\/em> doesn&#8217;t only parse slow logs. It can parse binary logs. Instead of asking for total query time, we ask for query count, and on we go to establish the same mechanism, using same <em>pt-query-digest<\/em> and same <em>Anemometer<\/em> to store and visualize the DMLs issued on our masters.<\/p>\n<p>When analysing DMLs we&#8217;re interested in parsing binary logs &#8212; and it makes no sense to do the same on all slaves. All slaves just have same copy of binlog entries as the master produces. It only takes <em>one<\/em> server to get an accurate picture of the DMLs on your replication topology.<\/p>\n<p><!--more--><\/p>\n<p>One server could be the master, and this can indeed be done: just <strong>FLUSH MASTER LOGS<\/strong>, parse the binary logs with <em>pt-query-digest<\/em>, and you&#8217;re done. But like others, we tend to look at our masters as tender babies. We care for them, and do not wish to overload them unnecessarily. We chose to get the binlog entries from our slaves, instead. We also chose to get the entries from the relay logs, since these are unaffected by slave performance and as long as network is good, we can expect the relay logs to be <em>very<\/em> up to date. At any given time we have two slaves that take this role (this is automated and verified). On a <strong>10<\/strong> minute period we would flush the relay logs on these servers, and analyse whatever relay logs we have not analysed as yet.<\/p>\n<p>The script below is a slightly modified version of our own, and should work for the standard installation. Modify to fit your own data (in particular, it assumes relay logs are named <strong>mysqld-relay-bin<\/strong>; <strong>datadir<\/strong> is specified in <strong>\/etc\/my.cnf<\/strong>, and please don&#8217;t ask me how to do this on Windows):<\/p>\n<blockquote>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n#!\/bin\/bash\r\n\r\n# \r\n# Digest latest relay logs file, write results to &quot;anemomaster&quot;\r\n#\r\n# This script can run from any machine; it only needs to execute on a single machine per mysql cluster, but for\r\n# analysis availability it should execute on at least two hosts per cluster.\r\n#\r\n\r\nDATADIR=`grep datadir \/etc\/my.cnf|awk -F= '{print $2}'`\r\nTMPDIR=\/tmp\r\nDIGESTED_RELAY_LOGS_FILE=${DATADIR}\/digested_relay_logs.txt\r\n\r\ntouch $DIGESTED_RELAY_LOGS_FILE\r\nchown mysql:mysql $DIGESTED_RELAY_LOGS_FILE\r\n\r\nhostname=$(hostname)\r\n\r\necho &quot;deleting old relay logs from ${TMPDIR}&quot;\r\nrm ${TMPDIR}\/mysqld-relay-bin.&#x5B;0-9]*\r\n\r\necho &quot;Getting current relay log files&quot;\r\nexisting_relay_log_files=$(ls -tr ${DATADIR}\/mysqld-relay-bin.&#x5B;0-9]* | head -n -1)\r\nfor existing_relay_log_file in $existing_relay_log_files\r\ndo\r\n    cp -u $existing_relay_log_file $TMPDIR\r\ndone\r\necho &quot;flushing relay logs&quot;\r\n\/usr\/bin\/mysql -umyself -psecret -e 'flush relay logs\\G;' 2&gt;\/dev\/null\r\n# sleep because the log file takes some time to disappear\r\nsleep 1\r\n\r\necho &quot;Getting current relay log files&quot;\r\nexisting_relay_log_files=$(ls -tr ${DATADIR}\/mysqld-relay-bin.&#x5B;0-9]* | head -n -1)\r\nfor existing_relay_log_file in $existing_relay_log_files\r\ndo\r\n    cp -u $existing_relay_log_file $TMPDIR\r\ndone\r\n\r\ncd $TMPDIR\r\nfor relay_log_file in mysqld-relay-bin.&#x5B;0-9]*\r\ndo\r\n    # Filter this relay log file, since it's already been digested\r\n    grep $relay_log_file $DIGESTED_RELAY_LOGS_FILE &amp;&amp; rm $relay_log_file\r\ndone\r\n\r\nfor relay_log_file in mysqld-relay-bin.&#x5B;0-9]*\r\ndo\r\n    echo &quot;digesting $relay_log_file&quot;\r\n    mysqlbinlog $relay_log_file | \/usr\/bin\/pt-query-digest \\\r\n      --type binlog --order-by Query_time:cnt --group-by fingerprint --limit 100 \\\r\n      --review  P=3306,u=anemomaster,p=secret,h=anemomaster_host,D=anemomaster,t=global_query_review \\\r\n      --history P=3306,u=anemomaster,p=secret,h=anemomaster_host,D=anemomaster,t=global_query_review_history \\\r\n      --filter=&quot; \\$event-&gt;{Bytes} = length(\\$event-&gt;{arg}) and \\$event-&gt;{hostname}=\\&quot;$(hostname)\\&quot; &quot; \\\r\n      --no-report\r\n    echo &quot;$relay_log_file&quot; &gt;&gt; $DIGESTED_RELAY_LOGS_FILE\r\n    rm $relay_log_file\r\ndone\r\n# make sure the file does not bloat. 20 entries is more than enough.\r\ntail -n 20 $DIGESTED_RELAY_LOGS_FILE &gt; ${TMPDIR}\/DIGESTED_RELAY_LOGS_FILE\r\ncat ${TMPDIR}\/DIGESTED_RELAY_LOGS_FILE &gt; $DIGESTED_RELAY_LOGS_FILE\r\necho &quot;done&quot;\r\n<\/pre>\n<\/blockquote>\n<p>As for Anemometer, we patched it to support multiple environments (&#8220;clusters&#8221;) &#8212; but irrelevant to the DML change. If you just want to make it visualize DMLs, here&#8217;s the major configuration changes to <strong>config.inc.php<\/strong> (marked with <strong>bold<\/strong>):<\/p>\n<blockquote>\n<pre class=\"brush: php; title: ; notranslate\" title=\"\">\r\n$conf&#x5B;'history_defaults'] = array(\r\n\t'output'\t\t=&gt; 'table',\r\n\t'fact-group'\t=&gt; 'date',\r\n\t'fact-order'\t=&gt; 'date DESC',\r\n\t'fact-limit' =&gt; '90',\r\n\t'dimension-ts_min_start' =&gt; date(&quot;Y-m-d H:i:s&quot;, strtotime( '-90 day')),\r\n\t'dimension-ts_min_end'\t=&gt; date(&quot;Y-m-d H:i:s&quot;),\r\n\t'table_fields' =&gt; array('date', 'query_time_avg','ts_cnt','Query_time_sum')\r\n);\r\n\r\n$conf&#x5B;'report_defaults'] = array(\r\n\t'fact-group'\t=&gt; 'checksum',\r\n\t'fact-order'\t=&gt; 'ts_cnt DESC',\r\n\t'fact-limit' =&gt; '20',\r\n\t'dimension-ts_min_start' =&gt; date(&quot;Y-m-d H:i:s&quot;, strtotime( '-1 day')),\r\n\t'dimension-ts_min_end'\t=&gt; date(&quot;Y-m-d H:i:s&quot;),\r\n\t'table_fields' =&gt; array('checksum','snippet', 'query_time_avg','ts_cnt','Query_time_sum'),\r\n\t'dimension-pivot-hostname_max' =&gt; null,\r\n\t'dimension-pivot-clustername_max' =&gt; null\r\n);\r\n\r\n$conf&#x5B;'graph_defaults'] = array(\r\n\t'fact-group'\t=&gt; 'minute_ts',\r\n\t'fact-order'\t=&gt; 'minute_ts',\r\n\t'fact-limit' =&gt; '',\r\n\t'dimension-ts_min_start' =&gt; date(&quot;Y-m-d H:i:s&quot;, strtotime( '-7 day')),\r\n\t'dimension-ts_min_end'\t=&gt; date(&quot;Y-m-d H:i:s&quot;),\r\n\t'table_fields' =&gt; array('minute_ts'),\r\n\t\/\/ hack ... fix is to make query builder select the group and order fields,\r\n\t\/\/ then table fields only has to contain the plot_field\r\n\t'plot_field' =&gt; 'ts_cnt',\r\n);\r\n<\/pre>\n<\/blockquote>\n<p>With a <strong>10<\/strong> minute rotation &amp; digestion, we are able to analyze near real-time what&#8217;s been done on our masters. If we see a spike in <strong>com_insert\/com_update\/com_delete<\/strong>, or just see slave lags, we turn to <em>Anemomaster <\/em>and within a couple minutes know exactly what service is guilty of abusing our database. We are also working to protect our database against abuse, but that&#8217;s for another discussion.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Here&#8217;s our take of master DML query monitoring at Outbrain (presented April 2014). It took a half-day to code, implement, automate and deploy, and within the first hour of work we managed to catch multiple ill-doing services and scripts. You might want to try this out for yourself. What&#8217;s this about? What queries do you [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"enabled":false},"version":2}},"categories":[5],"tags":[106,105,46,57,8],"class_list":["post-6793","post","type-post","status-publish","format-standard","hentry","category-mysql","tag-anemometer","tag-devops","tag-monitoring","tag-open-source","tag-replication"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p2bZZp-1Lz","_links":{"self":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/6793","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/comments?post=6793"}],"version-history":[{"count":15,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/6793\/revisions"}],"predecessor-version":[{"id":6829,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/6793\/revisions\/6829"}],"wp:attachment":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/media?parent=6793"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/categories?post=6793"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/tags?post=6793"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}