{"id":1529,"date":"2009-11-11T05:50:28","date_gmt":"2009-11-11T03:50:28","guid":{"rendered":"http:\/\/code.openark.org\/blog\/?p=1529"},"modified":"2009-11-12T07:45:29","modified_gmt":"2009-11-12T05:45:29","slug":"replication-analysis-with-mycheckpoint","status":"publish","type":"post","link":"https:\/\/code.openark.org\/blog\/mysql\/replication-analysis-with-mycheckpoint","title":{"rendered":"Replication analysis with mycheckpoint"},"content":{"rendered":"<p>I would like to show how <a href=\"http:\/\/code.openark.org\/forge\/mycheckpoint\">mycheckpoint<\/a> (see <a href=\"http:\/\/code.openark.org\/blog\/mysql\/announcing-mycheckpoint-lightweight-sql-oriented-monitoring-for-mysql\">announcement<\/a>) can be put to use for analyzing various replication metrics.<\/p>\n<h4>Lagging slaves<\/h4>\n<p>A slave has been monitored. Monitoring started at a time when it was way behind its master (about two days of lag), but it has since caught up. This can be easily verified by the following chart:<\/p>\n<blockquote>\n<pre><img loading=\"lazy\" decoding=\"async\" class=\"alignnone\" title=\"seconds_behind_master\" src=\"http:\/\/chart.apis.google.com\/chart?cht=lc&amp;chs=400x200&amp;chts=303030,12&amp;chtt=Nov+5,+10:00++-++Nov+10,+08:00+(4+days,+22+hours)&amp;chdl=seconds_behind_master&amp;chdlp=b&amp;chco=ff8c00&amp;chd=s:976431zzzywutrpnliiifdbZYXVTRRRPNLJHEBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA&amp;chxt=x,y&amp;chxr=1,0,169811&amp;chxl=0:||Nov+6,+09:00|Nov+7,+09:00|Nov+8,+08:00|Nov+9,+08:00|&amp;chxs=0,505050,10\" alt=\"\" width=\"400\" height=\"200\" \/><\/pre>\n<\/blockquote>\n<p>The above chart can be obtained by viewing the HTML report:<\/p>\n<blockquote>\n<pre>SELECT <strong>html<\/strong> FROM sv_report_html<\/pre>\n<\/blockquote>\n<p>Or by directly issuing the query:<\/p>\n<blockquote>\n<pre>mysql&gt; SELECT <strong>seconds_behind_master<\/strong> FROM sv_report_chart_hour\\G\r\n*************************** 1. 
row ***************************\r\nseconds_behind_master: http:\/\/chart.apis.google.com\/chart?cht=lc&amp;chs=400x200&amp;chts=303030,12&amp;chtt=Nov+5,+10:00++-++Nov+10,+08:00+(4+days,+22+hours)&amp;chdl=seconds_behind_master&amp;chdlp=b&amp;chco=ff8c00&amp;chd=s:976431zzzywutrpnliiifdbZYXVTRRRPNLJHEBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA&amp;chxt=x,y&amp;chxr=1,0,169811&amp;chxl=0:||Nov+6,+09:00|Nov+7,+09:00|Nov+8,+08:00|Nov+9,+08:00|&amp;chxs=0,505050,10<\/pre>\n<\/blockquote>\n<p>This is all nice. But I&#8217;m also interested in the <em>rate<\/em> at which slave lag decreased. Many ignore this important metric: just how fast does your slave replicate?<\/p>\n<h4><!--more-->Slave catchup speed<\/h4>\n<p>So we&#8217;re looking for the <em>slope<\/em> of the above graph. Luckily, the slope is nothing but the difference in <strong>seconds_behind_master<\/strong> divided by elapsed time. <em>mycheckpoint<\/em> supports the <strong>*_psec<\/strong> metrics for all known variables.<\/p>\n<p>Again, we can just look at the HTML report, or we can issue:<\/p>\n<blockquote>\n<pre>mysql&gt; SELECT <strong>seconds_behind_master_psec<\/strong> FROM sv_report_chart_hour\\G\r\n*************************** 1. 
row ***************************\r\nseconds_behind_master_psec: http:\/\/chart.apis.google.com\/chart?cht=lc&amp;chs=400x200&amp;chts=303030,12&amp;chtt=Nov+5,+10:00++-++Nov+10,+08:00+(4+days,+22+hours)&amp;chdl=seconds_behind_master_psec&amp;chdlp=b&amp;chco=ff8c00&amp;chd=s:fbdebXU0wvOgaYXVUIutFTacefXXS4xhYRTJAKf90444444444444444444444482444444445444444444444472544444444444444444444481544444&amp;chxt=x,y&amp;chxr=1,-2.35,0.19&amp;chxl=0:||Nov+6,+09:00|Nov+7,+09:00|Nov+8,+08:00|Nov+9,+08:00|&amp;chxs=0,505050,10<\/pre>\n<\/blockquote>\n<p>And get this chart:<\/p>\n<blockquote>\n<pre><img loading=\"lazy\" decoding=\"async\" class=\"alignnone\" title=\"seconds_behind_master_psec\" src=\"http:\/\/chart.apis.google.com\/chart?cht=lc&amp;chs=400x200&amp;chts=303030,12&amp;chtt=Nov+5,+10:00++-++Nov+10,+08:00+(4+days,+22+hours)&amp;chdl=seconds_behind_master_psec&amp;chdlp=b&amp;chco=ff8c00&amp;chd=s:fbdebXU0wvOgaYXVUIutFTacefXXS4xhYRTJAKf90444444444444444444444482444444445444444444444472544444444444444444444481544444&amp;chxt=x,y&amp;chxr=1,-2.35,0.19&amp;chxl=0:||Nov+6,+09:00|Nov+7,+09:00|Nov+8,+08:00|Nov+9,+08:00|&amp;chxs=0,505050,10\" alt=\"\" width=\"400\" height=\"200\" \/><\/pre>\n<\/blockquote>\n<p>The numbers show a reasonable slave catchup speed (lower is better). For example, a <strong>-1<\/strong> value indicates that the slave is replicating twice as fast as the master. So if we&#8217;re 10 minutes behind, catchup will occur 10 minutes from now (the slave covers the existing 10 minutes of lag, plus the additional 10 minutes the master writes in the meantime). At times, replication went very fast (slope as low as <strong>-2.25<\/strong> on average), and at times it was very slow (<strong>-0.25<\/strong>).<\/p>\n<p>But all negative numbers are good! They mean the slave is faster than the master. 
Indeed, once the slave caught up, the numbers climbed to <strong>0<\/strong>: the slave has since been running at the same speed as the master.<\/p>\n<h4>Estimating slave catchup time<\/h4>\n<p>When we have a lagging slave, there&#8217;s nothing more interesting to know than &#8220;when will it finally catch up?&#8221;. We can do the math: calculate the slope, then extrapolate the expected point in time where the lag drops to <strong>0<\/strong>; or we can just ask!<\/p>\n<p>Let&#8217;s first ask the hard way, then the human way:<\/p>\n<p>Querying the history of estimations:<\/p>\n<blockquote>\n<pre>SELECT ts, <strong>estimated_slave_catchup_seconds<\/strong> FROM sv_report_hour;\r\n+---------------------+---------------------------------+\r\n| ts                  | estimated_slave_catchup_seconds |\r\n+---------------------+---------------------------------+\r\n| 2009-11-06 16:00:00 |                          142984 |\r\n| 2009-11-06 17:00:00 |                           47988 |\r\n| 2009-11-06 18:00:00 |                           31416 |\r\n| 2009-11-06 19:00:00 |                           22896 |\r\n| 2009-11-06 20:00:00 |                           20202 |\r\n| 2009-11-06 21:00:00 |                           13062 |\r\n| 2009-11-06 22:00:00 |                            7932 |\r\n| 2009-11-06 23:00:00 |                            5220 |\r\n| 2009-11-07 00:00:00 |                            2740 |\r\n| 2009-11-07 01:00:00 |                            NULL |\r\n| 2009-11-07 02:00:00 |                             463 |\r\n| 2009-11-07 03:00:00 |                            NULL |\r\n| 2009-11-07 04:00:00 |                            NULL |\r\n| 2009-11-07 05:00:00 |                            NULL |\r\n| 2009-11-07 06:00:00 |                            NULL |\r\n...\r\n+---------------------+---------------------------------+<\/pre>\n<\/blockquote>\n<p>(NULL occurs when the slave has caught up, so there is nothing to estimate.)<\/p>\n<p>Had we been checking up on human reports at the time of 
lag, we would see something like:<\/p>\n<blockquote>\n<pre>mysql&gt; SELECT <strong>report<\/strong> FROM sv_report_human_hour ORDER BY ID DESC LIMIT 1\\G\r\n*************************** 1. row ***************************\r\nreport:\r\nReport period: 2009-11-06 16:00:00 to 2009-11-06 17:00:00. Period is 60 minutes (1.00 hours)\r\nUptime: 100.0% (Up: 1 days, 05:34:00 hours)\r\n\r\n...\r\n...\r\n...\r\n\r\nReplication:\r\n    Master status file number: 541, position: 825383347\r\n    Relay log space limit: 10737418240, used: 4512632192  (42.0%)\r\n    Seconds behind master: 47185\r\n    <strong>Estimated time for slave to catch up: 142984 seconds (1 days, 15:43:05 hours)  ETA: 2009-11-08 07:43:05<\/strong><\/pre>\n<\/blockquote>\n<p>The human report suggested it would take more than a day and a half for replication to catch up; it also provided an ETA: &#8216;<strong>2009-11-08 07:43:05<\/strong>&#8217;.<\/p>\n<p>As an interesting note, to issue the above report I <em>went back in time<\/em> and issued:<\/p>\n<blockquote>\n<pre>SELECT report FROM sv_report_human_hour <strong>WHERE ts='2009-11-06 16:00:00'<\/strong>\\G<\/pre>\n<\/blockquote>\n<p>It&#8217;s possible to retroactively generate reports, for as long as the data still exists.<\/p>\n<h4>Checking up on relay log space<\/h4>\n<p>Slaves are limited in relay log space: they can only accumulate so much relay log data.<\/p>\n<p>Usually the slave SQL thread is slower than the IO thread, which means that when the slave lags behind, it writes more and more relay logs, while having a hard time reading them. Once it runs out of relay log space, it stops asking the master for more binary logs, and waits for relay log space to free up.<\/p>\n<p>How much relay log space should a slave have?<\/p>\n<p>Well, if you&#8217;re not worried about disk space, let it have as much as it would like to; but let&#8217;s look at the implications of the relay log space limit. 
Again, by either looking at the HTML report or by directly issuing:<\/p>\n<blockquote>\n<pre>mysql&gt; SELECT relay_log_used_mb FROM sv_report_chart_hour\\G\r\n*************************** 1. row ***************************\r\nrelay_log_used_mb: http:\/\/chart.apis.google.com\/chart?cht=lc&amp;chs=400x200&amp;chts=303030,12&amp;chtt=Nov+5,+10:00++-++Nov+10,+09:00+(4+days,+23+hours)&amp;chdl=relay_log_space_limit_mb|relay_log_space_mb&amp;chdlp=b&amp;chco=ff8c00,4682b4&amp;chd=s:999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999,9999999999989570xytsuxursojfhcabWRSOJEGBCBCDEGBCEBDEFGBEGBDFBCEFGBCDEGBDFBDFCFACEGBCDEGBCBBDEFACEFBCEGBCDEFABCEBEDFACDEF&amp;chxt=x,y&amp;chxr=1,0,10240.10309792&amp;chxl=0:||Nov+6,+09:00|Nov+7,+09:00|Nov+8,+09:00|Nov+9,+09:00|&amp;chxs=0,505050,10<\/pre>\n<\/blockquote>\n<p>We get:<\/p>\n<blockquote>\n<pre><img loading=\"lazy\" decoding=\"async\" class=\"alignnone\" title=\"relay_log_used_mb\" src=\"http:\/\/chart.apis.google.com\/chart?cht=lc&amp;chs=400x200&amp;chts=303030,12&amp;chtt=Nov+5,+10:00++-++Nov+10,+09:00+(4+days,+23+hours)&amp;chdl=relay_log_space_limit_mb|relay_log_space_mb&amp;chdlp=b&amp;chco=ff8c00,4682b4&amp;chd=s:999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999,9999999999989570xytsuxursojfhcabWRSOJEGBCBCDEGBCEBDEFGBEGBDFBCEFGBCDEGBDFBDFCFACEGBCDEGBCBBDEFACEFBCEGBCDEFABCEBEDFACDEF&amp;chxt=x,y&amp;chxr=1,0,10240.10309792&amp;chxl=0:||Nov+6,+09:00|Nov+7,+09:00|Nov+8,+09:00|Nov+9,+09:00|&amp;chxs=0,505050,10\" alt=\"\" width=\"400\" height=\"200\" \/><\/pre>\n<\/blockquote>\n<p>I see that a two-day lag (comparing with the above charts) will cause the relay log space to fill up. I would consider this to be OK, since I don&#8217;t normally expect such lags. I can see that a one-day lag is well within relay log space. 
I conclude that the <strong>relay_log_space_limit<\/strong> server variable is properly configured in my case.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I would like to show how mycheckpoint (see announcement) can be put to use for analyzing various replication metrics. Lagging slaves A slave has been monitored. Monitoring started at a time when it was way behind its master (about two days of lag), but it has since caught up. 
This can be easily verified by the following [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"enabled":false},"version":2}},"categories":[5],"tags":[25,46,49,8],"class_list":["post-1529","post","type-post","status-publish","format-standard","hentry","category-mysql","tag-analysis","tag-monitoring","tag-mycheckpoint","tag-replication"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p2bZZp-oF","_links":{"self":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/1529","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/comments?post=1529"}],"version-history":[{"count":26,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/1529\/revisions"}],"predecessor-version":[{"id":1571,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/1529\/revisions\/1571"}],"wp:attachment":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/media?parent=1529"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/code
.openark.org\/blog\/wp-json\/wp\/v2\/categories?post=1529"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/tags?post=1529"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}