Replication analysis with mycheckpoint

I would like to show how mycheckpoint (see announcement) can be put to use for analyzing various replication metrics.

Lagging slaves

A slave has been monitored. Monitoring started at a time when it was way behind its master (about two days of lag), but it has since caught up. This can easily be verified with the following chart:

The above chart can be obtained by viewing the HTML report:

SELECT html FROM sv_report_html

Or by directly issuing the query:

mysql> SELECT seconds_behind_master FROM sv_report_chart_hour\G
*************************** 1. row ***************************
seconds_behind_master: http://chart.apis.google.com/chart?cht=lc&chs=400x200&chts=303030,12&chtt=Nov+5,+10:00++-++Nov+10,+08:00+(4+days,+22+hours)&chdl=seconds_behind_master&chdlp=b&chco=ff8c00&chd=s:976431zzzywutrpnliiifdbZYXVTRRRPNLJHEBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA&chxt=x,y&chxr=1,0,169811&chxl=0:||Nov+6,+09:00|Nov+7,+09:00|Nov+8,+08:00|Nov+9,+08:00|&chxs=0,505050,10

This is all nice. But I’m also interested in the rate at which slave lag decreased. Many ignore this important metric: just how fast does your slave replicate?

Slave catchup speed

So we’re looking for the slope of the above graph. Luckily, the slope is nothing but the difference in seconds_behind_master divided by elapsed time. mycheckpoint supports the *_psec metrics for all known variables.
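
To make the arithmetic concrete, here is a minimal sketch of how such a per-second rate could be derived from two consecutive raw samples. The status_variables table and the id/ts columns are assumed names for illustration only, not necessarily mycheckpoint's actual schema:

-- Illustrative only: per-second change of seconds_behind_master between
-- two consecutive samples (assumed table and column names)
SELECT
  cur.ts,
  (cur.seconds_behind_master - prev.seconds_behind_master)
    / TIMESTAMPDIFF(SECOND, prev.ts, cur.ts) AS seconds_behind_master_psec
FROM status_variables AS cur
JOIN status_variables AS prev ON prev.id = cur.id - 1;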

Again, we can just look at the HTML report, or we can issue:

mysql> SELECT seconds_behind_master_psec FROM sv_report_chart_hour\G
*************************** 1. row ***************************
seconds_behind_master_psec: http://chart.apis.google.com/chart?cht=lc&chs=400x200&chts=303030,12&chtt=Nov+5,+10:00++-++Nov+10,+08:00+(4+days,+22+hours)&chdl=seconds_behind_master_psec&chdlp=b&chco=ff8c00&chd=s:fbdebXU0wvOgaYXVUIutFTacefXXS4xhYRTJAKf90444444444444444444444482444444445444444444444472544444444444444444444481544444&chxt=x,y&chxr=1,-2.35,0.19&chxl=0:||Nov+6,+09:00|Nov+7,+09:00|Nov+8,+08:00|Nov+9,+08:00|&chxs=0,505050,10

And get this chart:

The numbers show a reasonable slave catchup speed (lower is better). For example, a value of -1 indicates that the slave is replicating twice as fast as the master. So if we’re 10 minutes behind, catchup will occur 10 minutes from now (the slave will cover the missing 10 minutes plus the further 10 minutes the master writes in the meantime). At times replication went very fast (a slope as low as -2.25 on average), and at times it was very slow (-0.25).

But all negative numbers are good! They mean the slave is faster than the master. Indeed, once the slave caught up, the numbers climbed to 0: the slave has since been running at the same speed as the master.

Estimating slave catchup time

When we have a lagging slave, there’s nothing more interesting to know than “when will it finally catch up?”. We can do the math ourselves, calculating the slope and extrapolating the point in time at which the lag drops to 0; or we can just ask!
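
If we insisted on the manual math, a back-of-the-envelope sketch might look like the following; it assumes, for illustration only, that sv_report_hour also exposes seconds_behind_master and seconds_behind_master_psec columns:

-- Illustrative only: extrapolate catchup time from the latest lag and rate.
-- A negative psec value means the slave is gaining on the master.
SELECT
  ts,
  seconds_behind_master / -seconds_behind_master_psec AS manual_catchup_estimate_seconds
FROM sv_report_hour
WHERE seconds_behind_master_psec < 0
ORDER BY ts DESC
LIMIT 1;

But mycheckpoint already maintains such an estimate, so we can simply ask for it.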

Let’s first ask the hard way, then the human way:

Querying the history of estimations:

SELECT ts, estimated_slave_catchup_seconds FROM sv_report_hour;
+---------------------+---------------------------------+
| ts                  | estimated_slave_catchup_seconds |
+---------------------+---------------------------------+
| 2009-11-06 16:00:00 |                          142984 |
| 2009-11-06 17:00:00 |                           47988 |
| 2009-11-06 18:00:00 |                           31416 |
| 2009-11-06 19:00:00 |                           22896 |
| 2009-11-06 20:00:00 |                           20202 |
| 2009-11-06 21:00:00 |                           13062 |
| 2009-11-06 22:00:00 |                            7932 |
| 2009-11-06 23:00:00 |                            5220 |
| 2009-11-07 00:00:00 |                            2740 |
| 2009-11-07 01:00:00 |                            NULL |
| 2009-11-07 02:00:00 |                             463 |
| 2009-11-07 03:00:00 |                            NULL |
| 2009-11-07 04:00:00 |                            NULL |
| 2009-11-07 05:00:00 |                            NULL |
| 2009-11-07 06:00:00 |                            NULL |
...
+---------------------+---------------------------------+

(NULL occurs when the slave has caught up, so there is nothing to estimate.)
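
Turning such an estimate into an actual ETA is simple date arithmetic; a small sketch, assuming the same sv_report_hour view:

-- Illustrative only: convert the catchup estimate into an ETA timestamp
SELECT
  ts,
  ts + INTERVAL estimated_slave_catchup_seconds SECOND AS estimated_catchup_eta
FROM sv_report_hour
WHERE estimated_slave_catchup_seconds IS NOT NULL;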

Had we been checking the human reports at the time of the lag, we would have seen something like:

mysql> SELECT report FROM sv_report_human_hour ORDER BY ID DESC LIMIT 1\G
*************************** 1. row ***************************
report:
Report period: 2009-11-06 16:00:00 to 2009-11-06 17:00:00. Period is 60 minutes (1.00 hours)
Uptime: 100.0% (Up: 1 days, 05:34:00 hours)

...
...
...

Replication:
    Master status file number: 541, position: 825383347
    Relay log space limit: 10737418240, used: 4512632192  (42.0%)
    Seconds behind master: 47185
    Estimated time for slave to catch up: 142984 seconds (1 days, 15:43:05 hours)  ETA: 2009-11-08 07:43:05

The human report suggested it would take more than a day and a half for replication to catch up; it also provided the ETA: ‘2009-11-08 07:43:05’.

As an interesting note, to produce the above report I went back in time and issued:

SELECT report FROM sv_report_human_hour WHERE ts='2009-11-06 16:00:00'\G

It’s possible to retroactively generate reports, for as long as the data still exists.

Checking up on relay log space

Slaves are limited in relay log space: they can only accumulate so much relay log before the limit is hit.

Usually the slave SQL thread is slower than the IO thread, which means that when the slave lags behind, it writes more and more relay logs while struggling to apply them. Once it runs out of relay log space, it stops asking the master for more binary logs and waits for relay log space to free up.
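
Independently of mycheckpoint, the configured limit and the current usage can be checked directly on the slave with standard MySQL commands:

-- relay_log_space_limit is the configured ceiling (0 means unlimited);
-- Relay_Log_Space in SHOW SLAVE STATUS is the space currently in use
SHOW GLOBAL VARIABLES LIKE 'relay_log_space_limit';
SHOW SLAVE STATUS\G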

How much relay log space should a slave have?

Well, if you’re not worried about disk space, let it have as much as it likes; but let’s look at the implications of the relay log space limit. Again, by either looking at the HTML report or by directly issuing:

mysql> SELECT relay_log_used_mb FROM sv_report_chart_hour\G
*************************** 1. row ***************************
relay_log_used_mb: http://chart.apis.google.com/chart?cht=lc&chs=400x200&chts=303030,12&chtt=Nov+5,+10:00++-++Nov+10,+09:00+(4+days,+23+hours)&chdl=relay_log_space_limit_mb|relay_log_space_mb&chdlp=b&chco=ff8c00,4682b4&chd=s:999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999,9999999999989570xytsuxursojfhcabWRSOJEGBCBCDEGBCEBDEFGBEGBDFBCEFGBCDEGBDFBDFCFACEGBCDEGBCBBDEFACEFBCEGBCDEFABCEBEDFACDEF&chxt=x,y&chxr=1,0,10240.10309792&chxl=0:||Nov+6,+09:00|Nov+7,+09:00|Nov+8,+09:00|Nov+9,+09:00|&chxs=0,505050,10

We get:

I see that a 2-day lag (comparing with the above charts) would cause the relay log space to fill up. I consider this to be OK, since I don’t normally expect such lags. I can see that a one-day lag is well within relay log space. I conclude that the relay_log_space_limit server variable is properly configured in my case.
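
One way to back this conclusion with numbers, rather than by eyeballing the chart, is to compare the peak recorded usage with the limit. A sketch, assuming sv_report_hour exposes relay_log_space_mb and relay_log_space_limit_mb columns (names taken from the chart legend above, so treat them as an assumption):

-- Illustrative only: peak relay log usage vs. the configured limit
SELECT
  MAX(relay_log_space_mb) AS peak_used_mb,
  MAX(relay_log_space_limit_mb) AS limit_mb,
  ROUND(100 * MAX(relay_log_space_mb) / MAX(relay_log_space_limit_mb), 1) AS peak_used_pct
FROM sv_report_hour;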

