Performance analysis with mycheckpoint

November 12, 2009

mycheckpoint (see announcement) allows for both graph presentation and quick SQL access to monitored & analyzed data. I'd like to show the power of combining the two.

InnoDB performance

Taking a look at one of the most important InnoDB metrics: the read hit ratio (we could get the same graph by looking at the HTML report):

SELECT innodb_read_hit_percent FROM sv_report_chart_sample \G
*************************** 1. row ***************************
innodb_read_hit_percent: [chart URL]

We see that the read hit ratio is usually high, but occasionally drops low, down to 99.7 or even 99.6. It seems like most of the time we are above 99.95%. Whether we're usually above 99.98% is harder to tell.

Can we know for sure?

We can stress our eyes, yet be certain of little. It's best if we just query for the metrics! mycheckpoint provides all the data, accessible via simple SQL queries:

SELECT SUM(innodb_read_hit_percent > 99.95)/count(*)
  FROM sv_report_sample;
| SUM(innodb_read_hit_percent > 99.95)/count(*) |
|                                        0.7844 |

Yes, most of the time we're above a 99.95% read hit ratio: but only 78% of the time, which isn't all that often!
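Since the condition is just a SQL expression (MySQL evaluates a boolean as 1 or 0), nothing stops us from checking several thresholds in one pass. A sketch against the same sv_report_sample table (the thresholds are arbitrary):

```sql
SELECT
  SUM(innodb_read_hit_percent > 99.90)/COUNT(*) AS above_99_90,
  SUM(innodb_read_hit_percent > 99.95)/COUNT(*) AS above_99_95,
  SUM(innodb_read_hit_percent > 99.98)/COUNT(*) AS above_99_98
FROM sv_report_sample;
```

One scan of the table, and the whole distribution picture at a glance.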

I'm more interested in seeing how much time my server's above 99.98% read hit:

SELECT SUM(innodb_read_hit_percent > 99.98)/count(*)
  FROM sv_report_sample;
| SUM(innodb_read_hit_percent > 99.98)/count(*) |
|                                        0.3554 |

We can see the server exceeds a 99.98% read hit ratio only 35% of the time. Need to work on that!

Disk activity

A lower read hit percent means a higher number of disk reads; that much is obvious. The first two graphs below present this obvious connection. But the third graph tells us another fact: with increased disk I/O, we can expect more (and longer) locks.

Again, this should be very intuitive when thought about this way. The problem sometimes arises when we try to analyze it the other way round: "Hey! InnoDB has a lot of locks! What are we going to do about it?". Often, people will look for answers in their transactions, their isolation level, their LOCK IN SHARE MODE clauses. But the simple answer can be: "There's a lot of I/O, so everything has to wait; that increases the probability of locks; therefore there are more locks".

The answer, then, is to reduce I/O. The usual stuff: slow queries; indexing; and, yes, perhaps transactions or tuning.

The charts below make it quite clear that we have a chain of excessive reads -> lower read hit -> increased I/O -> more locks.
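The same cause-and-effect chain can be eyeballed numerically rather than visually. As a sketch, we could sort hours by disk read activity and see whether lock waits climb along with it. The column names below are illustrative, not confirmed; check your own sv_report_* tables for the exact metric names your mycheckpoint version records:

```sql
SELECT
  innodb_buffer_pool_reads_psec,  -- hypothetical: disk reads (buffer pool misses) per second
  innodb_row_lock_waits_psec      -- hypothetical: row lock waits per second
FROM sv_report_hour_recent
ORDER BY innodb_buffer_pool_reads_psec DESC
LIMIT 10;
```

If the busiest-I/O hours are also the lockiest, that supports the "I/O first, locks second" reading of the charts.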

By the way, the above resulted from the fact that, due to a problematic query, all slaves stopped replicating. The slaves participated in read-balancing, so when they went stale, all reads were directed at the master (the monitored node).

You have the metrics at your disposal

Looking at the following chart:

It appears there are no slow queries. But this may be misleading: perhaps there are just a few, too few to show up at the chart's large scale?

One could argue that this is the chart's fault. Perhaps there should be a distinct chart for "slow queries percent". Perhaps I'll add one. But we can't have a special chart for everything; it would be too tiresome to look at hundreds of charts.

Anyway, my point is: let's verify just how many slow queries we have:

SELECT slow_queries_psec FROM sv_hour ORDER BY id DESC LIMIT 10;
| slow_queries_psec |
|              3.05 |
|              3.83 |
|              4.39 |
|              4.03 |
|              3.86 |
|              3.56 |
|              3.73 |
|              3.79 |
|              3.58 |
|              3.55 |

So, between 3 and 4 slow queries per second. That doesn't look too good in this light. Let's check the percentage of slow queries (out of total questions):

SELECT ROUND(100*slow_queries_diff/questions_diff, 1) AS slow_queries_percent
  FROM sv_hour ORDER BY id DESC LIMIT 10;

Or, since the above calculation is pre-defined in the reports tables:

SELECT slow_queries_percent FROM sv_report_hour_recent;
| slow_queries_percent |
|                  0.8 |
|                  1.0 |
|                  1.2 |
|                  1.2 |
|                  1.1 |
|                  1.0 |
|                  1.1 |
|                  1.1 |
|                  1.0 |

Accessible data

This is what I've been trying to achieve with mycheckpoint. As a DBA, consultant and SQL geek, I find that direct SQL access works best for me. It's like preferring the command line over GUI tools: direct SQL gives you so much more control and information.

Charting is important, since it makes it easy to get a first impression or spot extreme changes. But beware of relying on charts all the time. Scale issues, misleading human interpretation, technology limitations: all these can make charts inaccurate.

mycheckpoint allows for both methods, and, I believe, intuitively so.

