MySQL monitoring: storing, not caching

February 22, 2012

I've read with interest Baron's Why don't our new Nagios plugins use caching? and Sheeri's Caching for Monitoring: Timing is Everything. I wish to present my take on this, from mycheckpoint's point of view.

So mycheckpoint works in a completely different way. On one hand, it doesn't bother with caching. On the other hand, it doesn't bother with re-reads of data.

There are no staleness issues, the data is as consistent as it can get (you can never get a completely atomic read of everything in MySQL), and you can issue as many calculations as you want for the price of a single monitoring sample. As in Sheeri's example, you can run Threads_connected/max_connections*100, mix status variables, system variables, meta-variables (e.g. Seconds_behind_master), user-created variables (e.g. number of purchases in your online shop), etc.

mycheckpoint's concept is to store data. And store it in relational format. That is, INSERT it to a table.

A sampling run generates one row, which lists all status, server, OS, user and meta variables. It's a huge row, with hundreds of columns: columns like threads_connected, max_connections, innodb_buffer_pool_size, seconds_behind_master, etc.

mycheckpoint hardly cares about these columns; it identifies them dynamically. Have you just upgraded to MySQL 5.5? Oh, there's a bunch of new server and status variables? No problem: mycheckpoint will notice it doesn't have the matching columns and will add them via ALTER TABLE. There you go, now we have a place to store them.
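The column-detection idea can be sketched in a few lines of Python. This is a hypothetical illustration, not mycheckpoint's actual code: given the table's existing columns and the variable names the server currently reports, generate the single ALTER TABLE that fills the gap.

```python
# Hypothetical sketch (not mycheckpoint's actual code) of detecting
# missing columns and generating one ALTER TABLE to add them all.

def missing_columns_alter(table, existing_columns, variables):
    """Return an ALTER TABLE statement adding a column for every
    reported variable not yet present in the table, or None if the
    schema is already up to date."""
    existing = {c.lower() for c in existing_columns}
    new_vars = sorted(v.lower() for v in variables if v.lower() not in existing)
    if not new_vars:
        return None
    clauses = ", ".join("ADD COLUMN %s BIGINT" % v for v in new_vars)
    return "ALTER TABLE %s %s" % (table, clauses)

# After an upgrade, SHOW GLOBAL STATUS reports two counters we've
# never seen before (names here are made up):
stmt = missing_columns_alter(
    "status_variables",
    ["id", "threads_connected", "max_connections"],
    ["Threads_connected", "Max_connections",
     "Innodb_new_counter", "Another_new_one"],
)
print(stmt)
```

Note the column type is a simplification; real variables come in several types, and (as noted further down) textual variables need different treatment.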

Running a formula like Threads_connected/max_connections*100 is as easy as issuing the following query:

SELECT Threads_connected/max_connections*100 FROM status_variables WHERE id = ...

Hmmm. This means I can run this formula on the most recent row I've just added. But wait, this also means I can run this formula on any row I've ever gathered.

With mycheckpoint you can generate graphs retroactively using new formulas. The data is there, vanilla style; any formula that can be calculated via SQL is good to go. Plus, you get the benefit of cross-referencing in fun ways: against the timestamp at which the sample was taken (so, for example, ignore the spikes generated at this or that timeframe due to maintenance; don't alert me on those), or against system metrics like load average or CPU usage (show me the average Seconds_behind_master when load average is over 8, or the average load average when the slow query rate is over some threshold). You don't do that all the time, but when you need it, you can get all the insight you ever wanted.

Actually storing the monitored data in an easy to access format allows one to query, re-query, re-formulate. No worries about caching, you only sample once.
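The store-once, query-forever idea can be demonstrated end to end with a few rows. The sketch below uses SQLite as a self-contained stand-in for the MySQL table; the table and column names are illustrative, not mycheckpoint's exact schema:

```python
import sqlite3

# SQLite stands in for mycheckpoint's MySQL table in this
# self-contained demo; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE status_variables (
    id INTEGER PRIMARY KEY,
    ts TEXT,
    threads_connected INTEGER,
    max_connections INTEGER)""")
samples = [
    (1, "2012-02-22 10:00", 30, 150),
    (2, "2012-02-22 10:05", 45, 150),
    (3, "2012-02-22 10:10", 90, 150),
]
conn.executemany("INSERT INTO status_variables VALUES (?,?,?,?)", samples)

# The formula runs on the most recent row -- or, just as easily,
# on every row ever gathered, retroactively.
pct = conn.execute("""SELECT ts, threads_connected * 100.0 / max_connections
                      FROM status_variables ORDER BY id""").fetchall()
for ts, p in pct:
    print(ts, p)
```

A formula invented a month after the samples were taken works on them just the same, which is the whole point of storing rather than caching.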

For completeness, all the above is relevant when the data is of numeric types. Other types are far more complicated to manage (the list of running queries is a common example).

  • And in this respect MONyog and mycheckpoint seem to use the same concept as regards storing versus caching. However, the schemas are different and the 'division of labor' between SQL and JavaScript is also not exactly the same. Refer: .. (slightly abbreviated)

    1) Every time MONyog sends a statement like SHOW VARIABLES/STATUS of some kind (or fetches an OS metric from the Linux /proc folder), one row is INSERTED into the [timestamp_master] table with information about the current time, and the metrics retrieved for each such statement are INSERTED into [snapshot_master]. The [snapshot_master] table contains the metrics details.

    2) And actually we do not always INSERT into [snapshot_master]. Only if the particular metric has changed since the last time something was INSERTED for that metric do we INSERT again.

    3) Finally note that [snapshot_master] does not have the names of the metrics. It is not possible to know in advance what metrics the server will return as it depends on server details (version and configuration). And actually a server may be upgraded.
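The "INSERT only if the metric changed" idea in point 2 of the comment above can be sketched as follows. This is my guess at the mechanism, not MONyog's actual schema or code:

```python
# Rough sketch of "INSERT only on change" (point 2 above).
# A guess at the idea, not MONyog's actual implementation.

class SnapshotStore:
    def __init__(self):
        self.rows = []        # stand-in for the [snapshot_master] table
        self.last_seen = {}   # metric name -> last stored value

    def record(self, timestamp, metrics):
        for name, value in metrics.items():
            if self.last_seen.get(name) == value:
                continue      # unchanged since the last INSERT: skip it
            self.rows.append((timestamp, name, value))
            self.last_seen[name] = value

store = SnapshotStore()
store.record(1, {"Threads_connected": 30, "Uptime": 100})
# Threads_connected is unchanged, so only Uptime is stored this round:
store.record(2, {"Threads_connected": 30, "Uptime": 200})
```

The trade-off versus mycheckpoint's one-wide-row approach: rows stay small and sparse, but reconstructing "the full state at time T" requires walking back to each metric's most recent change.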

  • It sounds like mycheckpoint has a very wide table - one column for each status variable? There are over 300 of those... and another 300+ system variables.

    I think it makes sense that 3 tools that do different things approach the problem in 3 different ways:

    - The Percona Nagios plugin set is not concerned with caching, because it does not reuse any of the information it retrieves, it only uses the information once.

    - The PalominoDB Nagios plugin caches because the same plugin can be used over and over to check different calculations of variables, and is designed so that the check could be run efficiently within a check interval. Ironically it was designed this way because of the benefits realized from the MySQL Cacti Templates.

    - mycheckpoint stores because the data is used for graphing, but can also be used for alerting. In a way this is like cacti itself - it stores the data for graphing, and can also be used for alerting.

    I made my post because I was worried folks would think "caching for monitoring is bad" - Baron specifically said that the Percona plugins are run infrequently, but I know there are many folks out there that would quickly turn his post into a "caching for monitoring is bad" myth.

  • I use mycheckpoint and it's brilliant.
    I set it up to send me emails with reports about the servers I monitor instead of me logging into trending tools (I'm lazy that way).

    I have already been using it with the user_statistics patch to see trends in some of my top tables.
    I guess I could also monitor mycheckpoint data itself to get more info from it, should I choose to.

  • @Sheeri,
    not every single variable is monitored. Some are known to be skipped, such as ssl_* and other textual columns.
    On a Percona Server 5.1 installation I have 480 monitored data columns, including some custom counters added by the user.

  • Shlomi - ah, that's good! One day I'll get around to implementing your tools in my environment (now that I HAVE one that I'm in charge of at Mozilla! Consulting is great but it means you never really own anything and research isn't always easy to do)....for now it's back to basics for me, I'm still figuring out which machines I'm responsible for, and it's not a small amount!

  • Sheeri,
    Will be happy if you do, and happier for any feedback.

  • 300 columns today, 305 tomorrow when you upgrade MySQL. Are you ready to generate the 5 ALTER TABLE ADD COLUMN clauses?

    A couple of years ago, I looked at that and said yuck to that. I looked at key-value stores, but that performs terribly at scale. Instead, I stuffed the values into a JSON structure in a single column. Then the scripts can either do a json_decode, or a fairly simple regexp to reach into the JSON.

    So, currently I provide the expression, the host(s) desired, the timeframe, etc., and PHP does all the grunt work. It performs quite adequately.

    I don't need a WHERE clause on the status/variables.

    Since I deal with hundreds of servers, from 4.0 to 5.1, and Percona, this mechanism does not care about new names. (Name changes like "table_cache" vs "table_open_cache" are a nuisance.)
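Rick's JSON-in-a-single-column approach might look roughly like this. This is a sketch under my own assumptions, in Python rather than his PHP, with made-up names:

```python
import json

# Sketch of stuffing all status variables into one JSON text column,
# then reaching into it at query time (the approach described above;
# the original is PHP, and these names are made up).

def encode_sample(status):
    """Serialize one SHOW GLOBAL STATUS result into a single value,
    suitable for a single text column."""
    return json.dumps(status, sort_keys=True)

def metric(sample_json, name):
    """Pull one metric back out -- the json_decode step described."""
    return json.loads(sample_json).get(name)

row = encode_sample({"Threads_connected": 30, "Max_used_connections": 55})
print(metric(row, "Threads_connected"))
```

No ALTER TABLE is ever needed when new variables appear, at the cost of not being able to put the variables directly in a WHERE clause, which matches the comment's point.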

  • @Rick,
    5 columns, *ONE* ALTER TABLE 🙂
    And, yes, I'm willing, since these tables are not that large. A year's worth of data comes to ~500MB - not that much to refactor.

    I urge you to, just for the fun of it, give mycheckpoint a try. I think you'll be surprised at the depth and detail one can fit within such a small footprint, in one python script.

    I'm not belittling other monitoring tools, of course. I can't compare mycheckpoint to cacti, they're on a completely different scale. I'm just saying. Try it.
