I’ve followed with interest Baron’s Why don’t our new Nagios plugins use caching? and Sheeri’s Caching for Monitoring: Timing is Everything. I wish to present my take on this, from mycheckpoint’s point of view.
So mycheckpoint works in a completely different way. On the one hand, it doesn’t bother with caching; on the other, it doesn’t bother with re-reading data.
There are no staleness issues, the data is as consistent as it can get (you can never get a completely atomic read of everything in MySQL), and you can run as many calculations as you want at the price of a single monitoring sample. As in Sheeri’s example, you can compute Threads_connected/max_connections*100, mix status variables, system variables, meta-variables (e.g. Seconds_behind_master), user-created variables (e.g. number of purchases in your online shop), etc.
mycheckpoint’s concept is to store data. And store it in relational format. That is, INSERT it into a table.
A sample run generates a row, which lists all status, server, OS, user and meta variables. It’s a huge row, with hundreds of columns: columns like threads_connected, max_connections, innodb_buffer_pool_size, seconds_behind_master, etc.
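To give a feel for it, here is a trimmed-down sketch of what such a table might look like (the id and ts columns and all types are my assumptions for illustration; the real table holds hundreds of columns):

CREATE TABLE status_variables (
  id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  threads_connected INT UNSIGNED,
  max_connections INT UNSIGNED,
  innodb_buffer_pool_size BIGINT UNSIGNED,
  seconds_behind_master INT UNSIGNED
  -- ... and hundreds more columns, one per monitored variable
);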
mycheckpoint hardly cares about these columns; it identifies them dynamically. Have you just upgraded to MySQL 5.5? Oh, there’s a bunch of new server and status variables? No problem: mycheckpoint notices it doesn’t have the matching columns and adds them via ALTER TABLE. There you go, now we have a place to store them.
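Something along these lines (a sketch; innodb_buffer_pool_bytes_data stands in here for whatever new variable gets detected):

ALTER TABLE status_variables
  ADD COLUMN innodb_buffer_pool_bytes_data BIGINT UNSIGNED;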
Running a formula like Threads_connected/max_connections*100 is as easy as issuing the following query:
SELECT Threads_connected/max_connections*100 FROM status_variables WHERE id = ...
Hmmm. This means I can run this formula on the most recent row I’ve just added. But wait, this also means I can run this formula on any row I’ve ever gathered.
With mycheckpoint you can generate graphs retroactively using new formulas. The data is there, vanilla style. Any formula which can be calculated via SQL is good to go. Plus, you get the benefit of cross-referencing in fun ways: cross-reference to the timestamp at which the sample was taken (so, for example, ignore the spikes generated at this or that timeframe due to maintenance; don’t alert me on these), or to system metrics like load average or CPU usage (show me the average Seconds_behind_master when load average is over 8, or the average load average when the slow query rate is over some threshold). You don’t do that all the time, but when you need it, well, you can get all the insight you ever wanted.
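Two sketches of what such queries might look like (ts and os_loadavg are assumed column names; adapt them to your actual schema):

-- a new formula, applied retroactively to a month of existing samples:
SELECT ts, threads_connected/max_connections*100 AS connections_used_pct
FROM status_variables
WHERE ts >= '2012-01-01' AND ts < '2012-02-01';

-- average replication lag over samples taken while the machine was loaded:
SELECT AVG(seconds_behind_master)
FROM status_variables
WHERE os_loadavg > 8;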
Actually storing the monitored data in an easy-to-access format allows one to query, re-query and re-formulate. No worries about caching; you only sample once.
For completeness, all the above is relevant when the data is of numeric types. Other types are far more complicated to manage (the list of running queries is a common example).
Sheeri,
I’ll be happy if you do, and happier still for any feedback.
300 columns today, 305 tomorrow when you upgrade MySQL. Are you ready to generate the 5 ALTER TABLE ADD COLUMN clauses?
A couple of years ago, I looked at that approach and said yuck. I looked at key-value stores, but those perform terribly at scale. Instead, I stuffed the values into a JSON structure in a single column. The scripts can then either do a json_decode, or use a fairly simple regexp to reach into the JSON.
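Roughly, the layout looks like this (names here are illustrative; the json_decode/regexp work happens in PHP, not in SQL):

CREATE TABLE server_status (
  host VARCHAR(64) NOT NULL,
  ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  json TEXT  -- e.g. '{"Threads_connected":"12","Max_connections":"151"}'
);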
So, currently I provide the expression, the host(s) desired, the timeframe, etc., and PHP does all the grunt work. It performs quite adequately.
I don’t need a WHERE clause on the status/variables.
Since I deal with hundreds of servers, from 4.0 to 5.1, plus Percona builds, this mechanism does not care about new variable names. (Name changes like “table_cache” vs. “table_open_cache” are a nuisance.)
@Rick,
5 columns, *ONE* ALTER TABLE 🙂
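That is, something like this (made-up column names), all in a single statement:

ALTER TABLE status_variables
  ADD COLUMN new_status_variable_1 BIGINT UNSIGNED,
  ADD COLUMN new_status_variable_2 BIGINT UNSIGNED,
  ADD COLUMN new_status_variable_3 BIGINT UNSIGNED,
  ADD COLUMN new_status_variable_4 BIGINT UNSIGNED,
  ADD COLUMN new_status_variable_5 BIGINT UNSIGNED;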
And, yes, I’m willing, since these tables are not that large: a year’s worth of data comes to ~500MB. Not that much to refactor.
I urge you, just for the fun of it, to give mycheckpoint a try. I think you’ll be surprised at the depth and detail one can fit within such a small footprint, in one Python script.
I’m not belittling other monitoring tools, of course. I can’t compare mycheckpoint to Cacti; they’re on completely different scales. I’m just saying: try it.