The short version
I have it all working. It’s kind of magic. But there are issues, and I’m not sure it should even exist, and I’m looking for input.
The long version
In Auto caching tables I presented a hack which allows getting cached or fresh results via a simple SELECT query.
The drive behind the above hack was the INFORMATION_SCHEMA tables. There are two major problems with INFORMATION_SCHEMA:
- Queries on schema-oriented tables such as TABLES, COLUMNS, STATISTICS, etc. are heavyweight. How heavyweight? Enough to lock down your database. Enough, in some cases, to crash it.
- The data is always generated on the fly, as you request it. Query the COLUMNS table twice, and you risk two lockdowns of your database.
The auto-cache mechanism solves issue #2. I have it working, time-based: I have an auto-cache table for each of the heavyweight INFORMATION_SCHEMA tables. Say the cache is invalidated every 30 minutes. Throughout those 30 minutes, you get a free pass!
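Here is a minimal sketch of the time-based idea. Note this is not the exact view/stored-routine hack from the Auto caching tables post; it uses the event scheduler instead, and the schema, table and event names here are all hypothetical:

```sql
-- Hypothetical schema to hold the caches:
CREATE DATABASE IF NOT EXISTS meta_cache;

-- Clone the structure of INFORMATION_SCHEMA.COLUMNS, without its data:
CREATE TABLE IF NOT EXISTS meta_cache.columns_cache AS
  SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE 1 = 0;

SET GLOBAL event_scheduler = 1;  -- requires the SUPER privilege

DELIMITER //
CREATE EVENT meta_cache.refresh_columns_cache
ON SCHEDULE EVERY 30 MINUTE
DO
BEGIN
  -- The heavyweight read happens once per 30 minutes; every other
  -- read goes against the cache table and gets the free pass.
  TRUNCATE TABLE meta_cache.columns_cache;
  INSERT INTO meta_cache.columns_cache
    SELECT * FROM INFORMATION_SCHEMA.COLUMNS;
END //
DELIMITER ;
```

From then on, `SELECT * FROM meta_cache.columns_cache WHERE TABLE_NAME = 'my_table'` reads at most 30-minute-old metadata without touching INFORMATION_SCHEMA at all.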
The auto-cache mechanism also paves the road to solving issue #1: since it works by invoking a stored routine, I have better control over the way I read INFORMATION_SCHEMA. Thus, I can take advantage of INFORMATION_SCHEMA optimization. It’s tedious, but not complicated.
For example, if I want to cache the TABLES table, I don’t necessarily read the entire TABLES data in one read. Instead, I can iterate over the schemata, get a list of table names per schema, then read full row data for these, table by table. The result? Many, many more SELECTs, but each one more optimized, and no one big lock-it-all query.
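A hypothetical sketch of the per-schema variant follows (the real routine goes further, table by table). The point of the dynamic SQL is to embed the schema name as a literal, which is what lets the server’s INFORMATION_SCHEMA optimization prune the read to a single schema:

```sql
-- Assumes a cache table shaped like INFORMATION_SCHEMA.TABLES:
CREATE TABLE IF NOT EXISTS meta_cache.tables_cache AS
  SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE 1 = 0;

DELIMITER //
CREATE PROCEDURE meta_cache.refresh_tables_cache()
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE v_schema VARCHAR(64);
  DECLARE schema_cursor CURSOR FOR
    SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  TRUNCATE TABLE meta_cache.tables_cache;
  OPEN schema_cursor;
  read_loop: LOOP
    FETCH schema_cursor INTO v_schema;
    IF done THEN
      LEAVE read_loop;
    END IF;
    -- One small, optimizable read per schema, with the schema name
    -- embedded as a literal so only that schema's tables are opened:
    SET @refresh_sql := CONCAT(
      'INSERT INTO meta_cache.tables_cache ',
      'SELECT * FROM INFORMATION_SCHEMA.TABLES ',
      'WHERE TABLE_SCHEMA = ', QUOTE(v_schema));
    PREPARE refresh_stmt FROM @refresh_sql;
    EXECUTE refresh_stmt;
    DEALLOCATE PREPARE refresh_stmt;
  END LOOP;
  CLOSE schema_cursor;
END //
DELIMITER ;
```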
And the problem is…
I have two burning problems.
- INFORMATION_SCHEMA optimization only goes so far; sometimes it does not work at all. In particular, I’ve noticed that if you have a view which relies on another view (possibly relying on yet another view), things get out of hand. I author a monitoring tool for MySQL called mycheckpoint. It uses some fancy techniques for generating aggregated data, HTML and charts by means of nested views. There are a few views there I can never query for in COLUMNS: doing so just crashes my server. Repeatedly. And it’s a good machine with a good configuration. Make that 5 machines. They all crash, repeatedly. I just can’t trust INFORMATION_SCHEMA! (A toy sketch of this nested-view scenario follows this list.)
- Replication: any caching table is bound to replicate. Does it make any sense to replicate a cache of internal metadata? Does it make sense to query the cached table on a slave, and have it answer for the master’s data? With plain old INFORMATION_SCHEMA, every server is on its own. Caching kind of works against this. Or is it fair enough, since we would usually expect master and slaves to reflect the same schema structure?
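To make problem #1 concrete, here is a toy sketch of the nested-view scenario (hypothetical names, assuming a test schema; at this tiny scale nothing will crash, but it shows the shape of the query that does):

```sql
CREATE TABLE test.t (id INT);
CREATE VIEW test.v1 AS SELECT id FROM test.t;
CREATE VIEW test.v2 AS SELECT id FROM test.v1;  -- a view on a view
CREATE VIEW test.v3 AS SELECT id FROM test.v2;  -- and one level deeper

-- Querying the views themselves works fine. Asking INFORMATION_SCHEMA
-- about them is what hurts: to describe v3, the server must resolve
-- the entire view chain underneath it.
SELECT COLUMN_NAME, DATA_TYPE
  FROM INFORMATION_SCHEMA.COLUMNS
  WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'v3';
```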
I would feel much better if I could read SHOW statement output with a SELECT query. Though I’ve found this nice hack, it only works from a stored procedure, not from a stored function, so it can’t be used from within a SELECT query. I’ve been banging my head against this for months now; I think I’ve given up on this one.
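For the record, the shape of the limitation as I understand it: SHOW output can be produced from a stored procedure via dynamic SQL, but MySQL forbids dynamic SQL in stored functions and triggers, and functions cannot return result sets, so there is no way to surface it inside a SELECT. A sketch, assuming your MySQL version allows preparing this SHOW statement (most SHOW statements are preparable in 5.x):

```sql
DELIMITER //
CREATE PROCEDURE show_uptime_proc()
BEGIN
  -- A procedure may return the result set of an EXECUTE to the client:
  PREPARE s FROM 'SHOW GLOBAL STATUS LIKE ''Uptime''';
  EXECUTE s;
  DEALLOCATE PREPARE s;
END //
DELIMITER ;

CALL show_uptime_proc();  -- works
-- The same body inside a CREATE FUNCTION is rejected with
-- "Dynamic SQL is not allowed in stored function or trigger",
-- which is why the result cannot be consumed from within a SELECT.
```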
Any insights are welcome!
@Jannes,
Yes, it’s just the views.
The deeper the nesting of views, the more memory is consumed, and the greater the danger of a crash.
Just before the crash, I see about 1GB of memory being drained. My guess is that the recursive read from I_S is the root of the problem.
1)
Q: “Does it make sense to query the cached table on a slave, and have it answer for the master’s data?.. Or is it fair enough, since we would usually expect master and slaves to reflect the same schema structure?” A: I don’t think so, if any binlog_do_db/binlog_ignore_db or replicate_do_db/replicate_ignore_db filtering is in effect on the master or slave.
2)
Jannes was faster than me here, but building a system that depends on VIEWs defined on VIEWs (defined on .. (.. on ..)) 🙂 in MySQL is synonymous with asking for problems. The limitations of VIEWs in MySQL are well known, and anyone who has read (any edition of) ‘High Performance MySQL’ would have foreseen these problems.
Peter,
With regard to the views: they are working just great, giving no trouble. It’s just the INFORMATION_SCHEMA querying on views that fails. I don’t see why that should mean I was asking for trouble.
Peter, with regard to binlog filters: indeed. I’m wondering how popular this is. In my years as a consultant I’ve seen little to no use of it. I believe the vast majority of replication users do not do replication filtering.
Could you make the caching period longer? I know it’s hard to see outside of the database, but the .frm file changes when the table structure changes. If the .frm file doesn’t change, does any of the metadata you’re referring to change? (That’s a serious question, because the answer might be “yes” if you’re looking at data/index size or the TABLE_ROWS field of the TABLES table, for example.)
If you are only referring to the structure of the table, and not the nature of the data in it, then I bet you could come up with some kind of hack that writes to a common file when a .frm file has changed, and only updates the cache for the tables/indexes/columns that have changed, when they change, instead of re-reading all the definitions.