Is there a way to create a caching table, some sort of a materialized view, such that upon selecting from that table, its data is validated/invalidated?
Hint: yes.
But to elaborate the point: say I have some table data_table. Can I rewrite all my queries which access data_table to read from some autocache_data_table, but have nothing changed in the query itself? No caveats, no additional WHEREs, and still have that autocache_data_table provide with the correct data, dynamically updated by some rule of our choice?
And: no crontab, no event scheduler, and no funny triggers on data_table? In such way that invalidation/revalidation occurs upon SELECT?
Well, yes.
This post is long, but I suggest you read it through to understand the mechanism, it will be worthwhile.
Background
The following derives from my long research on how to provide better, faster and safer access to INFORMATION_SCHEMA tables. It is however not limited to this exact scenario, and in this post I provide with a simple, general purpose example. I’ll have more to share about INFORMATION_SCHEMA specific solutions shortly.
I was looking for a server side solution which would not require query changes, apart from directing the query to other tables. Solution has to be supported by all standard MySQL installs; so: no plugins, no special rebuilds.
Sample data
I’ll explain by walking through the solution. Let’s begin with some sample table:
CREATE TABLE sample_data ( id INT UNSIGNED NOT NULL PRIMARY KEY, dt DATETIME, msg VARCHAR(128) CHARSET ascii ); INSERT INTO sample_data VALUES (1, NOW(), 'sample txt'); INSERT INTO sample_data VALUES (2, NOW(), 'sample txt'); INSERT INTO sample_data VALUES (3, NOW(), 'sample txt'); SELECT * FROM sample_data; +----+---------------------+------------+ | id | dt | msg | +----+---------------------+------------+ | 1 | 2011-11-24 11:01:30 | sample txt | | 2 | 2011-11-24 11:01:30 | sample txt | | 3 | 2011-11-24 11:01:30 | sample txt | +----+---------------------+------------+
In this simplistic example, I wish to create a construct which looks exactly like sample_data, but which caches data according to some heuristic. It will, in fact, cache the entire content of sample_data.
That much is not a problem: just create another table to cache the data:
CREATE TABLE cache_sample_data LIKE sample_data;
The big question is: how do you make the table invalidate itself while SELECTing from it?
Here’s the deal. I’ll ask for your patience while I draw the outline, and start with failed solutions. By the end, everything will work.
Failed attempt: purge rows from the table even while reading it
My idea is to create a stored function which purges the cache_sample_data table, then fills in with fresh data, according to some heuristic. Something like this:
DELIMITER $$ CREATE FUNCTION `revalidate_cache_sample_data`() RETURNS tinyint unsigned MODIFIES SQL DATA DETERMINISTIC SQL SECURITY INVOKER BEGIN if(rand() > 0.1) then return 0; -- simplistic heuristic end if; DELETE FROM cache_sample_data; INSERT INTO cache_sample_data SELECT * FROM sample_data; RETURN 0; END $$ DELIMITER ;
So the function uses some heuristic. It’s a funny RAND() in our case; you will want to check up on time stamps, or some flags, what have you. But this is not the important part here, and I want to keep the focus on the main logic.
Upon deciding the table needs refreshing, the function purges all rows, then copies everything from sample_data. Sounds fair enough?
Let’s try and invoke it. Just write some query by hand:
mysql> SELECT revalidate_cache_sample_data(); +--------------------------------+ | revalidate_cache_sample_data() | +--------------------------------+ | 0 | +--------------------------------+ mysql> SELECT revalidate_cache_sample_data(); +--------------------------------+ | revalidate_cache_sample_data() | +--------------------------------+ | 0 | +--------------------------------+ mysql> SELECT revalidate_cache_sample_data(); +--------------------------------+ | revalidate_cache_sample_data() | +--------------------------------+ | 1 | +--------------------------------+
First two invocations – nothing. The third one indicated a revalidation of cache data. Let’s verify:
mysql> SELECT * FROM cache_sample_data; +----+---------------------+------------+ | id | dt | msg | +----+---------------------+------------+ | 1 | 2011-11-24 11:01:30 | sample txt | | 2 | 2011-11-24 11:01:30 | sample txt | | 3 | 2011-11-24 11:01:30 | sample txt | +----+---------------------+------------+
OK, seems like the function works.
We now gather some courage, and try combining calling to this function even while SELECTing from the cache table, like this:
SELECT cache_sample_data.* FROM cache_sample_data, (SELECT revalidate_cache_sample_data()) AS select_revalidate ; +----+---------------------+------------+ | id | dt | msg | +----+---------------------+------------+ | 1 | 2011-11-24 11:01:30 | sample txt | | 2 | 2011-11-24 11:01:30 | sample txt | | 3 | 2011-11-24 11:01:30 | sample txt | +----+---------------------+------------+
To explain what happens in the above query, consider its programmatic nature: we create a derived table, populated by the function’s result. That means the function is invoked in order to generate the derived table. The derived table itself must be materialized before the query begins execution, and so it is that we first invoke the function, then make the SELECT.
Don’t open the champagne yet. While the above paragraph is correct, we are deceived: in this last invocation, the function did not attempt a revalidation. The RAND() function just didn’t provide with the right value.
Let’s try again:
SELECT cache_sample_data.* FROM cache_sample_data, (SELECT revalidate_cache_sample_data()) AS select_revalidate ; ERROR 1442 (HY000): Can't update table 'cache_sample_data' in stored function/trigger because it is already used by statement which invoked this stored function/trigger.
Aha! Bad news. The MySQL manual says on Restrictions on Stored Programs:
A stored function or trigger cannot modify a table that is already being used (for reading or writing) by the statement that invoked the function or trigger.
Anyone to the rescue?
I was quite upset. Can we not make this work? At sorrow times like these, one reflects back on words of wiser people. What would Roland Bouman say on this?
Oh, yes; he would say: “we can use a FEDERATED table which connect onto itself, thus bypass the above restriction”.
Unfortunately, FEDERATED is by default disabled nowadays; I cannot rely on its existence. Besides, to use FEDERATED one has to fill in passwords and stuff. Definitely not an out-of-the-box solution in this case.
Few more days gone by. Decided the problem cannot be solved. And then it hit me.
MyISAM to the rescue
MyISAM? Really?
Yes, and not only MyISAM, but also its cousin: it’s long abandoned cousin, forgotten once views and partitions came into MySQL. MERGE.
MERGE reflects the data contained within MyISAM tables. Perhaps the most common use for MERGE is to work out partitioned-like table of records, with MyISAM table-per month, and an overlooking MERGE table dynamically adding and removing tables from its view.
But I intend for MERGE a different use: just be an identical reflection of cache_sample_data.
So we must work out the following:
ALTER TABLE cache_sample_data ENGINE=MyISAM; CREATE TABLE cache_sample_data_wrapper LIKE cache_sample_data; ALTER TABLE cache_sample_data_wrapper ENGINE=MERGE UNION=(cache_sample_data);
I just want to verify the new table is setup correctly:
mysql> SELECT * FROM cache_sample_data_wrapper; +----+---------------------+------------+ | id | dt | msg | +----+---------------------+------------+ | 1 | 2011-11-24 11:01:30 | sample txt | | 2 | 2011-11-24 11:01:30 | sample txt | | 3 | 2011-11-24 11:01:30 | sample txt | +----+---------------------+------------+
Seems fine.
So the next step is what makes the difference: the two tables are not the same. One relies on the other, but they are distinct. Our function DELETEs from and INSERTs to cached_sample_data, but it does not affect, nor lock, cache_sample_data_wrapper.
We now rewrite our query to read:
SELECT cache_sample_data_wrapper.* FROM cache_sample_data_wrapper, (SELECT revalidate_cache_sample_data()) AS select_revalidate ;
This query is perfectly valid. It works. To illustrate, I do:
-- Try this a few times till RAND() is lucky: TRUNCATE cache_sample_data; SELECT cache_sample_data_wrapper.* FROM cache_sample_data_wrapper, (SELECT revalidate_cache_sample_data()) AS select_revalidate ; +----+---------------------+------------+ | id | dt | msg | +----+---------------------+------------+ | 1 | 2011-11-24 11:01:30 | sample txt | | 2 | 2011-11-24 11:01:30 | sample txt | | 3 | 2011-11-24 11:01:30 | sample txt | +----+---------------------+------------+
Whoa! Where did all this data come from? Didn’t we just TRUNCATE the table?
The query worked. The function re-populated cache_sample_data.
The final touch
Isn’t the above query just beautiful? I suppose not many will share my opinion. What happened to my declaration that “the original query need not be changed, apart from querying a different table”?
Yes, indeed. It’s now time for the final touch. There’s nothing amazing in this step, but we all know the way it is packaged is what makes the sale. We will now use views. We use two of them since a view must not contain a subquery in the FROM clause. Here goes:
CREATE OR REPLACE VIEW revalidate_cache_sample_data_view AS SELECT revalidate_cache_sample_data() ; CREATE OR REPLACE VIEW autocache_sample_data AS SELECT cache_sample_data_wrapper.* FROM cache_sample_data_wrapper, revalidate_cache_sample_data_view ;
And finally, we can make a very simple query like this:
SELECT * FROM autocache_sample_data;
--
-- Magic in work now!
--
+----+---------------------+------------+
| id | dt | msg |
+----+---------------------+------------+
| 1 | 2011-11-24 11:01:30 | sample txt |
| 2 | 2011-11-24 11:01:30 | sample txt |
| 3 | 2011-11-24 11:01:30 | sample txt |
+----+---------------------+------------+
Much as we would query the original sample_data table.
Summary
So what have we got? A stored routine, a MyISAM table, a MERGE table and two views. Quite a lot of constructs just to cache a table! But a beautiful cache access: plain old SQL queries. The flow looks like this:
Our cache table is a MyISAM table. It can get corrupted, which is bad. But not completely bad: it’s nothing more than a cache; we can throw away its entire data, and revalidate. We can actually ask the function to revalidate (say, pass a parameter).
Hi Shlomi!
Brilliant solution 🙂
There’s a lot of moving parts but I think this is going to be worth it if we can get better information_schema performance and reliability with this.
Hi Sholmi,
I like the idea, but I have a bad feeling about it from past experiences.
A lot depends on the cache invalidation heuristic you pointed out. If it is inefficient, you just have a lot of IO overhead where it was not needed.
Can you recommend some good cache invalidation techniques to go with this idea?
@Roland,
My next post will discuss INFORMATION_SCHEMA. I have this working, but am unsatisfied with a few issues.
@Jonathan,
Take time-based invalidation, which is commonly used, such with memcached. Data is valid for the next 5 minutes. Put the readtime somewhere.
It’s simple, yet surprisingly useful, in my opinion.
Congratulations on yet another great hack!
As I see it (and using the RAND), 10% of the time it will be even slower than the underlying query. If this is in a UI, the user will get unpredictable response time (not good).
OTOH, in a batch job, it would (could) cut down on overall system load and, on average, speed up the process.
If the RAND is replaced by some _cheap_ test for whether to refresh the cache, _and_ refreshing is costly, then “you pay the price only when you need to”. — this has applicability. But that implies that the SELECT is happening more often than changes to the underlying table. Guess what? The Query cache does an even better job of optimizing that case.
If there is a WHERE clause on the user’s query, that is not, and can not, be pushed down to the reload subquery. If it could, that would open up more doors for optimizations.
Maybe I will find a use for this eventually. It is quite clever!