Auto caching tables

March 6, 2012

Is there a way to create a caching table, some sort of a materialized view, such that upon selecting from that table, its data is validated/invalidated?

Hint: yes.

But to elaborate the point: say I have some table data_table. Can I rewrite all my queries which access data_table to read from some autocache_data_table, but have nothing changed in the query itself? No caveats, no additional WHEREs, and still have that autocache_data_table provide with the correct data, dynamically updated by some rule of our choice?

And: no crontab, no event scheduler, and no funny triggers on data_table? In such way that invalidation/revalidation occurs upon SELECT?

Well, yes.

This post is long, but I suggest you read it through to understand the mechanism, it will be worthwhile.

Background

The following derives from my long research on how to provide better, faster and safer access to INFORMATION_SCHEMA tables. It is however not limited to this exact scenario, and in this post I provide with a simple, general purpose example. I'll have more to share about INFORMATION_SCHEMA specific solutions shortly.

I was looking for a server side solution which would not require query changes, apart from directing the query to other tables. Solution has to be supported by all standard MySQL installs; so: no plugins, no special rebuilds.

Sample data

I'll explain by walking through the solution. Let's begin with some sample table:

CREATE TABLE sample_data (
  id INT UNSIGNED NOT NULL PRIMARY KEY,
  dt DATETIME,
  msg VARCHAR(128) CHARSET ascii
);

INSERT INTO sample_data VALUES (1, NOW(), 'sample txt');
INSERT INTO sample_data VALUES (2, NOW(), 'sample txt');
INSERT INTO sample_data VALUES (3, NOW(), 'sample txt');

SELECT * FROM sample_data;
+----+---------------------+------------+
| id | dt                  | msg        |
+----+---------------------+------------+
|  1 | 2011-11-24 11:01:30 | sample txt |
|  2 | 2011-11-24 11:01:30 | sample txt |
|  3 | 2011-11-24 11:01:30 | sample txt |
+----+---------------------+------------+

In this simplistic example, I wish to create a construct which looks exactly like sample_data, but which caches data according to some heuristic. It will, in fact, cache the entire content of sample_data.

That much is not a problem: just create another table to cache the data:

CREATE TABLE cache_sample_data LIKE sample_data;

The big question is: how do you make the table invalidate itself while SELECTing from it?

Here's the deal. I'll ask for your patience while I draw the outline, and start with failed solutions. By the end, everything will work.

Failed attempt: purge rows from the table even while reading it

My idea is to create a stored function which purges the cache_sample_data table, then fills in with fresh data, according to some heuristic. Something like this:

DELIMITER $$

CREATE FUNCTION `revalidate_cache_sample_data`() RETURNS tinyint unsigned
    MODIFIES SQL DATA
    DETERMINISTIC
    SQL SECURITY INVOKER
BEGIN
  if(rand() > 0.1) then
    return 0; -- simplistic heuristic
  end if;

  DELETE FROM cache_sample_data;
  INSERT INTO cache_sample_data SELECT * FROM sample_data;
  RETURN 0;
END $$

DELIMITER ;

So the function uses some heuristic. It's a funny RAND() in our case; you will want to check up on time stamps, or some flags, what have you. But this is not the important part here, and I want to keep the focus on the main logic.

Upon deciding the table needs refreshing, the function purges all rows, then copies everything from sample_data. Sounds fair enough?

Let's try and invoke it. Just write some query by hand:

mysql> SELECT revalidate_cache_sample_data();
+--------------------------------+
| revalidate_cache_sample_data() |
+--------------------------------+
|                              0 |
+--------------------------------+

mysql> SELECT revalidate_cache_sample_data();
+--------------------------------+
| revalidate_cache_sample_data() |
+--------------------------------+
|                              0 |
+--------------------------------+

mysql> SELECT revalidate_cache_sample_data();
+--------------------------------+
| revalidate_cache_sample_data() |
+--------------------------------+
|                              1 |
+--------------------------------+

First two invocations - nothing. The third one indicated a revalidation of cache data. Let's verify:

mysql> SELECT * FROM cache_sample_data;
+----+---------------------+------------+
| id | dt                  | msg        |
+----+---------------------+------------+
|  1 | 2011-11-24 11:01:30 | sample txt |
|  2 | 2011-11-24 11:01:30 | sample txt |
|  3 | 2011-11-24 11:01:30 | sample txt |
+----+---------------------+------------+

OK, seems like the function works.

We now gather some courage, and try combining calling to this function even while SELECTing from the cache table, like this:

SELECT
  cache_sample_data.*
FROM
  cache_sample_data,
  (SELECT revalidate_cache_sample_data()) AS select_revalidate
;
+----+---------------------+------------+
| id | dt                  | msg        |
+----+---------------------+------------+
|  1 | 2011-11-24 11:01:30 | sample txt |
|  2 | 2011-11-24 11:01:30 | sample txt |
|  3 | 2011-11-24 11:01:30 | sample txt |
+----+---------------------+------------+

To explain what happens in the above query, consider its programmatic nature: we create a derived table, populated by the function's result. That means the function is invoked in order to generate the derived table. The derived table itself must be materialized before the query begins execution, and so it is that we first invoke the function, then make the SELECT.

Don't open the champagne yet. While the above paragraph is correct, we are deceived: in this last invocation, the function did not attempt a revalidation. The RAND() function just didn't provide with the right value.

Let's try again:

SELECT
  cache_sample_data.*
FROM
  cache_sample_data,
  (SELECT revalidate_cache_sample_data()) AS select_revalidate
;
ERROR 1442 (HY000): Can't update table 'cache_sample_data' in stored function/trigger because it is already used by statement which invoked this stored function/trigger.

Aha! Bad news. The MySQL manual says on Restrictions on Stored Programs:

A stored function or trigger cannot modify a table that is already being used (for reading or writing) by the statement that invoked the function or trigger.

Anyone to the rescue?

I was quite upset. Can we not make this work? At sorrow times like these, one reflects back on words of wiser people. What would Roland Bouman say on this?

Oh, yes; he would say: "we can use a FEDERATED table which connect onto itself, thus bypass the above restriction".

Unfortunately, FEDERATED is by default disabled nowadays; I cannot rely on its existence. Besides, to use FEDERATED one has to fill in passwords and stuff. Definitely not an out-of-the-box solution in this case.

Few more days gone by. Decided the problem cannot be solved. And then it hit me.

MyISAM to the rescue

MyISAM? Really?

Yes, and not only MyISAM, but also its cousin: it's long abandoned cousin, forgotten once views and partitions came into MySQL. MERGE.

MERGE reflects the data contained within MyISAM tables. Perhaps the most common use for MERGE is to work out partitioned-like table of records, with MyISAM table-per month, and an overlooking MERGE table dynamically adding and removing tables from its view.

But I intend for MERGE a different use: just be an identical reflection of cache_sample_data.

So we must work out the following:

ALTER TABLE cache_sample_data ENGINE=MyISAM;
CREATE TABLE cache_sample_data_wrapper LIKE cache_sample_data;
ALTER TABLE cache_sample_data_wrapper ENGINE=MERGE UNION=(cache_sample_data);

I just want to verify the new table is setup correctly:

mysql> SELECT * FROM cache_sample_data_wrapper;
+----+---------------------+------------+
| id | dt                  | msg        |
+----+---------------------+------------+
|  1 | 2011-11-24 11:01:30 | sample txt |
|  2 | 2011-11-24 11:01:30 | sample txt |
|  3 | 2011-11-24 11:01:30 | sample txt |
+----+---------------------+------------+

Seems fine.

So the next step is what makes the difference: the two tables are not the same. One relies on the other, but they are distinct. Our function DELETEs from and INSERTs to cached_sample_data, but it does not affect, nor lock, cache_sample_data_wrapper.

We now rewrite our query to read:

SELECT
  cache_sample_data_wrapper.*
FROM
  cache_sample_data_wrapper,
  (SELECT revalidate_cache_sample_data()) AS select_revalidate
;

This query is perfectly valid. It works. To illustrate, I do:

-- Try this a few times till RAND() is lucky:

TRUNCATE cache_sample_data;

SELECT
  cache_sample_data_wrapper.*
FROM
  cache_sample_data_wrapper,
  (SELECT revalidate_cache_sample_data()) AS select_revalidate
;
+----+---------------------+------------+
| id | dt                  | msg        |
+----+---------------------+------------+
|  1 | 2011-11-24 11:01:30 | sample txt |
|  2 | 2011-11-24 11:01:30 | sample txt |
|  3 | 2011-11-24 11:01:30 | sample txt |
+----+---------------------+------------+

Whoa! Where did all this data come from? Didn't we just TRUNCATE the table?

The query worked. The function re-populated cache_sample_data.

The final touch

Isn't the above query just beautiful? I suppose not many will share my opinion. What happened to my declaration that "the original query need not be changed, apart from querying a different table"?

Yes, indeed. It's now time for the final touch. There's nothing amazing in this step, but we all know the way it is packaged is what makes the sale. We will now use views. We use two of them since a view must not contain a subquery in the FROM clause. Here goes:

CREATE OR REPLACE VIEW revalidate_cache_sample_data_view AS
  SELECT revalidate_cache_sample_data()
;

CREATE OR REPLACE VIEW autocache_sample_data AS
  SELECT
    cache_sample_data_wrapper.*
  FROM
    cache_sample_data_wrapper,
    revalidate_cache_sample_data_view
;

And finally, we can make a very simple query like this:

SELECT * FROM autocache_sample_data;
--
-- Magic in work now!
--
+----+---------------------+------------+
| id | dt                  | msg        |
+----+---------------------+------------+
|  1 | 2011-11-24 11:01:30 | sample txt |
|  2 | 2011-11-24 11:01:30 | sample txt |
|  3 | 2011-11-24 11:01:30 | sample txt |
+----+---------------------+------------+

Much as we would query the original sample_data table.

Summary

So what have we got? A stored routine, a MyISAM table, a MERGE table and two views. Quite a lot of constructs just to cache a table! But a beautiful cache access: plain old SQL queries. The flow looks like this:

Our cache table is a MyISAM table. It can get corrupted, which is bad. But not completely bad: it's nothing more than a cache; we can throw away its entire data, and revalidate. We can actually ask the function to revalidate (say, pass a parameter).

tags: , , , ,
posted in MySQL by shlomi

« | »

Follow comments via the RSS Feed | Leave a comment | Trackback URL

9 Comments to "Auto caching tables"

  1. Roland Bouman wrote:

    Hi Shlomi!

    Brilliant solution :)

    There's a lot of moving parts but I think this is going to be worth it if we can get better information_schema performance and reliability with this.

  2. Jonathan Levin wrote:

    Hi Sholmi,

    I like the idea, but I have a bad feeling about it from past experiences.
    A lot depends on the cache invalidation heuristic you pointed out. If it is inefficient, you just have a lot of IO overhead where it was not needed.

    Can you recommend some good cache invalidation techniques to go with this idea?

  3. shlomi wrote:

    @Roland,
    My next post will discuss INFORMATION_SCHEMA. I have this working, but am unsatisfied with a few issues.

    @Jonathan,
    Take time-based invalidation, which is commonly used, such with memcached. Data is valid for the next 5 minutes. Put the readtime somewhere.
    It's simple, yet surprisingly useful, in my opinion.

  4. Giuseppe Maxia wrote:

    Congratulations on yet another great hack!

  5. Rick James wrote:

    As I see it (and using the RAND), 10% of the time it will be even slower than the underlying query. If this is in a UI, the user will get unpredictable response time (not good).

    OTOH, in a batch job, it would (could) cut down on overall system load and, on average, speed up the process.

    If the RAND is replaced by some _cheap_ test for whether to refresh the cache, _and_ refreshing is costly, then "you pay the price only when you need to". — this has applicability. But that implies that the SELECT is happening more often than changes to the underlying table. Guess what? The Query cache does an even better job of optimizing that case.

    If there is a WHERE clause on the user's query, that is not, and can not, be pushed down to the reload subquery. If it could, that would open up more doors for optimizations.

    Maybe I will find a use for this eventually. It is quite clever!

  6. shlomi wrote:

    @Rick,
    I'm utterly not sold on "The Query cache does an even better job of optimizing that case", for many reasons.
    One being that the query cache results get invalidated so frequently, and by such innocent updates, that it quickly becomes useless. At least this is my experience.

    I agree that the stored routine makes for an overhead. MySQL is very slow in interpreting stored routines.

    The RAND is only used because it's simplest to demonstrate with, and easiest to follow. It does not make sense to use RAND for real. Thus, the "10% of the time it will be even slower" is irrelevant. It should be "Once every #### times the query is even slower than it would without cache".

  7. Jannes wrote:

    Wow nice! I've been trying to attack this problem from similar sides a few times as well, but hadn't made it fit together so nicely!

    1) I guess you could let your function return a 1 in case the cache was just freshly revalidated? Or a timestamp of the last revalidation and then the client running the query could SELECT that field as well if it were interested in that information?

    2) A TRUNCATE would be faster than DELETE FROM cache_sample_data; right? (IF you want to delete everything). Of course you could have logic that allows you to update only what has changed since the previous revalidation.

    3) I guess it's not possible to ONLY revalidate rows that the client is interested in (i.e. the rest of the WHERE clause in the client's query) ?

    Anyway, congrats for pulling this off!

  8. shlomi wrote:

    @Jannes,
    Thanks

    1. Yes, indeed.
    2. True, though the above is merely a demo. You may wish to only invalidate specific rows
    3. Yes and no. You can use this_query() to get the invocating query. If you're up to parsing it, well, it can be done.

  9. Auto caching INFORMATION_SCHEMA tables: seeking input | code.openark.org wrote:

    [...] Auto caching tables [...]

Leave Your Comment

 

 
Powered by Wordpress and MySQL. Theme by openark.org