common_schema/QueryScript 1.1 introduces the split statement, which auto-breaks a “large” query (one which operates on large tables as a whole, or without keys) into smaller queries, and executes them in sequence. This post exposes some of the internals and the SQL behind QueryScript’s split.
This makes for easier transactions, fewer locks held, and potentially (depending on the user) more idle time released back to the database. split shares concepts with oak-chunk-update and pt-archiver, but works differently, and is implemented entirely in SQL, on the server side.
Take the following statement as example:
split (UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR) pass;
It yields (roughly) the following statements:
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '1')) OR ((`inventory`.`inventory_id` = '1'))) AND (((`inventory`.`inventory_id` < '1000')) OR ((`inventory`.`inventory_id` = '1000'))));
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '1000'))) AND (((`inventory`.`inventory_id` < '2000')) OR ((`inventory`.`inventory_id` = '2000'))));
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '2000'))) AND (((`inventory`.`inventory_id` < '3000')) OR ((`inventory`.`inventory_id` = '3000'))));
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '3000'))) AND (((`inventory`.`inventory_id` < '4000')) OR ((`inventory`.`inventory_id` = '4000'))));
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '4000'))) AND (((`inventory`.`inventory_id` < '4581')) OR ((`inventory`.`inventory_id` = '4581'))));
(I say “roughly” because internally there are user defined variables at play; for convenience, I spell out the actual values as constants.)
How does that work?
common_schema works server side. There is no Perl script or external process. It must therefore use server-side operations to:
- Identify table to be split
- Analyze the table in the first place, deciding how to split it
- Analyze the query, deciding on how to rewrite it
- Split the table (logically) into unique and distinct chunks
- Work out the query on each such chunk
Following is an internal look at how common_schema does all the above.
Identifying the table
When the query operates on a single table, split is able to parse the query’s SQL and figure out that table. When multiple tables are involved, split requires user instruction: by which table should the query be split?
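For illustration only (please consult the common_schema documentation for the authoritative syntax), explicitly naming the table to split by could look roughly like this; the join itself is a made-up example:

-- Hypothetical example: the statement touches two tables, so we tell split to chunk by sakila.inventory
split (sakila.inventory: UPDATE sakila.inventory JOIN sakila.store USING (store_id) SET sakila.inventory.last_update = sakila.inventory.last_update + INTERVAL 6 HOUR) pass;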
Analyzing the table
Table analysis is done via a method similar to candidate_keys_recommended. It is almost identical, except that it uses INFORMATION_SCHEMA optimizations to make the query short and lightweight. Simulating the analysis using candidate_keys_recommended, we get:
mysql> select * from candidate_keys_recommended where table_name='inventory' \G
*************************** 1. row ***************************
          table_schema: sakila
            table_name: inventory
recommended_index_name: PRIMARY
          has_nullable: 0
            is_primary: 1
 count_column_in_index: 1
          column_names: inventory_id
This is cool, simple, and very easy to work with: we choose to split the table via the inventory_id column, which is conveniently an integer. We’ll soon see that split can handle complex cases as well.
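The internal query is more elaborate, but the idea can be sketched with plain INFORMATION_SCHEMA lookups. The following is a simplified illustration of mine, not the code split actually runs:

-- Simplified sketch: find UNIQUE keys on sakila.inventory whose columns are all NOT NULL,
-- preferring the PRIMARY KEY, then keys with fewer columns.
SELECT
  s.INDEX_NAME,
  GROUP_CONCAT(s.COLUMN_NAME ORDER BY s.SEQ_IN_INDEX) AS column_names,
  COUNT(*) AS count_column_in_index,
  MAX(c.IS_NULLABLE = 'YES') AS has_nullable
FROM INFORMATION_SCHEMA.STATISTICS AS s
JOIN INFORMATION_SCHEMA.COLUMNS AS c
  ON c.TABLE_SCHEMA = s.TABLE_SCHEMA
  AND c.TABLE_NAME = s.TABLE_NAME
  AND c.COLUMN_NAME = s.COLUMN_NAME
WHERE s.TABLE_SCHEMA = 'sakila'
  AND s.TABLE_NAME = 'inventory'
  AND s.NON_UNIQUE = 0
GROUP BY s.INDEX_NAME
HAVING has_nullable = 0
ORDER BY (s.INDEX_NAME = 'PRIMARY') DESC, count_column_in_index ASC
LIMIT 1;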
Analyzing the query
This is done in part via Roland’s query_analysis_routines, and in part by simply parsing the query, looking for WHERE, GROUP BY, LIMIT, etc. clauses.
The nice part is injecting a WHERE condition, which didn’t appear in the original query. That WHERE condition is what limits the query to a distinct chunk of rows.
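To illustrate with an example of my own (not actual split output): when the original statement already has a WHERE clause, the chunk’s range condition is simply ANDed onto it; the range condition below is simplified for brevity:

-- Original statement (hypothetical):
UPDATE sakila.inventory SET last_update = NOW() WHERE store_id = 1;
-- The same statement, limited to a single chunk by the injected range condition:
UPDATE sakila.inventory SET last_update = NOW()
WHERE (store_id = 1)
  AND (`inventory`.`inventory_id` > '1000')
  AND (`inventory`.`inventory_id` <= '2000');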
Splitting the table
With a single INTEGER PRIMARY KEY this sounds simple, right? Take rows 1..1,000, then 1,001..2,000, then 2,001..3,000 etc.
Wrong: even with this simple scenario, things are much more complex. Are the numbers successive? What if there are holes? What if there is a 1,000,000 gap between every two numbers? What if there are multiple holes of differing size and frequency?
And what if we have two columns in our UNIQUE KEY? What if one of them is textual and the other a TIMESTAMP, neither of them an INTEGER?
split doesn’t work in that naive way. It makes no assumptions on the density of values. It only requires:
- some UNIQUE KEY to work with,
- which has no NULL values.
Given the above, it uses User Defined Variables to set up the chunks. With our single INTEGER column, the minimum value is set like this:
select inventory_id from `sakila`.`inventory` order by inventory_id ASC limit 1 into @_split_column_variable_min_1 ;
This sets the first value of the first chunk. What value terminates this chunk? It is calculated by looking at the next 1,000 rows in key order and taking the last of them:
select inventory_id from (
    select inventory_id
    from `sakila`.`inventory`
    where
      (((`inventory`.`inventory_id` > @_split_column_variable_range_start_1)) OR ((`inventory`.`inventory_id` = @_split_column_variable_range_start_1)))
      and (((`inventory`.`inventory_id` < @_split_column_variable_max_1)) OR ((`inventory`.`inventory_id` = @_split_column_variable_max_1)))
    order by inventory_id ASC
    limit 1000
  ) sel_split_range
order by inventory_id DESC
limit 1
into @_split_column_variable_range_end_1
;
Now there’s a query you wouldn’t want to write by hand, would you?
The cool part here is that the above works well for any type of column; this doesn’t have to be an INTEGER. Dates, strings etc. are all just fine.
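As an illustration of mine (not generated output), the very same pattern applied to a hypothetical table my_schema.events, keyed by a CHAR(36) uuid column, would look roughly like this; the boundary conditions are simplified to >=/<= for brevity:

-- Hypothetical table my_schema.events with PRIMARY KEY (uuid):
-- take the next 1000 rows in key order, keep the last one as the chunk's end.
select uuid from (
    select uuid
    from `my_schema`.`events`
    where (`events`.`uuid` >= @_split_column_variable_range_start_1)
      and (`events`.`uuid` <= @_split_column_variable_max_1)
    order by uuid ASC
    limit 1000
  ) sel_split_range
order by uuid DESC
limit 1
into @_split_column_variable_range_end_1
;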
The above also works well for multiple columns, where the query gets more complicated (see following).
Working out the query per chunk
This part is the easy one, now that all the hard work is done. We know how to manipulate the query, we know the lower and upper boundaries of the chunk, so we just fill in the values and execute.
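Schematically, and as my own outline rather than the actual QueryScript implementation, each iteration looks roughly like this:

-- 1. range_start is the key's minimum (first chunk) or the previous chunk's range_end (later chunks);
-- 2. range_end is evaluated as shown above (last key value among the next 1000 rows);
-- 3. the rewritten statement runs with the chunk's boundaries filled in;
--    note that, as in the generated statements above, later chunks use a strict '>' on range_start.
SET @_split_column_variable_range_start_1 := @_split_column_variable_min_1;
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR
WHERE `inventory`.`inventory_id` >= @_split_column_variable_range_start_1
  AND `inventory`.`inventory_id` <= @_split_column_variable_range_end_1;
SET @_split_column_variable_range_start_1 := @_split_column_variable_range_end_1;
-- ...repeat until range_end reaches the table's maximum key value.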
Multi-columns keys
Consider a similar query on sakila.film_actor, where the PRIMARY KEY is a compound key over two columns:
split (UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR) throttle 2;
The chunked queries will look like this:
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '1')) OR ((`film_actor`.`actor_id` = '1') AND (`film_actor`.`film_id` > '1')) OR ((`film_actor`.`actor_id` = '1') AND (`film_actor`.`film_id` = '1'))) AND (((`film_actor`.`actor_id` < '39')) OR ((`film_actor`.`actor_id` = '39') AND (`film_actor`.`film_id` < '293')) OR ((`film_actor`.`actor_id` = '39') AND (`film_actor`.`film_id` = '293'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '39')) OR ((`film_actor`.`actor_id` = '39') AND (`film_actor`.`film_id` > '293'))) AND (((`film_actor`.`actor_id` < '76')) OR ((`film_actor`.`actor_id` = '76') AND (`film_actor`.`film_id` < '234')) OR ((`film_actor`.`actor_id` = '76') AND (`film_actor`.`film_id` = '234'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '76')) OR ((`film_actor`.`actor_id` = '76') AND (`film_actor`.`film_id` > '234'))) AND (((`film_actor`.`actor_id` < '110')) OR ((`film_actor`.`actor_id` = '110') AND (`film_actor`.`film_id` < '513')) OR ((`film_actor`.`actor_id` = '110') AND (`film_actor`.`film_id` = '513'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '110')) OR ((`film_actor`.`actor_id` = '110') AND (`film_actor`.`film_id` > '513'))) AND (((`film_actor`.`actor_id` < '146')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` < '278')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` = '278'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '146')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` > '278'))) AND (((`film_actor`.`actor_id` < '183')) OR ((`film_actor`.`actor_id` = '183') AND (`film_actor`.`film_id` < '862')) OR ((`film_actor`.`actor_id` = '183') AND (`film_actor`.`film_id` = '862'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '183')) OR ((`film_actor`.`actor_id` = '183') AND (`film_actor`.`film_id` > '862'))) AND (((`film_actor`.`actor_id` < '200')) OR ((`film_actor`.`actor_id` = '200') AND (`film_actor`.`film_id` < '993')) OR ((`film_actor`.`actor_id` = '200') AND (`film_actor`.`film_id` = '993'))));
View the complete command to realize just how much more complex each query is, and how much more complex the chunking becomes. Here’s how I evaluate the chunk’s “next range end” variables:
select actor_id, film_id from (
    select actor_id, film_id
    from `sakila`.`film_actor`
    where
      (((`film_actor`.`actor_id` > @_split_column_variable_range_start_1)) OR ((`film_actor`.`actor_id` = @_split_column_variable_range_start_1) AND (`film_actor`.`film_id` > @_split_column_variable_range_start_2)))
      and (((`film_actor`.`actor_id` < @_split_column_variable_max_1)) OR ((`film_actor`.`actor_id` = @_split_column_variable_max_1) AND (`film_actor`.`film_id` < @_split_column_variable_max_2)) OR ((`film_actor`.`actor_id` = @_split_column_variable_max_1) AND (`film_actor`.`film_id` = @_split_column_variable_max_2)))
    order by actor_id ASC, film_id ASC
    limit 1000
  ) sel_split_range
order by actor_id DESC, film_id DESC
limit 1
into @_split_column_variable_range_end_1, @_split_column_variable_range_end_2
;
By the way, you may recall that everything is done server side. The WHERE condition for the chunked queries is itself generated via an SQL statement, rather than by programmatic logic. Here’s part of the query which computes the limiting condition:
select group_concat('(', partial_comparison, ')' order by n separator ' OR ') as comparison
from (
  select
    n,
    group_concat('(', column_name, ' ', if(is_last, comparison_operator, '='), ' ', variable_name, ')' order by column_order separator ' AND ') as partial_comparison
  from (
    select
      n,
      CONCAT(mysql_qualify(split_table_name), '.', mysql_qualify(column_name)) AS column_name,
      case split_variable_type
        when 'range_start' then range_start_variable_name
        when 'range_end' then range_end_variable_name
        when 'max' then max_variable_name
      end as variable_name,
      _split_column_names_table.column_order,
      _split_column_names_table.column_order = n as is_last
    from
      numbers, _split_column_names_table
    where
      n between _split_column_names_table.column_order and num_split_columns
    order by
      n, _split_column_names_table.column_order
  ) s1
  group by n
) s2
into return_value
;
There is a lot of complexity within split that goes into providing the user with as clean a syntax as possible.
Comments

This is, in my opinion, a better way to ‘chunk’ an ID that is not necessarily dense:
$idz = SELECT id FROM tbl WHERE id > $ida LIMIT 1000,1;
Then you proceed to do
SELECT … WHERE id >= $ida AND id <= $idz
The first query can (in some situations) operate only in the index, thereby not slogging through all the data. And it does not need to build a temp table.
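In plain MySQL with user variables, that suggestion might be rendered roughly as follows (my sketch of the idea, with a hypothetical table tbl and an ORDER BY added to make the LIMIT deterministic):

-- Find the end of the next chunk of (at most) 1000 ids:
SELECT id FROM tbl WHERE id > @ida ORDER BY id LIMIT 1000,1 INTO @idz;
-- Operate on that chunk (using a strict > on @ida so consecutive chunks do not overlap), then advance:
SELECT * FROM tbl WHERE id > @ida AND id <= @idz;
SET @ida := @idz;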
In the 2-column case, things get messier. I prefer to discover which case exists and act accordingly:
WHERE col1 = 123 AND col2 BETWEEN… — when 123 has more than 1000 rows.
WHERE col1 BETWEEN… — no mention of col2
@Rick,
Your solution (1 column) misses the case where there aren’t 1000 rows. The LIMIT 1000,1 will return an empty result set.
In the 2-column case, what you are suggesting leads to unbalanced chunks. You will sometimes get 1,000 rows, sometimes fewer, sometimes many single ones… you have to test beforehand: “does this value have more than 1,000 rows? Does that one? Does the next one? If they don’t, we aggregate…”.
This is much more work than the way I present, where I use the same technique to detect “the last of the next 1000 rows by order of key”.
Although this uses a temporary table, the fact that we’re only examining 1,000 rows keeps it small. I’ve been using this technique for over 3 years in production (and so has anyone using the openark-kit, in particular oak-chunk-update and oak-online-alter-table), and found it satisfactory. The index is utilized well for looking up the 1,000 rows. These are then sorted in reverse order without an index. This makes for a consistent algorithm, which does not care about your key density and does not need to diagnose it (avoiding the overhead of additional lookup SELECTs).
How would you solve a 2-column case where one of the columns is not an integer? A 3-column case? The algorithm used in common_schema & openark-kit doesn’t care. They all work the same way.
> Your solution (1 column) misses the case where there aren’t 1000 rows.
“The devil is in the details.” Also misses the last chunk.
> unbalanced chunks.
I don’t see that as a problem. In ‘real’ cases, I find that the average tends to be over 90% of the requested chunk size (1000).
In extreme cases, some chunks might have only 1 row.
My algorithm does not care how the ordering is done (int, string, etc) — because ORDER BY, LIMIT, and WHERE …>… take care of it.
N columns — the WHERE clause is dynamically created and has a variable number of ANDs in it.
Nothing sacred about 1000. However, it should not be “too big” if the queries will be replicated — else other queries will be delayed in the single-threaded replication stream.
I was merely pointing out a different way to ‘skin the cat’, not saying that openark-kit is in any way ‘wrong’.
One thing to add to your UPDATE example (if using InnoDB): if the set of UPDATEs is wrapped in BEGIN..COMMIT, much of the savings is lost. Without the transaction, ACID across the set is lost. It is a tradeoff that the user should be aware of.
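To make the tradeoff concrete, here is a schematic illustration (mine, not output from split), reusing the inventory example:

-- Each chunk commits on its own: locks are short-lived, but the change is not atomic across chunks.
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE inventory_id <= 1000;
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE inventory_id > 1000 AND inventory_id <= 2000;

-- Wrapping all chunks in one transaction restores atomicity across the set,
-- but holds locks (and undo) for the entire run, losing much of the benefit of chunking.
BEGIN;
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE inventory_id <= 1000;
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE inventory_id > 1000 AND inventory_id <= 2000;
COMMIT;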
This comment may be more important — it is about the inefficient optimization of OR versus AND (at least in older versions of MySQL).
You have:
WHERE
((actor_id > 1) OR (actor_id = 1 AND film_id > 1) OR (actor_id = 1 AND film_id = 1))
AND
((actor_id < 39) OR (actor_id = 39 AND film_id < 293) OR (actor_id = 39 AND film_id = 293))

I would rather see:

WHERE
(actor_id >= 1) AND (actor_id > 1 OR film_id >= 1)
AND
(actor_id <= 39) AND (actor_id < 39 OR film_id <= 293)
By playing with De Morgan's law, I have bubbled ANDs to the top, thereby letting it bound the scan over actor_id BETWEEN 1 and 39. Within 1 and 39, it still has to further filter down by film_id.
Run EXPLAIN. Check the delta in SHOW STATUS LIKE 'Handler%'.
(My chunking technique avoids the need to skip over unnecessary film_id values in the main query.)
@Rick,
Thanks. I didn’t feel like you were saying “openark-kit is wrong”, all is cool and I appreciate your comments 🙂
split() does not work within transaction scope: it breaks a transaction. I have presented “what it looks like”, but the multiple UPDATE statements do not all run within the same transaction. Thank you for pointing this out.
As for the OR vs. AND query, I wrote it this way due to MySQL’s inefficiency in handling compound column (row constructor) comparisons, e.g.:
WHERE (a,b) >= (6,13) AND (a,b) < (19,3)

For some reason MySQL is unable to use an index for the above. My tests show good evaluation plans and index usage for the conditions split generates. I will check for handler_%. Thanks for suggesting this.
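For reference, the equivalent expansion that does index well (on a hypothetical table t with a composite key over (a,b)) is roughly:

-- Row-constructor form; older MySQL versions tend not to use the (a,b) index here:
SELECT * FROM t WHERE (a,b) >= (6,13) AND (a,b) < (19,3);

-- Equivalent expanded form, in the spirit of what split generates, which does use the index:
SELECT * FROM t
WHERE ((a > 6) OR (a = 6 AND b >= 13))
  AND ((a < 19) OR (a = 19 AND b < 3));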