How common_schema split()s tables – internals

This post exposes some of the internals, and the SQL behind QueryScript’s split. common_schema/QueryScript 1.1 introduces the split statement, which auto-breaks a “large” query (one which operates on large tables as a whole or without keys) into smaller queries, and executes them in sequence.

This makes for easier transactions, less locks held, potentially (depending on the user) more idle time released back to the database. split has similar concepts to oak-chunk-update and pt-archiver, but works differently, and implemented entirely in SQL on server side.

Take the following statement as example:

split (UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR)
  pass;

It yields with (roughly) the following statements:

UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '1')) OR ((`inventory`.`inventory_id` = '1'))) AND (((`inventory`.`inventory_id` < '1000')) OR ((`inventory`.`inventory_id` = '1000'))));
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '1000'))) AND (((`inventory`.`inventory_id` < '2000')) OR ((`inventory`.`inventory_id` = '2000'))));
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '2000'))) AND (((`inventory`.`inventory_id` < '3000')) OR ((`inventory`.`inventory_id` = '3000'))));
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '3000'))) AND (((`inventory`.`inventory_id` < '4000')) OR ((`inventory`.`inventory_id` = '4000'))));
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '4000'))) AND (((`inventory`.`inventory_id` < '4581')) OR ((`inventory`.`inventory_id` = '4581'))));

(I say “roughly” because internally there are user defined variables at play, but for convenience, I verbose the actual values as constants.)

How does that work?

common_schema works on server side. There is no Perl script or anything. It must therefore use server-side operations to:

  • Identify table to be split
  • Analyze the table in the first place, deciding how to split it
  • Analyze the query, deciding on how to rewrite it
  • Split the table (logically) into unique and distinct chunks
  • Work out the query on each such chunk

Following is an internal look at how common_schema does all the above.

Identifying the table

When query operates on a single table, split is able to parse the query’s SQL and find out that table. When multiple tables are involved, split requires user instruction: which table is it that the query should be split by?

Analyzing the table

Table analysis is done via a similar method to candidate_keys_recommended. It is almost identical, only it uses INFORMATION_SCHEMA optimizations to make the query short and lightweight. Simulating the analysis using candidate_keys_recommended, we get:

mysql> select * from candidate_keys_recommended where table_name='inventory' \G
*************************** 1. row ***************************
          table_schema: sakila
            table_name: inventory
recommended_index_name: PRIMARY
          has_nullable: 0
            is_primary: 1
 count_column_in_index: 1
          column_names: inventory_id

This is cool, simple and very easy to work with: we choose to split the table via the inventory_id column, which is conveniently an integer. We’ll soon see split can handle complex cases as well.

Analyzing the query

This is done in part via Roland’s query_analysis_routines, and in part just parsing the query, looking for WHERE, GROUP BY, LIMIT etc. clauses.

The nice part is injecting a WHERE condition, which didn’t appear in the original query. That WHERE condition is what limits the query to a distinct chunk of rows.

Splitting the table

With a single INTEGER PRIMARY KEY this sounds simple, right? Take rows 1..1,000, then 1,001..2,000, then 2,001..3,000 etc.

Wrong: even with this simple scenario, things are much more complex. Are the numbers successive? What if there are holes? What if there is a 1,000,000 gap between every two numbers? What if there are multiple holes of differing size and frequency?

And if we have two columns in our UNIQUE KEY? What if one of them is textual, not an INTEGER, the other a TIMESTAMP, not an INTEGER either?

split doesn’t work in that naive way. It makes no assumptions on the density of values. It only requires:

  • some UNIQUE KEY to work with,
  • which has no NULL values.

Given the above, it uses User Defined Variables to setup the chunks. With our single INTEGER column, the minimum value is set like this:

select 
  inventory_id 
from 
  `sakila`.`inventory` 
order by 
  inventory_id ASC 
limit 1  
into @_split_column_variable_min_1
;

This sets the first value of the first chunk. What value terminates this chunk? It is calculated like this:

select 
  inventory_id 
from (
  select 
    inventory_id 
  from 
    `sakila`.`inventory` 
  where 
    (((`inventory`.`inventory_id` > @_split_column_variable_range_start_1)) OR ((`inventory`.`inventory_id` = @_split_column_variable_range_start_1))) and (((`inventory`.`inventory_id` < @_split_column_variable_max_1)) OR ((`inventory`.`inventory_id` = @_split_column_variable_max_1))) 
  order by 
    inventory_id ASC limit 1000 
  ) sel_split_range  
order by 
  inventory_id DESC 
limit 1  
into @_split_column_variable_range_end_1
;

Now there’s a query you wouldn’t want to work by hand, now would you?

The cool part here is that the above works well for any type of column; this doesn’t have to be an INTEGER. Dates, strings etc. are all just fine.

The above also works well for multiple columns, where the query gets more complicated (see following).

Working out the query per chunk

This part is the easy one, now that all the hard work is done. We know ho to manipulate the query, we know the lower and upper boundaries of the chunk, so we just fill in the values and execute.

Multi-columns keys

Consider a similar query on sakila.film_actor, where the PRIMARY KEY is a compound of two columns:

split (UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR)
  throttle 2;

The chunked queries will look like this:

UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '1')) OR ((`film_actor`.`actor_id` = '1') AND (`film_actor`.`film_id` > '1')) OR ((`film_actor`.`actor_id` = '1') AND (`film_actor`.`film_id` = '1'))) AND (((`film_actor`.`actor_id` < '39')) OR ((`film_actor`.`actor_id` = '39') AND (`film_actor`.`film_id` < '293')) OR ((`film_actor`.`actor_id` = '39') AND (`film_actor`.`film_id` = '293'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '39')) OR ((`film_actor`.`actor_id` = '39') AND (`film_actor`.`film_id` > '293'))) AND (((`film_actor`.`actor_id` < '76')) OR ((`film_actor`.`actor_id` = '76') AND (`film_actor`.`film_id` < '234')) OR ((`film_actor`.`actor_id` = '76') AND (`film_actor`.`film_id` = '234'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '76')) OR ((`film_actor`.`actor_id` = '76') AND (`film_actor`.`film_id` > '234'))) AND (((`film_actor`.`actor_id` < '110')) OR ((`film_actor`.`actor_id` = '110') AND (`film_actor`.`film_id` < '513')) OR ((`film_actor`.`actor_id` = '110') AND (`film_actor`.`film_id` = '513'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '110')) OR ((`film_actor`.`actor_id` = '110') AND (`film_actor`.`film_id` > '513'))) AND (((`film_actor`.`actor_id` < '146')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` < '278')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` = '278'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '146')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` > '278'))) AND (((`film_actor`.`actor_id` < '183')) OR ((`film_actor`.`actor_id` = '183') AND (`film_actor`.`film_id` < '862')) OR ((`film_actor`.`actor_id` = '183') AND (`film_actor`.`film_id` = '862'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '183')) OR ((`film_actor`.`actor_id` = '183') AND (`film_actor`.`film_id` > '862'))) AND (((`film_actor`.`actor_id` < '200')) OR ((`film_actor`.`actor_id` = '200') AND (`film_actor`.`film_id` < '993')) OR ((`film_actor`.`actor_id` = '200') AND (`film_actor`.`film_id` = '993'))));

View the complete command to realize just how much more complex each query is, and how much more complex the chunking becomes. Here’s how I evaluate the chunk’s “next range end” variables:

select 
  actor_id, film_id 
from (
  select 
    actor_id, film_id 
  from 
    `sakila`.`film_actor` 
  where 
    (((`film_actor`.`actor_id` > @_split_column_variable_range_start_1)) OR ((`film_actor`.
`actor_id` = @_split_column_variable_range_start_1) AND (`film_actor`.`film_id` > @_split_column_variable_range_start_2))) and (((`film_actor`.`actor_id` < @_split_column_variable_max_1)) OR ((`film_actor`.`actor_id` = @_split_column_variable_max_1) AND (`film_actor`.`film_id` < @_split_column_variable_max_2)) OR ((`film_actor`.`actor_id` = @_split_column_variable_max_1) AND (`film_actor`.`film_id` = @_split_column_variable_max_2))) 
  order by 
    actor_id ASC, film_id ASC 
  limit 1000 
  ) sel_split_range  
order by 
  actor_id DESC, film_id DESC 
limit 1  
into @_split_column_variable_range_end_1, @_split_column_variable_range_end_2
;

By the way, you may recall that everything is done server side. The WHERE condition for the chunked queries is in itself generated via SQL statement, and not too much by programmatic logic. Here’s part of the query which computes the limiting condition:

  select
    group_concat('(', partial_comparison, ')' order by n separator ' OR ') as comparison
  from (
    select 
      n,
      group_concat('(', column_name, ' ', if(is_last, comparison_operator, '='), ' ', variable_name, ')' order by column_order separator ' AND ') as partial_comparison
    from (
      select 
        n, CONCAT(mysql_qualify(split_table_name), '.', mysql_qualify(column_name)) AS column_name,
        case split_variable_type
          when 'range_start' then range_start_variable_name
          when 'range_end' then range_end_variable_name
          when 'max' then max_variable_name
        end as variable_name,
        _split_column_names_table.column_order, _split_column_names_table.column_order = n as is_last 
      from 
        numbers, _split_column_names_table 
      where 
        n between _split_column_names_table.column_order and num_split_columns 
      order by n, _split_column_names_table.column_order
    ) s1
    group by n
  ) s2
  into return_value
  ;

There is a lot of complexity to split to make it able to provide with as clean a syntax for the user as possible.

5
Leave a Reply

avatar
5 Comment threads
0 Thread replies
0 Followers
 
Most reacted comment
Hottest comment thread
2 Comment authors
shlomiRick James Recent comment authors

This site uses Akismet to reduce spam. Learn how your comment data is processed.

  Subscribe  
Notify of
Rick James
Guest
Rick James

This is, in my opinion, a better way to ‘chunk’ an ID that is not necessarily dense: $idz = SELECT id FROM tbl WHERE id > $ida LIMIT 1000,1; Then you proceed to do SELECT … WHERE id >= $ida AND id <= $idz The first query can (in some situations) operate only in the index, thereby not slogging through all the data. And it does not need to build a temp table. In the 2-column case, things get messier. I prefer to discover which case exists and act accordingly: WHERE col1 = 123 AND col2 BETWEEN… — when 123… Read more »

Rick James
Guest
Rick James

> Your solution (1 column) misses the case where there aren’t 1000 rows. “The devil is in the details.” Also misses the last chunk. > unbalanced chunks. I don’t see that as a problem. In ‘real’ cases, I find that the average tends to be over 90% of the requested chunk size (1000). In extreme cases, some chunks might have only 1 row. My algorithm does not care how what the ordering is (int, string, etc) done — because ORDER BY, LIMIT, WHERE …>… take care of it. N columns — the WHERE clause id dynamically created and has a… Read more »

Rick James
Guest
Rick James

This comment may be more important — It is about the inefficient optimization of OR versus AND (at least in older versions of MySQL. You have: WHERE ((actor_id > 1) OR (actor_id = 1 AND film_id > 1) OR (actor_id = 1 AND film_id = 1)) AND ((actor_id < 39) OR (actor_id = 39 AND film_id = 1) AND (film_id >= 1 OR actor_id > 1) AND (actor_id <= 39) AND (actor_id < 39 OR film_id <= 293) By playing with De Morgan's law, I have bubbled ANDs to the top, thereby letting it bound the scan over actor_id BETWEEN… Read more »