{"id":5035,"date":"2012-09-06T07:25:07","date_gmt":"2012-09-06T05:25:07","guid":{"rendered":"http:\/\/code.openark.org\/blog\/?p=5035"},"modified":"2012-09-06T08:05:16","modified_gmt":"2012-09-06T06:05:16","slug":"how-common_schema-splits-tables-internals","status":"publish","type":"post","link":"https:\/\/code.openark.org\/blog\/mysql\/how-common_schema-splits-tables-internals","title":{"rendered":"How common_schema split()s tables &#8211; internals"},"content":{"rendered":"<p>This post exposes some of the internals, and the SQL behind QueryScript&#8217;s <em>split<\/em>. <a href=\"http:\/\/common-schema.googlecode.com\/svn\/trunk\/common_schema\/doc\/html\/query_script.html\">common_schema\/QueryScript<\/a> <strong>1.1<\/strong> introduces the <a href=\"http:\/\/common-schema.googlecode.com\/svn\/trunk\/common_schema\/doc\/html\/query_script_split.html\"><strong>split<\/strong><\/a> statement, which auto-breaks a &#8220;large&#8221; query (one which operates on large tables as a whole or without keys) into smaller queries, and executes them in sequence.<\/p>\n<p>This makes for easier transactions, fewer locks held, and potentially (depending on the user) more idle time released back to the database. 
<em>split<\/em> has similar concepts to <a href=\"http:\/\/openarkkit.googlecode.com\/svn\/trunk\/openarkkit\/doc\/html\/oak-chunk-update.html\">oak-chunk-update<\/a> and <a href=\"http:\/\/www.percona.com\/doc\/percona-toolkit\/2.1\/pt-archiver.html\">pt-archiver<\/a>, but works differently, and is implemented entirely in SQL, on the server side.<\/p>\n<p>Take the following statement as an example:<\/p>\n<blockquote>\n<pre><strong>split<\/strong> (<strong>UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR<\/strong>)\r\n\u00a0 pass;<\/pre>\n<\/blockquote>\n<p>It yields (roughly) the following statements:<\/p>\n<blockquote>\n<pre>UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR <strong>WHERE<\/strong> ((((`inventory`.`inventory_id` &gt; '1')) OR ((`inventory`.`inventory_id` = '1'))) AND (((`inventory`.`inventory_id` &lt; '1000')) OR ((`inventory`.`inventory_id` = '1000'))));\r\nUPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR <strong>WHERE<\/strong> ((((`inventory`.`inventory_id` &gt; '1000'))) AND (((`inventory`.`inventory_id` &lt; '2000')) OR ((`inventory`.`inventory_id` = '2000'))));\r\nUPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR <strong>WHERE<\/strong> ((((`inventory`.`inventory_id` &gt; '2000'))) AND (((`inventory`.`inventory_id` &lt; '3000')) OR ((`inventory`.`inventory_id` = '3000'))));\r\nUPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR <strong>WHERE<\/strong> ((((`inventory`.`inventory_id` &gt; '3000'))) AND (((`inventory`.`inventory_id` &lt; '4000')) OR ((`inventory`.`inventory_id` = '4000'))));\r\nUPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR <strong>WHERE<\/strong> ((((`inventory`.`inventory_id` &gt; '4000'))) AND (((`inventory`.`inventory_id` &lt; '4581')) OR ((`inventory`.`inventory_id` = '4581'))));<\/pre>\n<\/blockquote>\n<p>(I say &#8220;roughly&#8221; because internally there are user defined 
variables at play, but for convenience, I spell out the actual values as constants.)<\/p>\n<h4>How does that work?<\/h4>\n<p><em>common_schema<\/em> works on the server side. There is no Perl script or anything. It must therefore use server-side operations to:<\/p>\n<ul>\n<li>Identify the table to be split<\/li>\n<li>Analyze the table in the first place, deciding how to split it<\/li>\n<li>Analyze the query, deciding how to rewrite it<\/li>\n<li>Split the table (logically) into unique and distinct chunks<\/li>\n<li>Work out the query on each such chunk<\/li>\n<\/ul>\n<p>Following is an internal look at how <em>common_schema<\/em> does all the above.<!--more--><\/p>\n<h4>Identifying the table<\/h4>\n<p>When a query operates on a single table, <em>split<\/em> is able to parse the query&#8217;s SQL and deduce that table. When multiple tables are involved, <em>split<\/em> requires user instruction: which table is it that the query should be split by?<\/p>\n<h4>Analyzing the table<\/h4>\n<p>Table analysis is done via a <em>similar<\/em> method to <a href=\"http:\/\/common-schema.googlecode.com\/svn\/trunk\/common_schema\/doc\/html\/candidate_keys_recommended.html\">candidate_keys_recommended<\/a>. It is almost identical, only it uses <a href=\"http:\/\/dev.mysql.com\/doc\/refman\/5.1\/en\/information-schema-optimization.html\">INFORMATION_SCHEMA optimizations<\/a> to make the query short and lightweight. Simulating the analysis using <strong>candidate_keys_recommended<\/strong>, we get:<\/p>\n<blockquote>\n<pre>mysql&gt; select * from candidate_keys_recommended where table_name='inventory' \\G\r\n*************************** 1. 
row ***************************\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 table_schema: sakila\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 table_name: inventory\r\nrecommended_index_name: PRIMARY\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 has_nullable: 0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 is_primary: 1\r\n\u00a0count_column_in_index: 1\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 column_names: inventory_id<\/pre>\n<\/blockquote>\n<p>This is cool, simple and very easy to work with: we choose to split the table via the <strong>inventory_id<\/strong> column, which is conveniently an integer. We&#8217;ll soon see that <em>split<\/em> can handle complex cases as well.<\/p>\n<h4>Analyzing the query<\/h4>\n<p>This is done in part via Roland&#8217;s <a href=\"http:\/\/common-schema.googlecode.com\/svn\/trunk\/common_schema\/doc\/html\/query_analysis_routines.html\">query_analysis_routines<\/a>, and in part by simply parsing the query, looking for <strong>WHERE<\/strong>, <strong>GROUP BY<\/strong>, <strong>LIMIT<\/strong> etc. clauses.<\/p>\n<p>The nice part is injecting a <strong>WHERE<\/strong> condition, which didn&#8217;t appear in the original query. That <strong>WHERE<\/strong> condition is what limits the query to a distinct chunk of rows.<\/p>\n<h4>Splitting the table<\/h4>\n<p>With a single <strong>INTEGER PRIMARY KEY<\/strong> this sounds simple, right? Take rows <strong>1..1,000<\/strong>, then <strong>1,001..2,000<\/strong>, then <strong>2,001..3,000<\/strong> etc.<\/p>\n<p>Wrong: even with this simple scenario, things are much more complex. Are the numbers successive? What if there are holes? What if there is a <strong>1,000,000<\/strong> gap between every two numbers? What if there are multiple holes of differing size and frequency?<\/p>\n<p>And if we have two columns in our <strong>UNIQUE KEY<\/strong>? 
What if one of them is textual, not an <strong>INTEGER<\/strong>, and the other a <strong>TIMESTAMP<\/strong>, not an <strong>INTEGER<\/strong> either?<\/p>\n<p><em>split<\/em> doesn&#8217;t work in that naive way. It makes no assumptions about the density of values. It only requires:<\/p>\n<ul>\n<li>some <strong>UNIQUE KEY<\/strong> to work with,<\/li>\n<li>which has no <strong>NULL<\/strong> values.<\/li>\n<\/ul>\n<p>Given the above, it uses <em>User Defined Variables<\/em> to set up the chunks. With our single <strong>INTEGER<\/strong> column, the minimum value is set like this:<\/p>\n<blockquote>\n<pre>select \r\n\u00a0 inventory_id \r\nfrom \r\n\u00a0 `sakila`.`inventory` \r\norder by \r\n\u00a0 inventory_id ASC \r\nlimit 1 \u00a0\r\ninto @_split_column_variable_min_1\r\n;<\/pre>\n<\/blockquote>\n<p>This sets the first value of the first chunk. What value terminates this chunk? It is calculated like this:<\/p>\n<blockquote>\n<pre>select \r\n\u00a0 inventory_id \r\nfrom (\r\n\u00a0 select \r\n\u00a0\u00a0\u00a0 inventory_id \r\n\u00a0 from \r\n\u00a0\u00a0\u00a0 `sakila`.`inventory` \r\n\u00a0 where \r\n\u00a0\u00a0\u00a0 (((`inventory`.`inventory_id` &gt; @_split_column_variable_range_start_1)) OR ((`inventory`.`inventory_id` = @_split_column_variable_range_start_1))) and (((`inventory`.`inventory_id` &lt; @_split_column_variable_max_1)) OR ((`inventory`.`inventory_id` = @_split_column_variable_max_1))) \r\n\u00a0 order by \r\n\u00a0\u00a0\u00a0 inventory_id ASC limit 1000 \r\n\u00a0 ) sel_split_range \u00a0\r\norder by \r\n\u00a0 inventory_id DESC \r\nlimit 1 \u00a0\r\ninto @_split_column_variable_range_end_1\r\n;<\/pre>\n<\/blockquote>\n<p>Now there&#8217;s a query you wouldn&#8217;t want to write by hand, now would you?<\/p>\n<p>The cool part here is that the above works well for any type of column; this doesn&#8217;t have to be an <strong>INTEGER<\/strong>. Dates, strings etc. 
are all just fine.<\/p>\n<p>The above also works well for multiple columns, where the query gets more complicated (see below).<\/p>\n<h4>Working out the query per chunk<\/h4>\n<p>This part is the easy one, now that all the hard work is done. We know how to manipulate the query, we know the lower and upper boundaries of the chunk, so we just fill in the values and execute.<\/p>\n<h4>Multi-column keys<\/h4>\n<p>Consider a similar query on <strong>sakila.film_actor<\/strong>, where the <strong>PRIMARY KEY<\/strong> is a compound of two columns:<\/p>\n<blockquote>\n<pre>split (UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR)\r\n\u00a0 throttle 2;<\/pre>\n<\/blockquote>\n<p>The chunked queries will look like this:<\/p>\n<blockquote>\n<pre>UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR <strong>WHERE<\/strong> ((((`film_actor`.`actor_id` &gt; '1')) OR ((`film_actor`.`actor_id` = '1') AND (`film_actor`.`film_id` &gt; '1')) OR ((`film_actor`.`actor_id` = '1') AND (`film_actor`.`film_id` = '1'))) AND (((`film_actor`.`actor_id` &lt; '39')) OR ((`film_actor`.`actor_id` = '39') AND (`film_actor`.`film_id` &lt; '293')) OR ((`film_actor`.`actor_id` = '39') AND (`film_actor`.`film_id` = '293'))));\r\nUPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR <strong>WHERE<\/strong> ((((`film_actor`.`actor_id` &gt; '39')) OR ((`film_actor`.`actor_id` = '39') AND (`film_actor`.`film_id` &gt; '293'))) AND (((`film_actor`.`actor_id` &lt; '76')) OR ((`film_actor`.`actor_id` = '76') AND (`film_actor`.`film_id` &lt; '234')) OR ((`film_actor`.`actor_id` = '76') AND (`film_actor`.`film_id` = '234'))));\r\nUPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR <strong>WHERE<\/strong> ((((`film_actor`.`actor_id` &gt; '76')) OR ((`film_actor`.`actor_id` = '76') AND (`film_actor`.`film_id` &gt; '234'))) AND (((`film_actor`.`actor_id` &lt; '110')) OR ((`film_actor`.`actor_id` = '110') AND 
(`film_actor`.`film_id` &lt; '513')) OR ((`film_actor`.`actor_id` = '110') AND (`film_actor`.`film_id` = '513'))));\r\nUPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR <strong>WHERE<\/strong> ((((`film_actor`.`actor_id` &gt; '110')) OR ((`film_actor`.`actor_id` = '110') AND (`film_actor`.`film_id` &gt; '513'))) AND (((`film_actor`.`actor_id` &lt; '146')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` &lt; '278')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` = '278'))));\r\nUPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR <strong>WHERE<\/strong> ((((`film_actor`.`actor_id` &gt; '146')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` &gt; '278'))) AND (((`film_actor`.`actor_id` &lt; '183')) OR ((`film_actor`.`actor_id` = '183') AND (`film_actor`.`film_id` &lt; '862')) OR ((`film_actor`.`actor_id` = '183') AND (`film_actor`.`film_id` = '862'))));\r\nUPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR <strong>WHERE<\/strong> ((((`film_actor`.`actor_id` &gt; '183')) OR ((`film_actor`.`actor_id` = '183') AND (`film_actor`.`film_id` &gt; '862'))) AND (((`film_actor`.`actor_id` &lt; '200')) OR ((`film_actor`.`actor_id` = '200') AND (`film_actor`.`film_id` &lt; '993')) OR ((`film_actor`.`actor_id` = '200') AND (`film_actor`.`film_id` = '993'))));<\/pre>\n<\/blockquote>\n<p>View the complete command to realize just how much more complex each query is, and how much more complex the chunking becomes. 
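<\/p>\n<p>As an aside, each such generated condition is simply the expansion of a row (tuple) comparison. The second chunk&#8217;s condition above, for example, is logically equivalent to the following row constructor form (a hypothetical rewrite, for illustration only; <em>split<\/em> does not use it, since MySQL does not necessarily use an index to resolve row constructor comparisons):<\/p>\n<blockquote>\n<pre>(actor_id, film_id) &gt; (39, 293) AND (actor_id, film_id) &lt;= (76, 234)<\/pre>\n<\/blockquote>\n<p>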
Here&#8217;s how I evaluate the chunk&#8217;s &#8220;next range end&#8221; variables:<\/p>\n<blockquote>\n<pre>select \r\n\u00a0 actor_id, film_id \r\nfrom (\r\n\u00a0 select \r\n\u00a0\u00a0\u00a0 actor_id, film_id \r\n\u00a0 from \r\n\u00a0\u00a0\u00a0 `sakila`.`film_actor` \r\n\u00a0 where \r\n\u00a0\u00a0\u00a0 (((`film_actor`.`actor_id` &gt; @_split_column_variable_range_start_1)) OR ((`film_actor`.`actor_id` = @_split_column_variable_range_start_1) AND (`film_actor`.`film_id` &gt; @_split_column_variable_range_start_2))) and (((`film_actor`.`actor_id` &lt; @_split_column_variable_max_1)) OR ((`film_actor`.`actor_id` = @_split_column_variable_max_1) AND (`film_actor`.`film_id` &lt; @_split_column_variable_max_2)) OR ((`film_actor`.`actor_id` = @_split_column_variable_max_1) AND (`film_actor`.`film_id` = @_split_column_variable_max_2))) \r\n\u00a0 order by \r\n\u00a0\u00a0\u00a0 actor_id ASC, film_id ASC \r\n\u00a0 limit 1000 \r\n\u00a0 ) sel_split_range \u00a0\r\norder by \r\n\u00a0 actor_id DESC, film_id DESC \r\nlimit 1 \u00a0\r\ninto @_split_column_variable_range_end_1, @_split_column_variable_range_end_2\r\n;<\/pre>\n<\/blockquote>\n<p>By the way, you may recall that everything is done server side. The <strong>WHERE<\/strong> condition for the chunked queries is itself generated via an SQL statement, rather than by programmatic logic. 
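<\/p>\n<p>The main building block is <strong>GROUP_CONCAT<\/strong>, which assembles text fragments into a single expression. A toy example of the technique (not the actual <em>split<\/em> code; the names here are made up) would be:<\/p>\n<blockquote>\n<pre>select \r\n\u00a0 group_concat('(', column_name, ' &gt; ', variable_name, ')' order by n separator ' OR ')\r\nfrom (\r\n\u00a0 select 1 as n, 'actor_id' as column_name, '@v1' as variable_name\r\n\u00a0 union all\r\n\u00a0 select 2, 'film_id', '@v2'\r\n) t;\r\n-- yields: (actor_id &gt; @v1) OR (film_id &gt; @v2)<\/pre>\n<\/blockquote>\n<p>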
Here&#8217;s <em>part<\/em> of the query which computes the limiting condition:<\/p>\n<blockquote>\n<pre>\u00a0 select\r\n\u00a0\u00a0\u00a0 group_concat('(', partial_comparison, ')' order by n separator ' OR ') as comparison\r\n\u00a0 from (\r\n\u00a0\u00a0\u00a0 select \r\n\u00a0\u00a0\u00a0\u00a0\u00a0 n,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0 group_concat('(', column_name, ' ', if(is_last, comparison_operator, '='), ' ', variable_name, ')' order by column_order separator ' AND ') as partial_comparison\r\n\u00a0\u00a0\u00a0 from (\r\n\u00a0\u00a0\u00a0\u00a0\u00a0 select \r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 n, CONCAT(mysql_qualify(split_table_name), '.', mysql_qualify(column_name)) AS column_name,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 case split_variable_type\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 when 'range_start' then range_start_variable_name\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 when 'range_end' then range_end_variable_name\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 when 'max' then max_variable_name\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 end as variable_name,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 _split_column_names_table.column_order, _split_column_names_table.column_order = n as is_last \r\n\u00a0\u00a0\u00a0\u00a0\u00a0 from \r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 numbers, _split_column_names_table \r\n\u00a0\u00a0\u00a0\u00a0\u00a0 where \r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 n between _split_column_names_table.column_order and num_split_columns \r\n\u00a0\u00a0\u00a0\u00a0\u00a0 order by n, _split_column_names_table.column_order\r\n\u00a0\u00a0\u00a0 ) s1\r\n\u00a0\u00a0\u00a0 group by n\r\n\u00a0 ) s2\r\n\u00a0 into return_value\r\n\u00a0 ;<\/pre>\n<\/blockquote>\n<p>There is a lot of complexity to <em>split<\/em>, all of it meant to provide the user with as clean a syntax as possible.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post 
exposes some of the internals, and the SQL behind QueryScript&#8217;s split. common_schema\/QueryScript 1.1 introduces the split statement, which auto-breaks a &#8220;large&#8221; query (one which operates on large tables as a whole or without keys) into smaller queries, and executes them in sequence. This makes for easier transactions, less locks held, potentially (depending on [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"enabled":false},"version":2}},"categories":[5],"tags":[67,26,24,34,76,50],"class_list":["post-5035","post","type-post","status-publish","format-standard","hentry","category-mysql","tag-common_schema","tag-indexing","tag-information_schema","tag-openark-kit","tag-queryscript","tag-scripts"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p2bZZp-1jd","_links":{"self":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/5035","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/comments?post=5035"}],"version-history":[{"count"
:48,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/5035\/revisions"}],"predecessor-version":[{"id":5370,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/5035\/revisions\/5370"}],"wp:attachment":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/media?parent=5035"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/categories?post=5035"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/tags?post=5035"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}