{"id":5157,"date":"2012-11-14T11:15:35","date_gmt":"2012-11-14T09:15:35","guid":{"rendered":"http:\/\/code.openark.org\/blog\/?p=5157"},"modified":"2012-11-14T11:20:08","modified_gmt":"2012-11-14T09:20:08","slug":"purging-old-rows-with-queryscript-three-use-cases","status":"publish","type":"post","link":"https:\/\/code.openark.org\/blog\/mysql\/purging-old-rows-with-queryscript-three-use-cases","title":{"rendered":"Purging old rows with QueryScript: three use cases"},"content":{"rendered":"<p>Problem: you need to purge old rows from a table. This may be your weekly\/monthly cleanup task. The table is large, the amount of rows to be deleted is large, and doing so in one big <strong>DELETE<\/strong> is too heavy.<\/p>\n<p>You can use <a href=\"http:\/\/openarkkit.googlecode.com\/svn\/trunk\/openarkkit\/doc\/html\/oak-chunk-update.html\">oak-chunk-update<\/a> or <a href=\"http:\/\/www.percona.com\/doc\/percona-toolkit\/2.1\/pt-archiver.html\">pt-archiver<\/a> to accomplish the task. You can also use server side scripting with <a href=\"http:\/\/common-schema.googlecode.com\/svn\/trunk\/common_schema\/doc\/html\/query_script.html\">QueryScript<\/a>, offering a very simple syntax with no external scripting, dependencies and command line options.<\/p>\n<p>I wish to present three cases of row deletion, with three different solutions. In all cases we assume some <strong>TIMESTAMP<\/strong> column exists in table, by which we choose to purge the row. In all cases we assume we wish to purge rows older than <strong>1<\/strong> month.<\/p>\n<p>We assume the naive query is this:<\/p>\n<blockquote>\n<pre>DELETE FROM my_schema.my_table WHERE row_timestamp &lt; CURDATE() - INTERVAL 1 MONTH<\/pre>\n<\/blockquote>\n<h4>Case 1: TIMESTAMP column is indexed<\/h4>\n<p>I almost always index a timestamp column, if only for being able to quickly purge data (but usually also to slice data by date). In this case where the column is indexed, it&#8217;s very easy to figure out which rows are older than <strong>1<\/strong> month.<\/p>\n<p>We break the naive query into smaller parts, and execute these in sequence:<!--more--><\/p>\n<blockquote>\n<pre>while (<span style=\"color: #000080;\"><strong>DELETE FROM<\/strong> my_schema.my_table <strong>WHERE<\/strong> row_timestamp &lt; CURDATE() - INTERVAL 1 MONTH <strong>ORDER BY<\/strong> row_timestamp <strong>LIMIT<\/strong> 1000<\/span>)\r\n\u00a0 throttle 1;<\/pre>\n<\/blockquote>\n<p>How does the above work?<\/p>\n<p>QueryScript accepts a <strong>DELETE<\/strong> statement as a conditional expression in a while loop. The expression evaluates to <strong>TRUE<\/strong> when the <strong>DELETE<\/strong> affects rows. Once the <strong>DELETE<\/strong> ceases to affect rows (when no more rows match the <strong>WHERE<\/strong> condition), the <a href=\"http:\/\/common-schema.googlecode.com\/svn\/trunk\/common_schema\/doc\/html\/query_script_while.html\"><strong>while<\/strong><\/a> loop terminates.<\/p>\n<p>The <a href=\"http:\/\/common-schema.googlecode.com\/svn\/trunk\/common_schema\/doc\/html\/query_script_throttle.html\"><strong>throttle<\/strong><\/a> command allows us to play <em>nice<\/em>: by throttling we increase the total runtime through sleeping in between loop iterations.<\/p>\n<h4>Case 2: TIMESTAMP column is not indexed, and there is no heuristic for matching rows<\/h4>\n<p>This case is hardest to tackle by means of optimization: there is no index, and we cannot assume or predict anything about the distribution of old rows. We must therefore scan the entire table so as to be able to purge old rows.<\/p>\n<p>This <em>does not<\/em> mean we have to do one huge full table scan. As long as we have some way to split the table, we are still good. We can utilize the <strong>PRIMARY KEY<\/strong> or another <strong>UNIQUE KEY<\/strong> so as to break the table into smaller, distinct parts, and work our way on these smaller chunks:<\/p>\n<blockquote>\n<pre><strong>split<\/strong> (<span style=\"color: #000080;\">DELETE FROM my_schema.my_table WHERE row_timestamp &lt; CURDATE() - INTERVAL 1 MONTH<\/span>)\r\n\u00a0 throttle 1;<\/pre>\n<\/blockquote>\n<p>The <a href=\"http:\/\/common-schema.googlecode.com\/svn\/trunk\/common_schema\/doc\/html\/query_script_split.html\"><strong>split<\/strong><\/a> statement will automagically calculate the chunks and inject filtering conditions onto the query, such that each execution of the query relates to a distinct set of rows.<\/p>\n<h4>Case 3: TIMESTAMP column not indexed, but known to be monotonic<\/h4>\n<p>This is true for many tables. Rows with <strong>AUTO_INCREMENT<\/strong> columns and <strong>TIMESTAMP<\/strong> columns are created with <strong>CURRENT_TIMESTAMP<\/strong> values. This makes for a monotonic function: as the <strong>AUTO_INCREMENT<\/strong> grows, so does the <strong>TIMESTAMP<\/strong>.<\/p>\n<p>This makes for the following observation: it we iterate the table row by row, and reach a point where the current row is not old, then we can stop looking. Timestamps will only increase by value, which means further rows only turn to be <em>newer<\/em>.<\/p>\n<p>With this special case at hand, we can:<\/p>\n<blockquote>\n<pre><strong>split<\/strong> (<span style=\"color: #000080;\"><strong><\/strong>DELETE FROM my_schema.my_table WHERE row_timestamp &lt; CURDATE() - INTERVAL 1 MONTH<\/span>) {\r\n\u00a0 if (<strong>$split_rowcount<\/strong> = 0)\r\n\u00a0\u00a0\u00a0 break;\r\n\u00a0 throttle 1;\r\n}<\/pre>\n<\/blockquote>\n<p><em>split<\/em> is a looping device, and a <a href=\"http:\/\/common-schema.googlecode.com\/svn\/trunk\/common_schema\/doc\/html\/query_script_break.html\"><strong>break<\/strong><\/a> statement works on <em>split<\/em> just as on a <strong>while<\/strong> statement.<\/p>\n<p><em>split<\/em> provides with magic variables which describe current chunk status. <strong>$split_rowcount<\/strong> relates to the number of rows affected by last chunk query. No more rows affected? This means we&#8217;ve hit recent rows, and we do not expect to find old rows any further. We can stop looking.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Problem: you need to purge old rows from a table. This may be your weekly\/monthly cleanup task. The table is large, the amount of rows to be deleted is large, and doing so in one big DELETE is too heavy. You can use oak-chunk-update or pt-archiver to accomplish the task. You can also use server [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"enabled":false},"version":2}},"categories":[5],"tags":[93,67,26,76,50],"class_list":["post-5157","post","type-post","status-publish","format-standard","hentry","category-mysql","tag-bulk-operations","tag-common_schema","tag-indexing","tag-queryscript","tag-scripts"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p2bZZp-1lb","_links":{"self":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/5157","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/comments?post=5157"}],"version-history":[{"count":29,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/5157\/revisions"}],"predecessor-version":[{"id":5786,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/5157\/revisions\/5786"}],"wp:attachment":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/media?parent=5157"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/categories?post=5157"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/tags?post=5157"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}