{"id":4588,"date":"2012-02-09T10:33:08","date_gmt":"2012-02-09T08:33:08","guid":{"rendered":"http:\/\/code.openark.org\/blog\/?p=4588"},"modified":"2012-02-09T10:33:08","modified_gmt":"2012-02-09T08:33:08","slug":"mysqlqueryscript-use-case-delete-all-but-top-n-records-per-group","status":"publish","type":"post","link":"https:\/\/code.openark.org\/blog\/mysql\/mysqlqueryscript-use-case-delete-all-but-top-n-records-per-group","title":{"rendered":"MySQL\/QueryScript use case: DELETE all but top N records per group"},"content":{"rendered":"<p>Some administrative tasks can be simplified by using <a href=\"http:\/\/common-schema.googlecode.com\/svn\/trunk\/common_schema\/doc\/html\/query_script.html\">common_schema\/QueryScript<\/a>. I&#8217;m collecting a bunch of these for documentation. Here&#8217;s one for example:<\/p>\n<p>The DBA\/developer has the task of retaining only top <strong>3<\/strong> most populated countries per continent. That is, she has to <strong>DELETE 4th, 5th, 6th<\/strong>, &#8230; most populated counties in each continent.<\/p>\n<p>Is it possible to work out with a single query? Yes. But the query is not pretty. In fact, it is quite complicated, and either involves unintuitive subqueries, or <a href=\"http:\/\/code.openark.org\/blog\/mysql\/sql-selecting-top-n-records-per-group\">unintuitive hacks<\/a>. A normal DBA would not want to write, neither maintain this kind of query, unless top-notch-geek, which is fine.<\/p>\n<p>Since this is a one time job, we just need to get it done. And <em>common_schema<\/em>\/QueryScript provide with the intuitive solution: if we read our demand aloud, we realize we want to <strong>delete<\/strong> <strong>4th, 5th, 6th<\/strong>, &#8230; populated countries <strong>for each<\/strong> continent.<\/p>\n<p>I present a solution made available by QueryScript, and discuss the ways in which the code overcomes limitations, or simplifies complexity:<\/p>\n<blockquote>\n<pre>var $num_countries_to_delete;\r\nforeach($continent, $num_countries: SELECT continent, COUNT(*) FROM world.Country GROUP BY continent)\r\n{\r\n  if ($num_countries &gt; 3)\r\n  {\r\n    set $num_countries_to_delete := $num_countries - 3;\r\n    DELETE FROM world.Country WHERE Continent = $continent ORDER BY Population ASC LIMIT :$num_countries_to_delete;\r\n  }\r\n}<\/pre>\n<\/blockquote>\n<h4>Discussion<\/h4>\n<p>The first thing that should be apparent from the above is that this is a <em>programmatic<\/em> solution. Queries are declarative, which is why complex ones sometimes look incomprehensible. The above is more straightforward.<!--more--><\/p>\n<p>The next thing to realize, which is a disclosure issue of some sorts, is that the above code is fine for a one time, or maintenance execution; but you wouldn&#8217;t want to be normally issuing this type of code against your database <strong>10,000<\/strong> times a second.<\/p>\n<p>Now let&#8217;s break down the code to fragments:<\/p>\n<h4>Discussion: variables<\/h4>\n<p>The <strong>$num_countries_to_delete<\/strong> is a script variable. It is local. It is reset to <strong>NULL<\/strong> upon declaration and destroyed when its visibility ends. But the <em>real power<\/em> comes later, when it is <a href=\"http:\/\/common-schema.googlecode.com\/svn\/trunk\/common_schema\/doc\/html\/query_script_variables.html#expansion\">expanded<\/a>. This is discussed last.<\/p>\n<h4>Discussion: iteration<\/h4>\n<p>How would you iterate the continents using a stored routine? I personally think the syntax for server side cursors is overwhelmingly verbose. Declare a cursor, declare a continue handler, declare variables to grab values, open the cursor, start a loop, iteratively fetch the cursor (assign row values onto variables), oh, check up on the continue handler (<em>programmatically<\/em> exit the loop if it fails), close the cursor.<\/p>\n<p>The <a href=\"http:\/\/common-schema.googlecode.com\/svn\/trunk\/common_schema\/doc\/html\/query_script_foreach.html\">foreach()<\/a> loop statement was developed to simplify all the above. Hey: just name your query, and the list of variables which should be assigned to, and do your thing in the following statement.<\/p>\n<h4>Discussion: conditional branching<\/h4>\n<p>The standard SQL <strong>CASE<\/strong> statement, and the additional <strong>IF()<\/strong> statement are fine, and I use them a lot. But they are fine for <strong>SELECT<\/strong> queries, and only allow you to <em>get<\/em> data. At best, you may invoke a function based on some condition, which can actually modify data.<\/p>\n<p>With QueryScript it&#8217;s as with your normal programming language: you can <strong>DELETE<\/strong> if some condition holds true, <strong>INSERT<\/strong> or <strong>SELECT<\/strong> or <strong>ALTER<\/strong> or whatever if false.<\/p>\n<p>In the above code there isn&#8217;t too much news. The same can be done with stored routines. However the <strong>if<\/strong> statement can also accept a query as a condition. One can ask: <strong>if (DELETE FROM &#8230; WHERE&#8230;)<\/strong>. The condition holds true only is the operation was successful (rows actually DELETEd, or INSERTed, or UPDATEed). This makes for a very tight integration between script and SQL.<\/p>\n<h4>Discussion: variables and variable expansion<\/h4>\n<p>Script variables behave just as normal MySQL user defined variables (in fact, current internal implementation of script variables is <em>by<\/em> user defined variables). Which means the <strong>set<\/strong> statement works for them just as normal.<\/p>\n<p>And here is where things become not-normal:<\/p>\n<p>Say we want to delete all but the 3 most populated countries in Europe. Wouldn&#8217;t we like to issue a <strong>DELETE FROM Country WHERE Continent = &#8216;Europe&#8217; ORDER BY Population DESC LIMIT 3, 999999999<\/strong>? (The 9999999999 to resemble &#8220;infinite&#8221;, in poor man&#8217;s solution)<\/p>\n<p>But MySQL&#8217;s <strong>DELETE<\/strong> does not accept both limit &amp; offset in the <strong>LIMIT<\/strong> clause. Just the limit part. Which is why we&#8217;re working the other way round: we find out the number of records we wish to purge and delete bottom up. But wait, here&#8217;s another problem:<\/p>\n<p>In MySQL, the <strong>LIMIT<\/strong> clause <em>must accept a constant<\/em>. You can just <strong>DELETE FROM &#8230;. LIMIT @x<\/strong>. This makes for a syntax error. Bummer!<\/p>\n<p>If we don&#8217;t know ahead the number of records we wish to purge, how can we work both dynamically and correctly?<\/p>\n<p>Enter variable <a href=\"http:\/\/common-schema.googlecode.com\/svn\/trunk\/common_schema\/doc\/html\/query_script_variables.html#expansion\">expansion<\/a>. In the statement:<\/p>\n<blockquote>\n<pre>DELETE FROM world.Country WHERE Continent = $continent ORDER BY Population ASC LIMIT :$num_countries_to_delete;<\/pre>\n<\/blockquote>\n<p>The <strong>$num_countries_to_delete<\/strong> variable is expanded, via &#8220;<strong>:<\/strong>&#8220;. The <strong>:$num_countries_to_delete<\/strong> token is replaced in-place with the value contained by <strong>$num_countries_to_delete<\/strong>. MySQL never gets a variable in the <strong>LIMIT<\/strong> clause: by the time the query reaches MySQL, theres a <em>constant<\/em> in place, and none is the wiser. But as far as <em>we&#8217;re<\/em> concerned, we get a dynamic way of producing values to the <strong>LIMIT<\/strong> clause.<\/p>\n<p><strong>LIMIT<\/strong> is not the only clause which expects constants. How about <strong>KILL<\/strong>? How about DDLs, such as <strong>CREATE TABLE<\/strong>? With variable expansion you can dynamically inject values onto such clauses, statements and commands, and get your self a dynamic script.<\/p>\n<h4>Conclusion<\/h4>\n<p>This small code sample exposes much of QueryScript&#8217;s power. Throughout the months of development, I happened to use QueryScript code over and over on production, to realize how it can sometimes simplify very complex tasks into a mere 2-liner code. A code that any of my fellow programmers can understand, as well, without having to be SQL experts. And such which is executed within the server; no need for external languages, connectors, dependencies, packages etc.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Some administrative tasks can be simplified by using common_schema\/QueryScript. I&#8217;m collecting a bunch of these for documentation. Here&#8217;s one for example: The DBA\/developer has the task of retaining only top 3 most populated countries per continent. That is, she has to DELETE 4th, 5th, 6th, &#8230; most populated counties in each continent. Is it possible [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"enabled":false},"version":2}},"categories":[5],"tags":[67,76,50,21],"class_list":["post-4588","post","type-post","status-publish","format-standard","hentry","category-mysql","tag-common_schema","tag-queryscript","tag-scripts","tag-sql"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p2bZZp-1c0","_links":{"self":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/4588","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/comments?post=4588"}],"version-history":[{"count":22,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/4588\/revisions"}],"predecessor-version":[{"id":4714,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/4588\/revisions\/4714"}],"wp:attachment":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/media?parent=4588"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/categories?post=4588"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/tags?post=4588"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}