MySQL/QueryScript use case: DELETE all but top N records per group

Some administrative tasks can be simplified by using common_schema/QueryScript. I’m collecting a bunch of these for documentation. Here’s one for example:

The DBA/developer has the task of retaining only top 3 most populated countries per continent. That is, she has to DELETE 4th, 5th, 6th, … most populated counties in each continent.

Is it possible to work out with a single query? Yes. But the query is not pretty. In fact, it is quite complicated, and either involves unintuitive subqueries, or unintuitive hacks. A normal DBA would not want to write, neither maintain this kind of query, unless top-notch-geek, which is fine.

Since this is a one time job, we just need to get it done. And common_schema/QueryScript provide with the intuitive solution: if we read our demand aloud, we realize we want to delete 4th, 5th, 6th, … populated countries for each continent.

I present a solution made available by QueryScript, and discuss the ways in which the code overcomes limitations, or simplifies complexity:

var $num_countries_to_delete;
foreach($continent, $num_countries: SELECT continent, COUNT(*) FROM world.Country GROUP BY continent)
{
  if ($num_countries > 3)
  {
    set $num_countries_to_delete := $num_countries - 3;
    DELETE FROM world.Country WHERE Continent = $continent ORDER BY Population ASC LIMIT :$num_countries_to_delete;
  }
}

Discussion

The first thing that should be apparent from the above is that this is a programmatic solution. Queries are declarative, which is why complex ones sometimes look incomprehensible. The above is more straightforward.

The next thing to realize, which is a disclosure issue of some sorts, is that the above code is fine for a one time, or maintenance execution; but you wouldn’t want to be normally issuing this type of code against your database 10,000 times a second.

Now let’s break down the code to fragments:

Discussion: variables

The $num_countries_to_delete is a script variable. It is local. It is reset to NULL upon declaration and destroyed when its visibility ends. But the real power comes later, when it is expanded. This is discussed last.

Discussion: iteration

How would you iterate the continents using a stored routine? I personally think the syntax for server side cursors is overwhelmingly verbose. Declare a cursor, declare a continue handler, declare variables to grab values, open the cursor, start a loop, iteratively fetch the cursor (assign row values onto variables), oh, check up on the continue handler (programmatically exit the loop if it fails), close the cursor.

The foreach() loop statement was developed to simplify all the above. Hey: just name your query, and the list of variables which should be assigned to, and do your thing in the following statement.

Discussion: conditional branching

The standard SQL CASE statement, and the additional IF() statement are fine, and I use them a lot. But they are fine for SELECT queries, and only allow you to get data. At best, you may invoke a function based on some condition, which can actually modify data.

With QueryScript it’s as with your normal programming language: you can DELETE if some condition holds true, INSERT or SELECT or ALTER or whatever if false.

In the above code there isn’t too much news. The same can be done with stored routines. However the if statement can also accept a query as a condition. One can ask: if (DELETE FROM … WHERE…). The condition holds true only is the operation was successful (rows actually DELETEd, or INSERTed, or UPDATEed). This makes for a very tight integration between script and SQL.

Discussion: variables and variable expansion

Script variables behave just as normal MySQL user defined variables (in fact, current internal implementation of script variables is by user defined variables). Which means the set statement works for them just as normal.

And here is where things become not-normal:

Say we want to delete all but the 3 most populated countries in Europe. Wouldn’t we like to issue a DELETE FROM Country WHERE Continent = ‘Europe’ ORDER BY Population DESC LIMIT 3, 999999999? (The 9999999999 to resemble “infinite”, in poor man’s solution)

But MySQL’s DELETE does not accept both limit & offset in the LIMIT clause. Just the limit part. Which is why we’re working the other way round: we find out the number of records we wish to purge and delete bottom up. But wait, here’s another problem:

In MySQL, the LIMIT clause must accept a constant. You can just DELETE FROM …. LIMIT @x. This makes for a syntax error. Bummer!

If we don’t know ahead the number of records we wish to purge, how can we work both dynamically and correctly?

Enter variable expansion. In the statement:

DELETE FROM world.Country WHERE Continent = $continent ORDER BY Population ASC LIMIT :$num_countries_to_delete;

The $num_countries_to_delete variable is expanded, via “:“. The :$num_countries_to_delete token is replaced in-place with the value contained by $num_countries_to_delete. MySQL never gets a variable in the LIMIT clause: by the time the query reaches MySQL, theres a constant in place, and none is the wiser. But as far as we’re concerned, we get a dynamic way of producing values to the LIMIT clause.

LIMIT is not the only clause which expects constants. How about KILL? How about DDLs, such as CREATE TABLE? With variable expansion you can dynamically inject values onto such clauses, statements and commands, and get your self a dynamic script.

Conclusion

This small code sample exposes much of QueryScript’s power. Throughout the months of development, I happened to use QueryScript code over and over on production, to realize how it can sometimes simplify very complex tasks into a mere 2-liner code. A code that any of my fellow programmers can understand, as well, without having to be SQL experts. And such which is executed within the server; no need for external languages, connectors, dependencies, packages etc.

One thought on “MySQL/QueryScript use case: DELETE all but top N records per group

  1. Here’s how to do a similar thing with US cities and states. I’ll leave it as an exercise to turn it into a DELETE.

    # The three most populous cities in each state:
    CREATE TEMPORARY TABLE t (
        seq SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
        PRIMARY KEY (state, seq)
    )  ENGINE = MyISAM   -- required for the PRIMARY KEY trick
    SELECT state, city, population
        FROM US
        ORDER BY state, population DESC;
    SELECT state, seq, city, population
        FROM t
        WHERE seq <= 3
        ORDER BY state, seq;
    +-------+-----+--------------+------------+
    | state | seq | city         | population |
    +-------+-----+--------------+------------+
    | AK    |   1 | Anchorage    |     276263 |
    | AK    |   2 | Juneau       |      31796 |
    | AK    |   3 | Fairbanks    |      31351 |
    | AL    |   1 | Birmingham   |     231621 |
    | AL    |   2 | Montgomery   |     198325 |
    | AL    |   3 | Mobile       |     190274 |
    | AR    |   1 | Little Rock  |     184217 |
    | AR    |   2 | Fort Smith   |      81985 |
    | AR    |   3 | Fayetteville |      64864 |
    | AZ    |   1 | Phoenix      |    1428509 |
    | AZ    |   2 | Tucson       |     518907 |
    ...
    

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.