A few days ago I published SQL: selecting top N records per group. It just dawned on me that the very same query solves another type of problem I was facing a couple of years ago.
BTW, for reference, Baron Schwartz posted this 4 years ago. There are very interesting approaches both in the text and in the comments. Good to see the ancients already knew of such problems; I should study my history better.
(Kidding, kidding! This is self-criticism, of course)
In the case I present now, I have a table with recurring data which, to some extent, represents revisions of data. For the sake of simplicity, let's describe it as a simple simulation of a revision system:
- I have text files, whose content I store within a row
- Each text file uses at least one row in the table
- Text files can be edited; an edited file is written as a new row (rows are never UPDATEd).
Hold your horses: I’m not really implementing a revision system; I just can’t get into the actual details here.
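Still, to make things concrete, here’s a minimal sketch of such a table. The name file_revisions and all of its columns are hypothetical, invented for illustration only:

CREATE TABLE file_revisions (
  id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  file_name VARCHAR(255) NOT NULL,    -- identifies the text file
  content MEDIUMTEXT NOT NULL,        -- full content of this revision
  KEY name_id_idx (file_name, id)     -- supports per-file grouping & ordering
);

Every edit of a file INSERTs a new row; the newest revision of a file is simply the row with the largest id for that file_name.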
The table becomes large quickly, and it’s desired to purge rows from the table. The rule is: we must retain the two most recent versions of each file (if more than one exists); all older revisions can be purged. So the question is:
Which are all the 3rd newest, 4th newest, …, older, …, oldest revisions per file?
To use the example from the aforementioned post:
Which are the 3rd largest, 4th largest, …, smallest countries in each continent?
Assuming group_concat_max_len is large enough and the numbers table covers enough values, all we need to do is change this condition:
WHERE tinyint_asc.value >= 1 AND tinyint_asc.value <= 5
into:
WHERE tinyint_asc.value >= 3
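To make that concrete, here is a sketch of what the full query might look like with the modified condition, following the GROUP_CONCAT/SUBSTRING_INDEX technique from the original post. It assumes the world sample database and a tinyint_asc numbers table holding the values 0 through 255:

SELECT
  Continent,
  SUBSTRING_INDEX(SUBSTRING_INDEX(names, ',', value), ',', -1) AS Name
FROM (
  SELECT
    Continent,
    COUNT(*) AS num_countries,
    GROUP_CONCAT(Name ORDER BY Population DESC) AS names
  FROM world.Country
  GROUP BY Continent
) AS grouped
JOIN tinyint_asc
  ON tinyint_asc.value >= 3                       -- skip the 2 largest per continent
  AND tinyint_asc.value <= grouped.num_countries  -- don't run past the group's size
;

The nested SUBSTRING_INDEX calls extract the value-th name from the comma-delimited list, so values 3 through the group’s size yield the 3rd largest down to the smallest country per continent.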
More memory, more CPU
The original query limited the number of resulting rows (only 5 per group). There’s no such limitation now: there could be 1,000 revisions per file, there could be more. So this latest query could incur more overhead. Back to the original problem: if I do the purging frequently enough, the overhead shouldn’t be an issue.
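As a closing illustration, here is how the purge itself might look against the hypothetical file_revisions table sketched earlier. Wrapping the selection in a derived table forces it to be materialized first, which is what lets MySQL delete from the same table it reads from. Again, this is a sketch under the stated assumptions, not production code:

DELETE file_revisions
FROM file_revisions
JOIN (
  SELECT
    CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(ids, ',', value), ',', -1)
         AS UNSIGNED) AS purge_id
  FROM (
    SELECT
      COUNT(*) AS num_revisions,
      GROUP_CONCAT(id ORDER BY id DESC) AS ids   -- newest revision first
    FROM file_revisions
    GROUP BY file_name
  ) AS grouped
  JOIN tinyint_asc
    ON tinyint_asc.value >= 3                        -- keep the 2 newest revisions
    AND tinyint_asc.value <= grouped.num_revisions   -- 3rd newest through oldest
) AS purgeable
ON file_revisions.id = purgeable.purge_id
;

Note that tinyint_asc only goes up to 255, so files with more revisions than that would need a larger numbers table; and, as mentioned above, group_concat_max_len caps the length of the concatenated id list.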
Right. Now to the task of revisiting old, forgotten code 🙂