A few days ago I published SQL: selecting top N records per group. It just dawned on me that the very same query solves another type of problem I was facing a couple of years ago.
BTW, for reference, Baron Schwartz posted this 4 years ago. There are very interesting approaches both in the text and in the comments. Good to see the ancients already knew of such problems; I should study my history better.
(Kidding, kidding! This is self-criticism, of course)
In the case I present now, I have a table with recurring data which, to some extent, represents revisions of data. For the sake of simplicity, let's describe it as a simple simulation of a revision system:
- I have text files, whose content I store within a row
- Each text file uses at least one row in the table
- Text files can be edited; an edited file is written as a new row (rows are never UPDATEd).
Hold your horses: I’m not really implementing a revision system; I just can’t get into the actual details here.
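Still, to make things concrete, here’s a minimal sketch of such a table. The name file_revisions and all of its columns are hypothetical, invented for illustration only:

CREATE TABLE file_revisions (
  id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  file_name VARCHAR(255) NOT NULL,    -- identifies the text file
  content MEDIUMTEXT NOT NULL,        -- full content of this revision
  KEY name_id_idx (file_name, id)     -- supports per-file grouping & ordering
);

Every edit of a file INSERTs a new row; the newest revision of a file is simply the row with the largest id for that file_name.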
The table becomes large quickly, and it’s desired to purge rows from the table. The rule is: we must retain the two most recent versions of each file (if more than one exists); all older revisions can be purged. So the question is:
Which are all the 3rd newest, 4th newest, …, older, …, oldest revisions per file?
To use the example from the aforementioned post:
Which are the 3rd largest, 4th largest, …, smallest countries in each continent?
Assuming group_concat_max_len is large enough and the numbers table covers enough values, all we need to do is change this condition:
WHERE tinyint_asc.value >= 1 AND tinyint_asc.value <= 5
into:
WHERE tinyint_asc.value >= 3
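To make that concrete, here is a sketch of what the full query might look like with the modified condition, following the GROUP_CONCAT/SUBSTRING_INDEX technique from the original post. It assumes the world sample database and a tinyint_asc numbers table holding the values 0 through 255:

SELECT
  Continent,
  SUBSTRING_INDEX(SUBSTRING_INDEX(names, ',', value), ',', -1) AS Name
FROM (
  SELECT
    Continent,
    COUNT(*) AS num_countries,
    GROUP_CONCAT(Name ORDER BY Population DESC) AS names
  FROM world.Country
  GROUP BY Continent
) AS grouped
JOIN tinyint_asc
  ON tinyint_asc.value >= 3                       -- skip the 2 largest per continent
  AND tinyint_asc.value <= grouped.num_countries  -- don't run past the group's size
;

The nested SUBSTRING_INDEX calls extract the value-th name from the comma-delimited list, so values 3 through the group’s size yield the 3rd largest down to the smallest country per continent.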
More memory, more CPU
The original query limited the number of resulting rows (only 5 per group). There’s no such limitation now: there could be 1,000 revisions per file, there could be more. So this latest query could incur more overhead. Back to the original problem: if I do the purging frequently enough, the overhead shouldn’t be an issue.
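As a closing illustration, here is how the purge itself might look against the hypothetical file_revisions table sketched earlier. Wrapping the selection in a derived table forces it to be materialized first, which is what lets MySQL delete from the same table it reads from. Again, this is a sketch under the stated assumptions, not production code:

DELETE file_revisions
FROM file_revisions
JOIN (
  SELECT
    CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(ids, ',', value), ',', -1)
         AS UNSIGNED) AS purge_id
  FROM (
    SELECT
      COUNT(*) AS num_revisions,
      GROUP_CONCAT(id ORDER BY id DESC) AS ids   -- newest revision first
    FROM file_revisions
    GROUP BY file_name
  ) AS grouped
  JOIN tinyint_asc
    ON tinyint_asc.value >= 3                        -- keep the 2 newest revisions
    AND tinyint_asc.value <= grouped.num_revisions   -- 3rd newest through oldest
) AS purgeable
ON file_revisions.id = purgeable.purge_id
;

Note that tinyint_asc only goes up to 255, so files with more revisions than that would need a larger numbers table; and, as mentioned above, group_concat_max_len caps the length of the concatenated id list.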
Right. Now to the task of revisiting old, forgotten code 🙂