A while back I presented(*) an SQL trick to present with non-aggregated column on a GROUP BY query, without use of subquery or derived tables.
Based on a similar concept, combined with string walking, I now present a query which selects top-n records for each group, ordered by some condition. It will require no subqueries. It executes faster than its more conventional alternatives.
[UPDATE: this is MySQL only. Others can use Window Functions where available]
Using the simple world database, we answer the following question:
What are the top 5 largest (by area) countries for each continent? What are their names, surface area and population?
Similar questions would be:
What were the latest 5 films rented by each customer?
What were the most presented advertisements for each user?
etc.
Step 1: getting the top
We already know how to get a single column's value for the top country, as presented in the aforementioned post:
SELECT Continent, SUBSTRING_INDEX( GROUP_CONCAT(Name ORDER BY SurfaceArea DESC), ',', 1) AS Name FROM Country GROUP BY Continent ; +---------------+--------------------+ | Continent | Name | +---------------+--------------------+ | Asia | China | | Europe | Russian Federation | | North America | Canada | | Africa | Sudan | | Oceania | Australia | | Antarctica | Antarctica | | South America | Brazil | +---------------+--------------------+
Step 2: adding columns
This part is easy: just throw in the rest of the columns (again, only indicating the top country in each continent)
SELECT Continent, SUBSTRING_INDEX( GROUP_CONCAT(Name ORDER BY SurfaceArea DESC), ',', 1) AS Name, SUBSTRING_INDEX( GROUP_CONCAT(SurfaceArea ORDER BY SurfaceArea DESC), ',', 1) AS SurfaceArea, SUBSTRING_INDEX( GROUP_CONCAT(Population ORDER BY SurfaceArea DESC), ',', 1) AS Population FROM Country GROUP BY Continent ; +---------------+--------------------+-------------+------------+ | Continent | Name | SurfaceArea | Population | +---------------+--------------------+-------------+------------+ | Asia | China | 9572900.00 | 1277558000 | | Europe | Russian Federation | 17075400.00 | 146934000 | | North America | Canada | 9970610.00 | 31147000 | | Africa | Sudan | 2505813.00 | 29490000 | | Oceania | Australia | 7741220.00 | 18886000 | | Antarctica | Antarctica | 13120000.00 | 0 | | South America | Brazil | 8547403.00 | 170115000 | +---------------+--------------------+-------------+------------+
Step 3: casting
You'll notice that the Population column from this last execution is aligned to the left. This is because it is believed to be a string. The GROUP_CONCAT clause concatenates values in one string, and SUBSTRING_INDEX parses a substring. The same applies to the SurfaceArea column. We'll cast Population as UNSIGNED and SurfaceArea as DECIMAL:
SELECT
Continent,
SUBSTRING_INDEX(
GROUP_CONCAT(Name ORDER BY SurfaceArea DESC),
',', 1) AS Name,
CAST(
SUBSTRING_INDEX(
GROUP_CONCAT(SurfaceArea ORDER BY SurfaceArea DESC),
',', 1)
AS DECIMAL(20,2)
) AS SurfaceArea,
CAST(
SUBSTRING_INDEX(
GROUP_CONCAT(Population ORDER BY SurfaceArea DESC),
',', 1)
AS UNSIGNED
) AS Population
FROM
Country
GROUP BY
Continent
;
+---------------+--------------------+-------------+------------+
| Continent | Name | SurfaceArea | Population |
+---------------+--------------------+-------------+------------+
| Asia | China | 9572900.00 | 1277558000 |
| Europe | Russian Federation | 17075400.00 | 146934000 |
| North America | Canada | 9970610.00 | 31147000 |
| Africa | Sudan | 2505813.00 | 29490000 |
| Oceania | Australia | 7741220.00 | 18886000 |
| Antarctica | Antarctica | 13120000.00 | 0 |
| South America | Brazil | 8547403.00 | 170115000 |
+---------------+--------------------+-------------+------------+
Step 4: top n records
It's time to use string walking. Examples for string walking (described in the excellent SQL Cookbook) can be found here, here and here. We'll be using a numbers table: a simple table which lists ascending integer numbers. For example, you can use the following:
DROP TABLE IF EXISTS `tinyint_asc`; CREATE TABLE `tinyint_asc` ( `value` tinyint(3) unsigned NOT NULL default '0', PRIMARY KEY (value) ) ; INSERT INTO `tinyint_asc` VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),(91),(92),(93),(94),(95),(96),(97),(98),(99),(100),(101),(102),(103),(104),(105),(106),(107),(108),(109),(110),(111),(112),(113),(114),(115),(116),(117),(118),(119),(120),(121),(122),(123),(124),(125),(126),(127),(128),(129),(130),(131),(132),(133),(134),(135),(136),(137),(138),(139),(140),(141),(142),(143),(144),(145),(146),(147),(148),(149),(150),(151),(152),(153),(154),(155),(156),(157),(158),(159),(160),(161),(162),(163),(164),(165),(166),(167),(168),(169),(170),(171),(172),(173),(174),(175),(176),(177),(178),(179),(180),(181),(182),(183),(184),(185),(186),(187),(188),(189),(190),(191),(192),(193),(194),(195),(196),(197),(198),(199),(200),(201),(202),(203),(204),(205),(206),(207),(208),(209),(210),(211),(212),(213),(214),(215),(216),(217),(218),(219),(220),(221),(222),(223),(224),(225),(226),(227),(228),(229),(230),(231),(232),(233),(234),(235),(236),(237),(238),(239),(240),(241),(242),(243),(244),(245),(246),(247),(248),(249),(250),(251),(252),(253),(254),(255);
The trick is to apply the same technique as used above, not for a single row, but for several rows. Here's how to present the top 5 countries:
SELECT
Continent,
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(Name ORDER BY SurfaceArea DESC),
',', value),
',', -1)
AS Name,
CAST(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(SurfaceArea ORDER BY SurfaceArea DESC),
',', value),
',', -1)
AS DECIMAL(20,2)
) AS SurfaceArea,
CAST(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(Population ORDER BY SurfaceArea DESC),
',', value),
',', -1)
AS UNSIGNED
) AS Population
FROM
Country, tinyint_asc
WHERE
tinyint_asc.value >= 1 AND tinyint_asc.value <= 5
GROUP BY
Continent, value
;
+---------------+----------------------------------------------+-------------+------------+
| Continent | Name | SurfaceArea | Population |
+---------------+----------------------------------------------+-------------+------------+
| Asia | China | 9572900.00 | 1277558000 |
| Asia | India | 3287263.00 | 1013662000 |
| Asia | Kazakstan | 2724900.00 | 16223000 |
| Asia | Saudi Arabia | 2149690.00 | 21607000 |
| Asia | Indonesia | 1904569.00 | 212107000 |
| Europe | Russian Federation | 17075400.00 | 146934000 |
| Europe | Ukraine | 603700.00 | 50456000 |
| Europe | France | 551500.00 | 59225700 |
| Europe | Spain | 505992.00 | 39441700 |
| Europe | Sweden | 449964.00 | 8861400 |
| North America | Canada | 9970610.00 | 31147000 |
| North America | United States | 9363520.00 | 278357000 |
| North America | Greenland | 2166090.00 | 56000 |
| North America | Mexico | 1958201.00 | 98881000 |
| North America | Nicaragua | 130000.00 | 5074000 |
| Africa | Sudan | 2505813.00 | 29490000 |
| Africa | Algeria | 2381741.00 | 31471000 |
| Africa | Congo | 2344858.00 | 51654000 |
| Africa | The Democratic Republic of the | 1759540.00 | 5605000 |
| Africa | Libyan Arab Jamahiriya | 1284000.00 | 7651000 |
| Oceania | Australia | 7741220.00 | 18886000 |
| Oceania | Papua New Guinea | 462840.00 | 4807000 |
| Oceania | New Zealand | 270534.00 | 3862000 |
| Oceania | Solomon Islands | 28896.00 | 444000 |
| Oceania | New Caledonia | 18575.00 | 214000 |
| Antarctica | Antarctica | 13120000.00 | 0 |
| Antarctica | French Southern territories | 7780.00 | 0 |
| Antarctica | South Georgia and the South Sandwich Islands | 3903.00 | 0 |
| Antarctica | Heard Island and McDonald Islands | 359.00 | 0 |
| Antarctica | Bouvet Island | 59.00 | 0 |
| South America | Brazil | 8547403.00 | 170115000 |
| South America | Argentina | 2780400.00 | 37032000 |
| South America | Peru | 1285216.00 | 25662000 |
| South America | Colombia | 1138914.00 | 42321000 |
| South America | Bolivia | 1098581.00 | 8329000 |
+---------------+----------------------------------------------+-------------+------------+
Limitations
You should have:
- Enough numbers in the numbers table (I've used 5 out of 255)
- Reasonable setting for group_concat_max_len (see this post). Actually it would be better to have a smaller value here, while you make sure it's large enough; this way you do not waste memory for large groups.
(*) This was two years ago! I'm getting old
Update: see also
Another hack at same problem: SQL: selecting top N records per group, another solution

There is no such country "The Democratic Republic of the" in Africa. It seems the full name of Congo is being displayed as second nation. There is another way of calculating top N records. Here...
http://oksoft.blogspot.com/2011/01/top-n-records-per-group.html
Link | January 6th, 2011 at 1:55 pm
Of course there's also a standard way of doing this (supported by Pg, Oracle, DB2 etc.) which is using window functions ...
Link | January 6th, 2011 at 2:45 pm
@Pabloj,
Quite so...
GROUP_CONCAT "solves" the missing features of window functions.
Link | January 6th, 2011 at 3:14 pm
@shantanu,
Your solution involves a table creation followed by INSERTs.
In fact, it involves copying the entire table. So this is a serious consideration IMHO.
Link | January 6th, 2011 at 3:16 pm
That means the total cost is SLOW
Link | January 6th, 2011 at 3:30 pm
@SQL
As I said, I happened to see it executes faster than the subquery alternatives.
Link | January 6th, 2011 at 4:11 pm
Love it. Brilliant.
Really good post.
Link | January 6th, 2011 at 6:34 pm
great post for group_concat and string walking
Link | January 7th, 2011 at 4:10 am
where is the world database ?
Link | January 7th, 2011 at 4:54 am
@huarong,
Seems like it has been moved. I've updated the link.
Link | January 7th, 2011 at 6:21 am
This is more like how I would do it. It could even be done without a join by adding a column to the existing table and filling it as shown above.
Link | January 7th, 2011 at 7:59 am
@dtecmeister,
Nice trick. Allow me to point you to two problems, though:
1. You must have write permissions on that table in order to provide a SELECT query
2. You must constantly and recurringly execute the query as data changes within the table. Whenever anyone updates/inserts, everything must be re-calculated.
@shantanu, the above applies to your solution as well.
Link | January 7th, 2011 at 9:16 am
Good solution! I would add in the table `tinyint_asc`
the column `value` as the primary key.
Link | January 7th, 2011 at 12:11 pm
@William,
Right you are. I've fixed the code
Link | January 7th, 2011 at 3:22 pm
Hi
Could You add "explain" result to the last query ?
Link | January 8th, 2011 at 3:54 pm
Link | January 8th, 2011 at 7:28 pm
Umm, the above is without any keys...
Link | January 8th, 2011 at 7:36 pm
Nice combo in the "extra" column.
Link | January 9th, 2011 at 2:44 am
We could do a contest: who gets as many "Extra" comments in a single query
There's no avoiding the "Using join buffer"m as we're essentially joining the tinyint_asc table to Country without an index.
Here's yet another possible execution plan for the above query.
This may soften the previous impression.
We only iterate the Country table once. For each row we join 4 rows from tinyint_asc using join buffer.
Since we're grouping by columns from two distinct tables, we can't utilize an index to work out the GROUP + ORDER BY.
Link | January 9th, 2011 at 8:47 am
So, the most reasonable solution is to save the query result in the cache (secondary table, memcached, file, etc). I'll check that last query. Maybe it will give me less than 80k+ Rows_examined (as mysql+percona log patch says in the slow query log).
Link | January 9th, 2011 at 6:45 pm
@januzi,
I'm not sure what it is you're trying to avoid or what your situation is.
The query *requests* that you scan the entire table. It says: "for all rows in the table, group them and sort them within the groups".
It's true that eventually we then only use top n rows per group. If your groups are small, this makes little difference. If your groups are large, it does.
Link | January 10th, 2011 at 6:03 am
Hi
Useful idea.
One minor suggestion would be to use a table with the numbers 0 to 9 which can then be repeatedly cross joined against itself to produce as big an integer as required.
Maybe a touch less efficient but more flexible and can then easily be used on several different queries.
Link | December 15th, 2011 at 4:40 pm
@Keith,
Yes, that's possible; there's a balance one would want to maintain. It's easy to store 1000 rows in a table, and is faster (verify) than doing 3 joins of 10 rows. Storing 100000 is not as efficient, and joins are required.
Link | December 15th, 2011 at 6:55 pm
MySQL/QueryScript use case: DELETE all but top N records per group | code.openark.org wrote:
[...] SQL: selecting top N records per group [...]
Link | February 9th, 2012 at 10:33 am
this is just what i needed - i suggest using char(1) rather the ',' tho
Link | May 14th, 2013 at 11:19 am
@Erich,
Yes, the use of commas assume no commas in values; I also sometimes use char(0) or char(9) which nobody uses within texts
Link | May 16th, 2013 at 7:14 am