A while back I presented(*) an SQL trick to present with non-aggregated column on a GROUP BY query, without use of subquery or derived tables.
Based on a similar concept, combined with string walking, I now present a query which selects top-n records for each group, ordered by some condition. It will require no subqueries. It executes faster than its more conventional alternatives.
[UPDATE: this is MySQL only. Others can use Window Functions where available]
Using the simple world database, we answer the following question:
What are the top 5 largest (by area) countries for each continent? What are their names, surface area and population?
Similar questions would be:
What were the latest 5 films rented by each customer?
What were the most presented advertisements for each user?
etc.
Step 1: getting the top
We already know how to get a single column’s value for the top country, as presented in the aforementioned post:
SELECT Continent, SUBSTRING_INDEX( GROUP_CONCAT(Name ORDER BY SurfaceArea DESC), ',', 1) AS Name FROM Country GROUP BY Continent ; +---------------+--------------------+ | Continent | Name | +---------------+--------------------+ | Asia | China | | Europe | Russian Federation | | North America | Canada | | Africa | Sudan | | Oceania | Australia | | Antarctica | Antarctica | | South America | Brazil | +---------------+--------------------+
Step 2: adding columns
This part is easy: just throw in the rest of the columns (again, only indicating the top country in each continent)
SELECT Continent, SUBSTRING_INDEX( GROUP_CONCAT(Name ORDER BY SurfaceArea DESC), ',', 1) AS Name, SUBSTRING_INDEX( GROUP_CONCAT(SurfaceArea ORDER BY SurfaceArea DESC), ',', 1) AS SurfaceArea, SUBSTRING_INDEX( GROUP_CONCAT(Population ORDER BY SurfaceArea DESC), ',', 1) AS Population FROM Country GROUP BY Continent ; +---------------+--------------------+-------------+------------+ | Continent | Name | SurfaceArea | Population | +---------------+--------------------+-------------+------------+ | Asia | China | 9572900.00 | 1277558000 | | Europe | Russian Federation | 17075400.00 | 146934000 | | North America | Canada | 9970610.00 | 31147000 | | Africa | Sudan | 2505813.00 | 29490000 | | Oceania | Australia | 7741220.00 | 18886000 | | Antarctica | Antarctica | 13120000.00 | 0 | | South America | Brazil | 8547403.00 | 170115000 | +---------------+--------------------+-------------+------------+
Step 3: casting
You’ll notice that the Population column from this last execution is aligned to the left. This is because it is believed to be a string. The GROUP_CONCAT clause concatenates values in one string, and SUBSTRING_INDEX parses a substring. The same applies to the SurfaceArea column. We’ll cast Population as UNSIGNED and SurfaceArea as DECIMAL:
SELECT Continent, SUBSTRING_INDEX( GROUP_CONCAT(Name ORDER BY SurfaceArea DESC), ',', 1) AS Name, CAST( SUBSTRING_INDEX( GROUP_CONCAT(SurfaceArea ORDER BY SurfaceArea DESC), ',', 1) AS DECIMAL(20,2) ) AS SurfaceArea, CAST( SUBSTRING_INDEX( GROUP_CONCAT(Population ORDER BY SurfaceArea DESC), ',', 1) AS UNSIGNED ) AS Population FROM Country GROUP BY Continent ; +---------------+--------------------+-------------+------------+ | Continent | Name | SurfaceArea | Population | +---------------+--------------------+-------------+------------+ | Asia | China | 9572900.00 | 1277558000 | | Europe | Russian Federation | 17075400.00 | 146934000 | | North America | Canada | 9970610.00 | 31147000 | | Africa | Sudan | 2505813.00 | 29490000 | | Oceania | Australia | 7741220.00 | 18886000 | | Antarctica | Antarctica | 13120000.00 | 0 | | South America | Brazil | 8547403.00 | 170115000 | +---------------+--------------------+-------------+------------+
Step 4: top n records
It’s time to use string walking. Examples for string walking (described in the excellent SQL Cookbook) can be found here, here and here. We’ll be using a numbers table: a simple table which lists ascending integer numbers. For example, you can use the following:
DROP TABLE IF EXISTS `tinyint_asc`; CREATE TABLE `tinyint_asc` ( `value` tinyint(3) unsigned NOT NULL default '0', PRIMARY KEY (value) ) ; INSERT INTO `tinyint_asc` VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),(91),(92),(93),(94),(95),(96),(97),(98),(99),(100),(101),(102),(103),(104),(105),(106),(107),(108),(109),(110),(111),(112),(113),(114),(115),(116),(117),(118),(119),(120),(121),(122),(123),(124),(125),(126),(127),(128),(129),(130),(131),(132),(133),(134),(135),(136),(137),(138),(139),(140),(141),(142),(143),(144),(145),(146),(147),(148),(149),(150),(151),(152),(153),(154),(155),(156),(157),(158),(159),(160),(161),(162),(163),(164),(165),(166),(167),(168),(169),(170),(171),(172),(173),(174),(175),(176),(177),(178),(179),(180),(181),(182),(183),(184),(185),(186),(187),(188),(189),(190),(191),(192),(193),(194),(195),(196),(197),(198),(199),(200),(201),(202),(203),(204),(205),(206),(207),(208),(209),(210),(211),(212),(213),(214),(215),(216),(217),(218),(219),(220),(221),(222),(223),(224),(225),(226),(227),(228),(229),(230),(231),(232),(233),(234),(235),(236),(237),(238),(239),(240),(241),(242),(243),(244),(245),(246),(247),(248),(249),(250),(251),(252),(253),(254),(255);
The trick is to apply the same technique as used above, not for a single row, but for several rows. Here’s how to present the top 5 countries:
SELECT Continent, SUBSTRING_INDEX( SUBSTRING_INDEX( GROUP_CONCAT(Name ORDER BY SurfaceArea DESC), ',', value), ',', -1) AS Name, CAST( SUBSTRING_INDEX( SUBSTRING_INDEX( GROUP_CONCAT(SurfaceArea ORDER BY SurfaceArea DESC), ',', value), ',', -1) AS DECIMAL(20,2) ) AS SurfaceArea, CAST( SUBSTRING_INDEX( SUBSTRING_INDEX( GROUP_CONCAT(Population ORDER BY SurfaceArea DESC), ',', value), ',', -1) AS UNSIGNED ) AS Population FROM Country, tinyint_asc WHERE tinyint_asc.value >= 1 AND tinyint_asc.value <= 5 GROUP BY Continent, value ; +---------------+----------------------------------------------+-------------+------------+ | Continent | Name | SurfaceArea | Population | +---------------+----------------------------------------------+-------------+------------+ | Asia | China | 9572900.00 | 1277558000 | | Asia | India | 3287263.00 | 1013662000 | | Asia | Kazakstan | 2724900.00 | 16223000 | | Asia | Saudi Arabia | 2149690.00 | 21607000 | | Asia | Indonesia | 1904569.00 | 212107000 | | Europe | Russian Federation | 17075400.00 | 146934000 | | Europe | Ukraine | 603700.00 | 50456000 | | Europe | France | 551500.00 | 59225700 | | Europe | Spain | 505992.00 | 39441700 | | Europe | Sweden | 449964.00 | 8861400 | | North America | Canada | 9970610.00 | 31147000 | | North America | United States | 9363520.00 | 278357000 | | North America | Greenland | 2166090.00 | 56000 | | North America | Mexico | 1958201.00 | 98881000 | | North America | Nicaragua | 130000.00 | 5074000 | | Africa | Sudan | 2505813.00 | 29490000 | | Africa | Algeria | 2381741.00 | 31471000 | | Africa | Congo | 2344858.00 | 51654000 | | Africa | The Democratic Republic of the | 1759540.00 | 5605000 | | Africa | Libyan Arab Jamahiriya | 1284000.00 | 7651000 | | Oceania | Australia | 7741220.00 | 18886000 | | Oceania | Papua New Guinea | 462840.00 | 4807000 | | Oceania | New Zealand | 270534.00 | 3862000 | | Oceania | Solomon Islands | 28896.00 | 444000 | | Oceania | New Caledonia | 18575.00 | 214000 | | Antarctica | Antarctica | 13120000.00 | 0 | | Antarctica | French Southern territories | 7780.00 | 0 | | Antarctica | South Georgia and the South Sandwich Islands | 3903.00 | 0 | | Antarctica | Heard Island and McDonald Islands | 359.00 | 0 | | Antarctica | Bouvet Island | 59.00 | 0 | | South America | Brazil | 8547403.00 | 170115000 | | South America | Argentina | 2780400.00 | 37032000 | | South America | Peru | 1285216.00 | 25662000 | | South America | Colombia | 1138914.00 | 42321000 | | South America | Bolivia | 1098581.00 | 8329000 | +---------------+----------------------------------------------+-------------+------------+
Limitations
You should have:
- Enough numbers in the numbers table (I’ve used 5 out of 255)
- Reasonable setting for group_concat_max_len (see this post). Actually it would be better to have a smaller value here, while you make sure it’s large enough; this way you do not waste memory for large groups.
(*) This was two years ago! I’m getting old
Update: see also
Another hack at same problem: SQL: selecting top N records per group, another solution
@januzi,
I’m not sure what it is you’re trying to avoid or what your situation is.
The query *requests* that you scan the entire table. It says: “for all rows in the table, group them and sort them within the groups”.
It’s true that eventually we then only use top n rows per group. If your groups are small, this makes little difference. If your groups are large, it does.
Hi
Useful idea.
One minor suggestion would be to use a table with the numbers 0 to 9 which can then be repeatedly cross joined against itself to produce as big an integer as required.
Maybe a touch less efficient but more flexible and can then easily be used on several different queries.
@Keith,
Yes, that’s possible; there’s a balance one would want to maintain. It’s easy to store 1000 rows in a table, and is faster (verify) than doing 3 joins of 10 rows. Storing 100000 is not as efficient, and joins are required.
this is just what i needed – i suggest using char(1) rather the ‘,’ tho