A while back I presented SQL: selecting top N records per group, a “give me the top 5 countries in each continent” type of query, and which used an external numbers table and a lot of tedious casting.
Here’s another solution I came up with (*). Still using GROUP_CONCAT (how else?), but no external table and no casting. The query outputs the largest 5 countries (by surface area) per continent.
SELECT Continent, Name, SurfaceArea, Population FROM world.Country, ( SELECT GROUP_CONCAT(top_codes_per_group) AS top_codes FROM ( SELECT SUBSTRING_INDEX(GROUP_CONCAT(Code ORDER BY SurfaceArea DESC), ',', 5) AS top_codes_per_group FROM world.Country GROUP BY Continent ) s_top_codes_per_group ) s_top_codes WHERE FIND_IN_SET(Code, top_codes) ORDER BY Continent, SurfaceArea DESC ; +---------------+----------------------------------------------+-------------+------------+ | Continent | Name | SurfaceArea | Population | +---------------+----------------------------------------------+-------------+------------+ | Asia | China | 9572900.00 | 1277558000 | | Asia | India | 3287263.00 | 1013662000 | | Asia | Kazakstan | 2724900.00 | 16223000 | | Asia | Saudi Arabia | 2149690.00 | 21607000 | | Asia | Indonesia | 1904569.00 | 212107000 | | Europe | Russian Federation | 17075400.00 | 146934000 | | Europe | Ukraine | 603700.00 | 50456000 | | Europe | France | 551500.00 | 59225700 | | Europe | Spain | 505992.00 | 39441700 | | Europe | Sweden | 449964.00 | 8861400 | | North America | Canada | 9970610.00 | 31147000 | | North America | United States | 9363520.00 | 278357000 | | North America | Greenland | 2166090.00 | 56000 | | North America | Mexico | 1958201.00 | 98881000 | | North America | Nicaragua | 130000.00 | 5074000 | | Africa | Sudan | 2505813.00 | 29490000 | | Africa | Algeria | 2381741.00 | 31471000 | | Africa | Congo, The Democratic Republic of the | 2344858.00 | 51654000 | | Africa | Libyan Arab Jamahiriya | 1759540.00 | 5605000 | | Africa | Chad | 1284000.00 | 7651000 | | Oceania | Australia | 7741220.00 | 18886000 | | Oceania | Papua New Guinea | 462840.00 | 4807000 | | Oceania | New Zealand | 270534.00 | 3862000 | | Oceania | Solomon Islands | 28896.00 | 444000 | | Oceania | New Caledonia | 18575.00 | 214000 | | Antarctica | Antarctica | 13120000.00 | 0 | | Antarctica | French Southern territories | 7780.00 | 0 | | Antarctica | South Georgia and the South Sandwich Islands | 3903.00 | 0 | | Antarctica | Heard Island and McDonald Islands | 359.00 | 0 | | Antarctica | Bouvet Island | 59.00 | 0 | | South America | Brazil | 8547403.00 | 170115000 | | South America | Argentina | 2780400.00 | 37032000 | | South America | Peru | 1285216.00 | 25662000 | | South America | Colombia | 1138914.00 | 42321000 | | South America | Bolivia | 1098581.00 | 8329000 | +---------------+----------------------------------------------+-------------+------------+
In bold are the conditions by which we nominate our selected rows (condition is SurfaceArea DESC, number of rows is 5, so 5 largest countries).
What’s going on here?
So the inner s_top_codes_per_group query produces the codes for largest countries per continent:
+---------------------+ | top_codes_per_group | +---------------------+ | CHN,IND,KAZ,SAU,IDN | | RUS,UKR,FRA,ESP,SWE | | CAN,USA,GRL,MEX,NIC | | SDN,DZA,COD,LBY,TCD | | AUS,PNG,NZL,SLB,NCL | | ATA,ATF,SGS,HMD,BVT | | BRA,ARG,PER,COL,BOL | +---------------------+
The wrapping s_top_codes query concatenates all the above to one long text:
+---------------------------------------------------------------------------------------------------------------------------------------------+ | top_codes | +---------------------------------------------------------------------------------------------------------------------------------------------+ | CHN,IND,KAZ,SAU,IDN,RUS,UKR,FRA,ESP,SWE,CAN,USA,GRL,MEX,NIC,SDN,DZA,COD,LBY,TCD,AUS,PNG,NZL,SLB,NCL,ATA,ATF,SGS,HMD,BVT,BRA,ARG,PER,COL,BOL | +---------------------------------------------------------------------------------------------------------------------------------------------+
And the final query simply demands that Code must be found within this string, by calling upon FIND_IN_SET(Code, top_codes).
Notes
- This solution works for PRIMARY KEYs or otherwise UNIQUE KEYs of all sorts (a CHAR(3) in our example, but same for integers etc.)
- And you still have to have a sufficient group_concat_max_len (see this post). You must have a large enough value to fit in the very long text you may be expecting in s_top_codes.
- Performance-wise there are full scans here, as well as string searching.
* UPDATE
I should pay closer attention. This comment had it 5 years ago.
Very nice, but the population numbers do not look right …
@Øystein,
Thanks – fixed in text. Used a playground database with modified values. After re-importing world.sql population values are now correct.
The solution may have trouble with integers…
123,234,1579,…
“157” and “579” will match that, probably incorrectly.
Other approaches are found here:
http://forums.mysql.com/read.php?20,537167,537521
and
http://www.artfulsoftware.com/queries.php
There used to be a way to do the “top N” with nested queries, but MariaDB (at least) has optimized away the “feature” it depended on.
@Rick,
No, no problem with these integers. See FIND_IN_SET.
It’s not a LIKE ‘%157%’ expression. It looks for comma delimited values.
I personally do not like queries which require me to create new tables and INSERT rows onto them. Such a solution is valid, of course, but I prefer to avoid writing. This allows me better and stricter privileges rules for users, among other issues.