A while back I presented SQL: selecting top N records per group, a “give me the top 5 countries in each continent” type of query, and which used an external numbers table and a lot of tedious casting.
Here’s another solution I came up with (*). Still using GROUP_CONCAT (how else?), but no external table and no casting. The query outputs the largest 5 countries (by surface area) per continent.
SELECT Continent, Name, SurfaceArea, Population FROM world.Country, ( SELECT GROUP_CONCAT(top_codes_per_group) AS top_codes FROM ( SELECT SUBSTRING_INDEX(GROUP_CONCAT(Code ORDER BY SurfaceArea DESC), ',', 5) AS top_codes_per_group FROM world.Country GROUP BY Continent ) s_top_codes_per_group ) s_top_codes WHERE FIND_IN_SET(Code, top_codes) ORDER BY Continent, SurfaceArea DESC ; +---------------+----------------------------------------------+-------------+------------+ | Continent | Name | SurfaceArea | Population | +---------------+----------------------------------------------+-------------+------------+ | Asia | China | 9572900.00 | 1277558000 | | Asia | India | 3287263.00 | 1013662000 | | Asia | Kazakstan | 2724900.00 | 16223000 | | Asia | Saudi Arabia | 2149690.00 | 21607000 | | Asia | Indonesia | 1904569.00 | 212107000 | | Europe | Russian Federation | 17075400.00 | 146934000 | | Europe | Ukraine | 603700.00 | 50456000 | | Europe | France | 551500.00 | 59225700 | | Europe | Spain | 505992.00 | 39441700 | | Europe | Sweden | 449964.00 | 8861400 | | North America | Canada | 9970610.00 | 31147000 | | North America | United States | 9363520.00 | 278357000 | | North America | Greenland | 2166090.00 | 56000 | | North America | Mexico | 1958201.00 | 98881000 | | North America | Nicaragua | 130000.00 | 5074000 | | Africa | Sudan | 2505813.00 | 29490000 | | Africa | Algeria | 2381741.00 | 31471000 | | Africa | Congo, The Democratic Republic of the | 2344858.00 | 51654000 | | Africa | Libyan Arab Jamahiriya | 1759540.00 | 5605000 | | Africa | Chad | 1284000.00 | 7651000 | | Oceania | Australia | 7741220.00 | 18886000 | | Oceania | Papua New Guinea | 462840.00 | 4807000 | | Oceania | New Zealand | 270534.00 | 3862000 | | Oceania | Solomon Islands | 28896.00 | 444000 | | Oceania | New Caledonia | 18575.00 | 214000 | | Antarctica | Antarctica | 13120000.00 | 0 | | Antarctica | French Southern territories | 7780.00 | 0 | | Antarctica | South Georgia and the South Sandwich Islands | 3903.00 | 0 | | Antarctica | Heard Island and McDonald Islands | 359.00 | 0 | | Antarctica | Bouvet Island | 59.00 | 0 | | South America | Brazil | 8547403.00 | 170115000 | | South America | Argentina | 2780400.00 | 37032000 | | South America | Peru | 1285216.00 | 25662000 | | South America | Colombia | 1138914.00 | 42321000 | | South America | Bolivia | 1098581.00 | 8329000 | +---------------+----------------------------------------------+-------------+------------+
In bold are the conditions by which we nominate our selected rows (condition is SurfaceArea DESC, number of rows is 5, so 5 largest countries).
What’s going on here?
So the inner s_top_codes_per_group query produces the codes for largest countries per continent:
+---------------------+ | top_codes_per_group | +---------------------+ | CHN,IND,KAZ,SAU,IDN | | RUS,UKR,FRA,ESP,SWE | | CAN,USA,GRL,MEX,NIC | | SDN,DZA,COD,LBY,TCD | | AUS,PNG,NZL,SLB,NCL | | ATA,ATF,SGS,HMD,BVT | | BRA,ARG,PER,COL,BOL | +---------------------+
The wrapping s_top_codes query concatenates all the above to one long text:
+---------------------------------------------------------------------------------------------------------------------------------------------+ | top_codes | +---------------------------------------------------------------------------------------------------------------------------------------------+ | CHN,IND,KAZ,SAU,IDN,RUS,UKR,FRA,ESP,SWE,CAN,USA,GRL,MEX,NIC,SDN,DZA,COD,LBY,TCD,AUS,PNG,NZL,SLB,NCL,ATA,ATF,SGS,HMD,BVT,BRA,ARG,PER,COL,BOL | +---------------------------------------------------------------------------------------------------------------------------------------------+
And the final query simply demands that Code must be found within this string, by calling upon FIND_IN_SET(Code, top_codes).
Notes
- This solution works for PRIMARY KEYs or otherwise UNIQUE KEYs of all sorts (a CHAR(3) in our example, but same for integers etc.)
- And you still have to have a sufficient group_concat_max_len (see this post). You must have a large enough value to fit in the very long text you may be expecting in s_top_codes.
- Performance-wise there are full scans here, as well as string searching.
* UPDATE
I should pay closer attention. This comment had it 5 years ago.
is there a way to do a cumulative sum over the window? for example show cumsum of population over each continent sorting countries by name. a more natural application is cumsum over sales rep sorted by time of sale. eg translate into MySQL this Netezza synatx:
select
sum(col1) over
(partition by col2, col3 order by col4 rows between unbounded
preceding and current row) as sum1 from table1 ;
@stephen, not really. As you know mysql does not have window functions. Mimicking the above has no immediate solution.
Well, you could write a sub select per such row. I once compiled some window functions alternatives. Need to pull it from wherever it is
The GROUP_CONCAT trick has a caveat — there is a default limit of 1024 bytes in the string. (That limit can be changed.)
Another approach is given here:
https://mariadb.com/kb/en/group-by-trick-has-been-optimized-away/
However, for MariaDB, one must do this to keep the optimizer from doing too good a job:
set @@session.optimizer_switch=’derived_merge=off,derived_with_keys=off’;
That may be the fastest “Top N” query. I added it to my short collection: http://mysql.rjweb.org/doc.php/groupwise_max
It does have a caveat that may make it not work for some use cases: `Code` must be unique across the entire table.