SQL: selecting top N records per group, another solution

A while back I presented SQL: selecting top N records per group, a “give me the top 5 countries in each continent” type of query, and which used an external numbers table and a lot of tedious casting.

Here’s another solution I came up with (*). Still using GROUP_CONCAT (how else?), but no external table and no casting. The query outputs the largest 5 countries (by surface area) per continent.

SELECT
  Continent,
  Name,
  SurfaceArea,
  Population
FROM
  world.Country,
  (
    SELECT 
      GROUP_CONCAT(top_codes_per_group) AS top_codes
    FROM
      (
        SELECT 
          SUBSTRING_INDEX(GROUP_CONCAT(Code ORDER BY SurfaceArea DESC), ',', 5) AS top_codes_per_group
        FROM
          world.Country
        GROUP BY
          Continent
      ) s_top_codes_per_group
  ) s_top_codes
WHERE
  FIND_IN_SET(Code, top_codes)
ORDER BY
  Continent,
  SurfaceArea DESC
;

+---------------+----------------------------------------------+-------------+------------+
| Continent     | Name                                         | SurfaceArea | Population |
+---------------+----------------------------------------------+-------------+------------+
| Asia          | China                                        |  9572900.00 | 1277558000 |
| Asia          | India                                        |  3287263.00 | 1013662000 |
| Asia          | Kazakstan                                    |  2724900.00 |   16223000 |
| Asia          | Saudi Arabia                                 |  2149690.00 |   21607000 |
| Asia          | Indonesia                                    |  1904569.00 |  212107000 |
| Europe        | Russian Federation                           | 17075400.00 |  146934000 |
| Europe        | Ukraine                                      |   603700.00 |   50456000 |
| Europe        | France                                       |   551500.00 |   59225700 |
| Europe        | Spain                                        |   505992.00 |   39441700 |
| Europe        | Sweden                                       |   449964.00 |    8861400 |
| North America | Canada                                       |  9970610.00 |   31147000 |
| North America | United States                                |  9363520.00 |  278357000 |
| North America | Greenland                                    |  2166090.00 |      56000 |
| North America | Mexico                                       |  1958201.00 |   98881000 |
| North America | Nicaragua                                    |   130000.00 |    5074000 |
| Africa        | Sudan                                        |  2505813.00 |   29490000 |
| Africa        | Algeria                                      |  2381741.00 |   31471000 |
| Africa        | Congo, The Democratic Republic of the        |  2344858.00 |   51654000 |
| Africa        | Libyan Arab Jamahiriya                       |  1759540.00 |    5605000 |
| Africa        | Chad                                         |  1284000.00 |    7651000 |
| Oceania       | Australia                                    |  7741220.00 |   18886000 |
| Oceania       | Papua New Guinea                             |   462840.00 |    4807000 |
| Oceania       | New Zealand                                  |   270534.00 |    3862000 |
| Oceania       | Solomon Islands                              |    28896.00 |     444000 |
| Oceania       | New Caledonia                                |    18575.00 |     214000 |
| Antarctica    | Antarctica                                   | 13120000.00 |          0 |
| Antarctica    | French Southern territories                  |     7780.00 |          0 |
| Antarctica    | South Georgia and the South Sandwich Islands |     3903.00 |          0 |
| Antarctica    | Heard Island and McDonald Islands            |      359.00 |          0 |
| Antarctica    | Bouvet Island                                |       59.00 |          0 |
| South America | Brazil                                       |  8547403.00 |  170115000 |
| South America | Argentina                                    |  2780400.00 |   37032000 |
| South America | Peru                                         |  1285216.00 |   25662000 |
| South America | Colombia                                     |  1138914.00 |   42321000 |
| South America | Bolivia                                      |  1098581.00 |    8329000 |
+---------------+----------------------------------------------+-------------+------------+

In bold are the conditions by which we nominate our selected rows (condition is SurfaceArea DESC, number of rows is 5, so 5 largest countries).

What’s going on here?

So the inner s_top_codes_per_group query produces the codes for largest countries per continent:

+---------------------+
| top_codes_per_group |
+---------------------+
| CHN,IND,KAZ,SAU,IDN |
| RUS,UKR,FRA,ESP,SWE |
| CAN,USA,GRL,MEX,NIC |
| SDN,DZA,COD,LBY,TCD |
| AUS,PNG,NZL,SLB,NCL |
| ATA,ATF,SGS,HMD,BVT |
| BRA,ARG,PER,COL,BOL |
+---------------------+

The wrapping s_top_codes query concatenates all the above to one long text:

+---------------------------------------------------------------------------------------------------------------------------------------------+
| top_codes                                                                                                                                   |
+---------------------------------------------------------------------------------------------------------------------------------------------+
| CHN,IND,KAZ,SAU,IDN,RUS,UKR,FRA,ESP,SWE,CAN,USA,GRL,MEX,NIC,SDN,DZA,COD,LBY,TCD,AUS,PNG,NZL,SLB,NCL,ATA,ATF,SGS,HMD,BVT,BRA,ARG,PER,COL,BOL |
+---------------------------------------------------------------------------------------------------------------------------------------------+

And the final query simply demands that Code must be found within this string, by calling upon FIND_IN_SET(Code, top_codes).

Notes

  • This solution works for PRIMARY KEYs or otherwise UNIQUE KEYs of all sorts (a CHAR(3) in our example, but same for integers etc.)
  • And you still have to have a sufficient group_concat_max_len (see this post). You must have a large enough value to fit in the very long text you may be expecting in s_top_codes.
  • Performance-wise there are full scans here, as well as string searching.

* UPDATE

I should pay closer attention. This comment had it 5 years ago.

10
Leave a Reply

avatar
8 Comment threads
2 Thread replies
0 Followers
 
Most reacted comment
Hottest comment thread
5 Comment authors
Rick JamesstephenRick JamesshlomiØystein Grøvlen Recent comment authors

This site uses Akismet to reduce spam. Learn how your comment data is processed.

  Subscribe  
Notify of
Øystein Grøvlen
Guest

Very nice, but the population numbers do not look right …

Rick James
Guest
Rick James

The solution may have trouble with integers…
123,234,1579,…
“157” and “579” will match that, probably incorrectly.

Other approaches are found here:
http://forums.mysql.com/read.php?20,537167,537521
and
http://www.artfulsoftware.com/queries.php

There used to be a way to do the “top N” with nested queries, but MariaDB (at least) has optimized away the “feature” it depended on.

trackback

[…] I like GROUP_CONCAT, and the many hacks it provides: [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]. But it makes for a poor substitution to Window Functions, and only solves a subset of […]

stephen
Guest
stephen

is there a way to do a cumulative sum over the window? for example show cumsum of population over each continent sorting countries by name. a more natural application is cumsum over sales rep sorted by time of sale. eg translate into MySQL this Netezza synatx:
select
sum(col1) over
(partition by col2, col3 order by col4 rows between unbounded
preceding and current row) as sum1 from table1 ;

Rick James
Guest
Rick James

The GROUP_CONCAT trick has a caveat — there is a default limit of 1024 bytes in the string. (That limit can be changed.)

Another approach is given here:
https://mariadb.com/kb/en/group-by-trick-has-been-optimized-away/
However, for MariaDB, one must do this to keep the optimizer from doing too good a job:
set @@session.optimizer_switch=’derived_merge=off,derived_with_keys=off’;

Rick James
Guest

That may be the fastest “Top N” query. I added it to my short collection: http://mysql.rjweb.org/doc.php/groupwise_max

It does have a caveat that may make it not work for some use cases: `Code` must be unique across the entire table.