SQL: selecting top N records per group, another solution

A while back I presented SQL: selecting top N records per group, a “give me the top 5 countries in each continent” type of query, and which used an external numbers table and a lot of tedious casting.

Here’s another solution I came up with (*). Still using GROUP_CONCAT (how else?), but no external table and no casting. The query outputs the largest 5 countries (by surface area) per continent.

SELECT
  Continent,
  Name,
  SurfaceArea,
  Population
FROM
  world.Country,
  (
    SELECT 
      GROUP_CONCAT(top_codes_per_group) AS top_codes
    FROM
      (
        SELECT 
          SUBSTRING_INDEX(GROUP_CONCAT(Code ORDER BY SurfaceArea DESC), ',', 5) AS top_codes_per_group
        FROM
          world.Country
        GROUP BY
          Continent
      ) s_top_codes_per_group
  ) s_top_codes
WHERE
  FIND_IN_SET(Code, top_codes)
ORDER BY
  Continent,
  SurfaceArea DESC
;

+---------------+----------------------------------------------+-------------+------------+
| Continent     | Name                                         | SurfaceArea | Population |
+---------------+----------------------------------------------+-------------+------------+
| Asia          | China                                        |  9572900.00 | 1277558000 |
| Asia          | India                                        |  3287263.00 | 1013662000 |
| Asia          | Kazakstan                                    |  2724900.00 |   16223000 |
| Asia          | Saudi Arabia                                 |  2149690.00 |   21607000 |
| Asia          | Indonesia                                    |  1904569.00 |  212107000 |
| Europe        | Russian Federation                           | 17075400.00 |  146934000 |
| Europe        | Ukraine                                      |   603700.00 |   50456000 |
| Europe        | France                                       |   551500.00 |   59225700 |
| Europe        | Spain                                        |   505992.00 |   39441700 |
| Europe        | Sweden                                       |   449964.00 |    8861400 |
| North America | Canada                                       |  9970610.00 |   31147000 |
| North America | United States                                |  9363520.00 |  278357000 |
| North America | Greenland                                    |  2166090.00 |      56000 |
| North America | Mexico                                       |  1958201.00 |   98881000 |
| North America | Nicaragua                                    |   130000.00 |    5074000 |
| Africa        | Sudan                                        |  2505813.00 |   29490000 |
| Africa        | Algeria                                      |  2381741.00 |   31471000 |
| Africa        | Congo, The Democratic Republic of the        |  2344858.00 |   51654000 |
| Africa        | Libyan Arab Jamahiriya                       |  1759540.00 |    5605000 |
| Africa        | Chad                                         |  1284000.00 |    7651000 |
| Oceania       | Australia                                    |  7741220.00 |   18886000 |
| Oceania       | Papua New Guinea                             |   462840.00 |    4807000 |
| Oceania       | New Zealand                                  |   270534.00 |    3862000 |
| Oceania       | Solomon Islands                              |    28896.00 |     444000 |
| Oceania       | New Caledonia                                |    18575.00 |     214000 |
| Antarctica    | Antarctica                                   | 13120000.00 |          0 |
| Antarctica    | French Southern territories                  |     7780.00 |          0 |
| Antarctica    | South Georgia and the South Sandwich Islands |     3903.00 |          0 |
| Antarctica    | Heard Island and McDonald Islands            |      359.00 |          0 |
| Antarctica    | Bouvet Island                                |       59.00 |          0 |
| South America | Brazil                                       |  8547403.00 |  170115000 |
| South America | Argentina                                    |  2780400.00 |   37032000 |
| South America | Peru                                         |  1285216.00 |   25662000 |
| South America | Colombia                                     |  1138914.00 |   42321000 |
| South America | Bolivia                                      |  1098581.00 |    8329000 |
+---------------+----------------------------------------------+-------------+------------+

In bold are the conditions by which we nominate our selected rows (condition is SurfaceArea DESC, number of rows is 5, so 5 largest countries).

What’s going on here?

So the inner s_top_codes_per_group query produces the codes for largest countries per continent:

+---------------------+
| top_codes_per_group |
+---------------------+
| CHN,IND,KAZ,SAU,IDN |
| RUS,UKR,FRA,ESP,SWE |
| CAN,USA,GRL,MEX,NIC |
| SDN,DZA,COD,LBY,TCD |
| AUS,PNG,NZL,SLB,NCL |
| ATA,ATF,SGS,HMD,BVT |
| BRA,ARG,PER,COL,BOL |
+---------------------+

The wrapping s_top_codes query concatenates all the above to one long text:

+---------------------------------------------------------------------------------------------------------------------------------------------+
| top_codes                                                                                                                                   |
+---------------------------------------------------------------------------------------------------------------------------------------------+
| CHN,IND,KAZ,SAU,IDN,RUS,UKR,FRA,ESP,SWE,CAN,USA,GRL,MEX,NIC,SDN,DZA,COD,LBY,TCD,AUS,PNG,NZL,SLB,NCL,ATA,ATF,SGS,HMD,BVT,BRA,ARG,PER,COL,BOL |
+---------------------------------------------------------------------------------------------------------------------------------------------+

And the final query simply demands that Code must be found within this string, by calling upon FIND_IN_SET(Code, top_codes).

Notes

This solution works for PRIMARY KEYs or otherwise UNIQUE KEYs of all sorts (a CHAR(3) in our example, but same for integers etc.)
And you still have to have a sufficient group_concat_max_len (see this post). You must have a large enough value to fit in the very long text you may be expecting in s_top_codes.
Performance-wise there are full scans here, as well as string searching.

* UPDATE

I should pay closer attention. This comment had it 5 years ago.

10 thoughts on “SQL: selecting top N records per group, another solution”

stephen says:

September 19, 2013 at 3:48 pm

is there a way to do a cumulative sum over the window? for example show cumsum of population over each continent sorting countries by name. a more natural application is cumsum over sales rep sorted by time of sale. eg translate into MySQL this Netezza synatx:
select
sum(col1) over
(partition by col2, col3 order by col4 rows between unbounded
preceding and current row) as sum1 from table1 ;
shlomi says:

September 19, 2013 at 4:26 pm

@stephen, not really. As you know mysql does not have window functions. Mimicking the above has no immediate solution.
shlomi says:

September 19, 2013 at 4:27 pm

Well, you could write a sub select per such row. I once compiled some window functions alternatives. Need to pull it from wherever it is
Rick James says:

September 19, 2013 at 11:10 pm

The GROUP_CONCAT trick has a caveat — there is a default limit of 1024 bytes in the string. (That limit can be changed.)

Another approach is given here:
https://mariadb.com/kb/en/group-by-trick-has-been-optimized-away/
However, for MariaDB, one must do this to keep the optimizer from doing too good a job:
set @@session.optimizer_switch=’derived_merge=off,derived_with_keys=off’;
Rick James says:

April 17, 2016 at 3:52 am

That may be the fastest “Top N” query. I added it to my short collection: http://mysql.rjweb.org/doc.php/groupwise_max

It does have a caveat that may make it not work for some use cases: `Code` must be unique across the entire table.