SQL: selecting top N records per group, another solution

A while back I presented SQL: selecting top N records per group, a “give me the top 5 countries in each continent” type of query, and which used an external numbers table and a lot of tedious casting.

Here’s another solution I came up with (*). Still using GROUP_CONCAT (how else?), but no external table and no casting. The query outputs the largest 5 countries (by surface area) per continent.

SELECT
  Continent,
  Name,
  SurfaceArea,
  Population
FROM
  world.Country,
  (
    SELECT 
      GROUP_CONCAT(top_codes_per_group) AS top_codes
    FROM
      (
        SELECT 
          SUBSTRING_INDEX(GROUP_CONCAT(Code ORDER BY SurfaceArea DESC), ',', 5) AS top_codes_per_group
        FROM
          world.Country
        GROUP BY
          Continent
      ) s_top_codes_per_group
  ) s_top_codes
WHERE
  FIND_IN_SET(Code, top_codes)
ORDER BY
  Continent,
  SurfaceArea DESC
;

+---------------+----------------------------------------------+-------------+------------+
| Continent     | Name                                         | SurfaceArea | Population |
+---------------+----------------------------------------------+-------------+------------+
| Asia          | China                                        |  9572900.00 | 1277558000 |
| Asia          | India                                        |  3287263.00 | 1013662000 |
| Asia          | Kazakstan                                    |  2724900.00 |   16223000 |
| Asia          | Saudi Arabia                                 |  2149690.00 |   21607000 |
| Asia          | Indonesia                                    |  1904569.00 |  212107000 |
| Europe        | Russian Federation                           | 17075400.00 |  146934000 |
| Europe        | Ukraine                                      |   603700.00 |   50456000 |
| Europe        | France                                       |   551500.00 |   59225700 |
| Europe        | Spain                                        |   505992.00 |   39441700 |
| Europe        | Sweden                                       |   449964.00 |    8861400 |
| North America | Canada                                       |  9970610.00 |   31147000 |
| North America | United States                                |  9363520.00 |  278357000 |
| North America | Greenland                                    |  2166090.00 |      56000 |
| North America | Mexico                                       |  1958201.00 |   98881000 |
| North America | Nicaragua                                    |   130000.00 |    5074000 |
| Africa        | Sudan                                        |  2505813.00 |   29490000 |
| Africa        | Algeria                                      |  2381741.00 |   31471000 |
| Africa        | Congo, The Democratic Republic of the        |  2344858.00 |   51654000 |
| Africa        | Libyan Arab Jamahiriya                       |  1759540.00 |    5605000 |
| Africa        | Chad                                         |  1284000.00 |    7651000 |
| Oceania       | Australia                                    |  7741220.00 |   18886000 |
| Oceania       | Papua New Guinea                             |   462840.00 |    4807000 |
| Oceania       | New Zealand                                  |   270534.00 |    3862000 |
| Oceania       | Solomon Islands                              |    28896.00 |     444000 |
| Oceania       | New Caledonia                                |    18575.00 |     214000 |
| Antarctica    | Antarctica                                   | 13120000.00 |          0 |
| Antarctica    | French Southern territories                  |     7780.00 |          0 |
| Antarctica    | South Georgia and the South Sandwich Islands |     3903.00 |          0 |
| Antarctica    | Heard Island and McDonald Islands            |      359.00 |          0 |
| Antarctica    | Bouvet Island                                |       59.00 |          0 |
| South America | Brazil                                       |  8547403.00 |  170115000 |
| South America | Argentina                                    |  2780400.00 |   37032000 |
| South America | Peru                                         |  1285216.00 |   25662000 |
| South America | Colombia                                     |  1138914.00 |   42321000 |
| South America | Bolivia                                      |  1098581.00 |    8329000 |
+---------------+----------------------------------------------+-------------+------------+

In bold are the conditions by which we nominate our selected rows (condition is SurfaceArea DESC, number of rows is 5, so 5 largest countries).

What’s going on here?

So the inner s_top_codes_per_group query produces the codes for largest countries per continent:

+---------------------+
| top_codes_per_group |
+---------------------+
| CHN,IND,KAZ,SAU,IDN |
| RUS,UKR,FRA,ESP,SWE |
| CAN,USA,GRL,MEX,NIC |
| SDN,DZA,COD,LBY,TCD |
| AUS,PNG,NZL,SLB,NCL |
| ATA,ATF,SGS,HMD,BVT |
| BRA,ARG,PER,COL,BOL |
+---------------------+

The wrapping s_top_codes query concatenates all the above to one long text:

+---------------------------------------------------------------------------------------------------------------------------------------------+
| top_codes                                                                                                                                   |
+---------------------------------------------------------------------------------------------------------------------------------------------+
| CHN,IND,KAZ,SAU,IDN,RUS,UKR,FRA,ESP,SWE,CAN,USA,GRL,MEX,NIC,SDN,DZA,COD,LBY,TCD,AUS,PNG,NZL,SLB,NCL,ATA,ATF,SGS,HMD,BVT,BRA,ARG,PER,COL,BOL |
+---------------------------------------------------------------------------------------------------------------------------------------------+

And the final query simply demands that Code must be found within this string, by calling upon FIND_IN_SET(Code, top_codes).

Notes

This solution works for PRIMARY KEYs or otherwise UNIQUE KEYs of all sorts (a CHAR(3) in our example, but same for integers etc.)
And you still have to have a sufficient group_concat_max_len (see this post). You must have a large enough value to fit in the very long text you may be expecting in s_top_codes.
Performance-wise there are full scans here, as well as string searching.

* UPDATE

I should pay closer attention. This comment had it 5 years ago.

10 thoughts on “SQL: selecting top N records per group, another solution”

Øystein Grøvlen says:

August 21, 2012 at 8:57 am

Very nice, but the population numbers do not look right …
shlomi says:

August 21, 2012 at 9:05 am

@Øystein,

Thanks – fixed in text. Used a playground database with modified values. After re-importing world.sql population values are now correct.
Rick James says:

August 21, 2012 at 7:35 pm

The solution may have trouble with integers…
123,234,1579,…
“157” and “579” will match that, probably incorrectly.

Other approaches are found here:
http://forums.mysql.com/read.php?20,537167,537521
and
http://www.artfulsoftware.com/queries.php

There used to be a way to do the “top N” with nested queries, but MariaDB (at least) has optimized away the “feature” it depended on.
shlomi says:

August 21, 2012 at 9:29 pm

@Rick,

No, no problem with these integers. See FIND_IN_SET.

It’s not a LIKE ‘%157%’ expression. It looks for comma delimited values.

I personally do not like queries which require me to create new tables and INSERT rows onto them. Such a solution is valid, of course, but I prefer to avoid writing. This allows me better and stricter privileges rules for users, among other issues.
Pingback: Three wishes for a new year | code.openark.org