SQL: selecting top N records per group, another solution

August 21, 2012

A while back I presented SQL: selecting top N records per group, a "give me the top 5 countries in each continent" type of query, and which used an external numbers table and a lot of tedious casting.

Here's another solution I came up with (*). Still using GROUP_CONCAT (how else?), but no external table and no casting. The query outputs the largest 5 countries (by surface area) per continent.

SELECT
  Continent,
  Name,
  SurfaceArea,
  Population
FROM
  world.Country,
  (
    SELECT 
      GROUP_CONCAT(top_codes_per_group) AS top_codes
    FROM
      (
        SELECT 
          SUBSTRING_INDEX(GROUP_CONCAT(Code ORDER BY SurfaceArea DESC), ',', 5) AS top_codes_per_group
        FROM
          world.Country
        GROUP BY
          Continent
      ) s_top_codes_per_group
  ) s_top_codes
WHERE
  FIND_IN_SET(Code, top_codes)
ORDER BY
  Continent,
  SurfaceArea DESC
;

+---------------+----------------------------------------------+-------------+------------+
| Continent     | Name                                         | SurfaceArea | Population |
+---------------+----------------------------------------------+-------------+------------+
| Asia          | China                                        |  9572900.00 | 1277558000 |
| Asia          | India                                        |  3287263.00 | 1013662000 |
| Asia          | Kazakstan                                    |  2724900.00 |   16223000 |
| Asia          | Saudi Arabia                                 |  2149690.00 |   21607000 |
| Asia          | Indonesia                                    |  1904569.00 |  212107000 |
| Europe        | Russian Federation                           | 17075400.00 |  146934000 |
| Europe        | Ukraine                                      |   603700.00 |   50456000 |
| Europe        | France                                       |   551500.00 |   59225700 |
| Europe        | Spain                                        |   505992.00 |   39441700 |
| Europe        | Sweden                                       |   449964.00 |    8861400 |
| North America | Canada                                       |  9970610.00 |   31147000 |
| North America | United States                                |  9363520.00 |  278357000 |
| North America | Greenland                                    |  2166090.00 |      56000 |
| North America | Mexico                                       |  1958201.00 |   98881000 |
| North America | Nicaragua                                    |   130000.00 |    5074000 |
| Africa        | Sudan                                        |  2505813.00 |   29490000 |
| Africa        | Algeria                                      |  2381741.00 |   31471000 |
| Africa        | Congo, The Democratic Republic of the        |  2344858.00 |   51654000 |
| Africa        | Libyan Arab Jamahiriya                       |  1759540.00 |    5605000 |
| Africa        | Chad                                         |  1284000.00 |    7651000 |
| Oceania       | Australia                                    |  7741220.00 |   18886000 |
| Oceania       | Papua New Guinea                             |   462840.00 |    4807000 |
| Oceania       | New Zealand                                  |   270534.00 |    3862000 |
| Oceania       | Solomon Islands                              |    28896.00 |     444000 |
| Oceania       | New Caledonia                                |    18575.00 |     214000 |
| Antarctica    | Antarctica                                   | 13120000.00 |          0 |
| Antarctica    | French Southern territories                  |     7780.00 |          0 |
| Antarctica    | South Georgia and the South Sandwich Islands |     3903.00 |          0 |
| Antarctica    | Heard Island and McDonald Islands            |      359.00 |          0 |
| Antarctica    | Bouvet Island                                |       59.00 |          0 |
| South America | Brazil                                       |  8547403.00 |  170115000 |
| South America | Argentina                                    |  2780400.00 |   37032000 |
| South America | Peru                                         |  1285216.00 |   25662000 |
| South America | Colombia                                     |  1138914.00 |   42321000 |
| South America | Bolivia                                      |  1098581.00 |    8329000 |
+---------------+----------------------------------------------+-------------+------------+

In bold are the conditions by which we nominate our selected rows (condition is SurfaceArea DESC, number of rows is 5, so 5 largest countries).

What's going on here?

So the inner s_top_codes_per_group query produces the codes for largest countries per continent:

+---------------------+
| top_codes_per_group |
+---------------------+
| CHN,IND,KAZ,SAU,IDN |
| RUS,UKR,FRA,ESP,SWE |
| CAN,USA,GRL,MEX,NIC |
| SDN,DZA,COD,LBY,TCD |
| AUS,PNG,NZL,SLB,NCL |
| ATA,ATF,SGS,HMD,BVT |
| BRA,ARG,PER,COL,BOL |
+---------------------+

The wrapping s_top_codes query concatenates all the above to one long text:

+---------------------------------------------------------------------------------------------------------------------------------------------+
| top_codes                                                                                                                                   |
+---------------------------------------------------------------------------------------------------------------------------------------------+
| CHN,IND,KAZ,SAU,IDN,RUS,UKR,FRA,ESP,SWE,CAN,USA,GRL,MEX,NIC,SDN,DZA,COD,LBY,TCD,AUS,PNG,NZL,SLB,NCL,ATA,ATF,SGS,HMD,BVT,BRA,ARG,PER,COL,BOL |
+---------------------------------------------------------------------------------------------------------------------------------------------+

And the final query simply demands that Code must be found within this string, by calling upon FIND_IN_SET(Code, top_codes).

Notes

  • This solution works for PRIMARY KEYs or otherwise UNIQUE KEYs of all sorts (a CHAR(3) in our example, but same for integers etc.)
  • And you still have to have a sufficient group_concat_max_len (see this post). You must have a large enough value to fit in the very long text you may be expecting in s_top_codes.
  • Performance-wise there are full scans here, as well as string searching.

* UPDATE

I should pay closer attention. This comment had it 5 years ago.

tags: ,
posted in MySQL by shlomi

« | »

Follow comments via the RSS Feed | Leave a comment | Trackback URL

9 Comments to "SQL: selecting top N records per group, another solution"

  1. Øystein Grøvlen wrote:

    Very nice, but the population numbers do not look right ...

  2. shlomi wrote:

    @Øystein,

    Thanks - fixed in text. Used a playground database with modified values. After re-importing world.sql population values are now correct.

  3. Rick James wrote:

    The solution may have trouble with integers...
    123,234,1579,...
    "157" and "579" will match that, probably incorrectly.

    Other approaches are found here:
    http://forums.mysql.com/read.php?20,537167,537521
    and
    http://www.artfulsoftware.com/queries.php

    There used to be a way to do the "top N" with nested queries, but MariaDB (at least) has optimized away the "feature" it depended on.

  4. shlomi wrote:

    @Rick,

    No, no problem with these integers. See FIND_IN_SET.

    It's not a LIKE '%157%' expression. It looks for comma delimited values.

    I personally do not like queries which require me to create new tables and INSERT rows onto them. Such a solution is valid, of course, but I prefer to avoid writing. This allows me better and stricter privileges rules for users, among other issues.

  5. Three wishes for a new year | code.openark.org wrote:

    [...] I like GROUP_CONCAT, and the many hacks it provides: [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]. But it makes for a poor substitution to Window Functions, and only solves a subset of [...]

  6. stephen wrote:

    is there a way to do a cumulative sum over the window? for example show cumsum of population over each continent sorting countries by name. a more natural application is cumsum over sales rep sorted by time of sale. eg translate into MySQL this Netezza synatx:
    select
    sum(col1) over
    (partition by col2, col3 order by col4 rows between unbounded
    preceding and current row) as sum1 from table1 ;

  7. shlomi wrote:

    @stephen, not really. As you know mysql does not have window functions. Mimicking the above has no immediate solution.

  8. shlomi wrote:

    Well, you could write a sub select per such row. I once compiled some window functions alternatives. Need to pull it from wherever it is

  9. Rick James wrote:

    The GROUP_CONCAT trick has a caveat -- there is a default limit of 1024 bytes in the string. (That limit can be changed.)

    Another approach is given here:
    https://mariadb.com/kb/en/group-by-trick-has-been-optimized-away/
    However, for MariaDB, one must do this to keep the optimizer from doing too good a job:
    set @@session.optimizer_switch='derived_merge=off,derived_with_keys=off';

Leave Your Comment

 

 
Powered by Wordpress and MySQL. Theme by openark.org