Selecting a specific non aggregated column data in GROUP BY

December 1, 2008

In a GROUP BY query, MySQL may allow specifying non aggregated columns. For example, using MySQL's world database, I wish to get the number of countries per continent, along with a "sample" country:

SELECT Continent, COUNT(*), Name
FROM `Country` GROUP BY Continent

+---------------+----------+----------------+
| Continent     | COUNT(*) | Name           |
+---------------+----------+----------------+
| Asia          |       51 | Afghanistan    |
| Europe        |       46 | Albania        |
| North America |       37 | Aruba          |
| Africa        |       58 | Angola         |
| Oceania       |       28 | American Samoa |
| Antarctica    |        5 | Antarctica     |
| South America |       14 | Argentina      |
+---------------+----------+----------------+

What if I want to choose that "sample" country? For example, for each continent, I wish to show the country with the largest population. To simply see the largest population, I would use MAX(Population). But which country is referred? I wish to provide a solution which does not involve sub-queries, HAVING or JOINs.

We begin by observing MySQL's GROUP_CONCAT() fcuntion:

SELECT Continent, COUNT(*), GROUP_CONCAT(Name)
FROM `Country` GROUP BY Continent

+---------------+----------+-------------------------------+
| Continent     | COUNT(*) | GROUP_CONCAT(Name)            |
+---------------+----------+-------------------------------+
| Asia          |       51 | Sri Lanka,South Korea,Mald... |
| Europe        |       46 | Iceland,Faroe Islands,Fran... |
| North America |       37 | Martinique,Mexico,Puerto R... |
| Africa        |       58 | Zambia,South Africa,Nigeri... |
| Oceania       |       28 | Tokelau,American Samoa,Sam... |
| Antarctica    |        5 | South Georgia and the Sout... |
| South America |       14 | Suriname,Argentina,Ecuador... |
+---------------+----------+-------------------------------+

Mmmm. The order by which the names are presented does not seem to be of use to me. Didn't we say we want to see the country with the largest population? It just so happens that GROUP_CONCAT allows for ORDER BY:

SELECT Continent, COUNT(*),
  GROUP_CONCAT(Name ORDER BY Population DESC)
FROM `Country` GROUP BY Continent

+---------------+----------+-------------------------------------------------------+
| Continent     | COUNT(*) | GROUP_CONCAT(Name ORDER BY Population DESC)           |
+---------------+----------+-------------------------------------------------------+
| Asia          |       51 | China,India,Indonesia,Pakistan,Bangladesh,Japan,Vi... |
| Europe        |       46 | Russian Federation,Germany,United Kingdom,France,I... |
| North America |       37 | United States,Mexico,Canada,Guatemala,Cuba,Dominic... |
| Africa        |       58 | Nigeria,Egypt,Ethiopia,Congo, The Democratic Repub... |
| Oceania       |       28 | Australia,Papua New Guinea,New Zealand,Fiji Island... |
| Antarctica    |        5 | Heard Island and McDonald Islands,Antarctica,Bouve... |
| South America |       14 | Brazil,Colombia,Argentina,Peru,Venezuela,Chile,Ecu... |
+---------------+----------+-------------------------------------------------------+

Good progress! We can now see our desired countries first in list. What's next? We want to isolate the first country for each continent. SUBSTRING_INDEX() comes to the rescue. Finally, we reach the following query:

SELECT Continent, COUNT(*),
  SUBSTRING_INDEX(
    CONCAT(
      GROUP_CONCAT(Name ORDER BY Population DESC),
      ','),
    ',', 1) AS largest_country
FROM `Country` GROUP BY Continent

+---------------+----------+-----------------------------------+
| Continent     | COUNT(*) | largest_country                   |
+---------------+----------+-----------------------------------+
| Asia          |       51 | China                             |
| Europe        |       46 | Russian Federation                |
| North America |       37 | United States                     |
| Africa        |       58 | Nigeria                           |
| Oceania       |       28 | Australia                         |
| Antarctica    |        5 | Heard Island and McDonald Islands |
| South America |       14 | Brazil                            |
+---------------+----------+-----------------------------------+

This query provides us with the right answer. No sub-selects, no joins required. Granted, it's not too pretty, and it's not clean SQL (it's not even standard SQL) - but it just may save some query time: there's an index to add which will cause this query to run "using index":

ALTER TABLE Country ADD INDEX `Continent_Name_Population` (`Continent`,`name`,`Population`)

Notes

  • The above assumes the comma (',') does not appear in the text of the column. Otherwise, you can choose your own delimiter in GROUP_CONCAT().
  • The concatenated string can grow long if many rows are aggregated per group. Check out group_concat_max_len, which limits the length of the GROUP_CONCAT string. Also see max_allowed_packet, or read the whole deal on the MySQL online documentation.

[UPDATE: Turns out an anonymous comment to Kristian Nielsen's post suggested the same solution, about a week ago. Moreover, it suggests that the CONCAT() part in the above solution is not required, as SUBSTRING_INDEX does not require the delimiter to be present]

tags: ,
posted in MySQL by shlomi

« | »

Follow comments via the RSS Feed | Leave a comment | Trackback URL

11 Comments to "Selecting a specific non aggregated column data in GROUP BY"

  1. Giuseppe Maxia wrote:

    Cool hack!
    Kudos

    Giuseppe

  2. Toby wrote:

    Why are you afraid to use a JOIN here?

  3. shlomi wrote:

    Hi Toby,

    This post presents an alternative which is very cheap with the right index, possibly cheaper than a JOIN alternative.

    Shlomi

  4. Toby wrote:

    'Possibly' cheaper? This may be no more than premature obfuscation, then.

  5. shlomi wrote:

    Toby,

    Not so. It was cheaper on my testing benchmarks. I am careful to say "possible" because I cannot guarantee it will be cheaper to use this index on *any* table in *any* database under *any* circumstance, is all.

    Shlomi

  6. Roland Bouman wrote:

    Hi!

    "UPDATE: Turns out an anonymous comment to Kristian Nielsen's post suggested the same solution, about a week ago."

    Yeah, it's mine. I try to sign those dreadful blog interfaces that require me to get an account, but apparently I forgot that time.

    BTW - I made a mistake in that particular example, as I forgot to include a DESC for the ORDER BY (the problem was selecting the last version - not the first)

    Roland Bouman
    http://rpbouman.blogspot.com/

  7. Roland Bouman wrote:

    Hi!

    here's another trick that uses the same device to calculate the median:

    http://rpbouman.blogspot.com/2007/12/calculating-financial-median-in-mysql.html

  8. shlomi wrote:

    Giuseppe - thanks!

    Roland - good to see you here again, was a bit discouraged to have found your comment over there... Wasn't stealing though :)
    Thanks for the median query link - nice one!

  9. Roland Bouman wrote:

    Shlomi: don't be discouraged. It's not about who's first. I am pretty sure someone else figured this one out before me for that matter. They'd have to I guess, I mean, in the early days MySQL didn't have subqueries and GROUP_CONCAT is an excellent device to work around subquery problems (in fact even now some queries that i'd rather solve with a subquery are often solved with better performance using these hacks)

    Personally, I feel the hackishness of GROUP_CONCAT will be fulfilled when it gets a LIMIT clause. It may sound crazy but the idea is that it will make things like the groupwise MAX much more efficient if no time is spent building a large string concatenating all the entries - you could just do LIMIT 1 and save a lot of memory and avoid a warning about group_concat_max_len too.

  10. SQL: selecting top N records per group | code.openark.org wrote:

    [...] Selecting a specific non aggregated column data in GROUP BY [...]

  11. guangnan wrote:

    Smart. Thank you.

Leave Your Comment

 
Powered by Wordpress and MySQL. Theme by openark.org