Selecting a specific non aggregated column data in GROUP BY

In a GROUP BY query, MySQL may allow specifying non aggregated columns. For example, using MySQL’s world database, I wish to get the number of countries per continent, along with a “sample” country:

SELECT Continent, COUNT(*), Name
FROM `Country` GROUP BY Continent

+---------------+----------+----------------+
| Continent     | COUNT(*) | Name           |
+---------------+----------+----------------+
| Asia          |       51 | Afghanistan    |
| Europe        |       46 | Albania        |
| North America |       37 | Aruba          |
| Africa        |       58 | Angola         |
| Oceania       |       28 | American Samoa |
| Antarctica    |        5 | Antarctica     |
| South America |       14 | Argentina      |
+---------------+----------+----------------+

What if I want to choose that “sample” country? For example, for each continent, I wish to show the country with the largest population. To simply see the largest population, I would use MAX(Population). But which country is referred? I wish to provide a solution which does not involve sub-queries, HAVING or JOINs.

We begin by observing MySQL’s GROUP_CONCAT() fcuntion:

SELECT Continent, COUNT(*), GROUP_CONCAT(Name)
FROM `Country` GROUP BY Continent

+---------------+----------+-------------------------------+
| Continent     | COUNT(*) | GROUP_CONCAT(Name)            |
+---------------+----------+-------------------------------+
| Asia          |       51 | Sri Lanka,South Korea,Mald... |
| Europe        |       46 | Iceland,Faroe Islands,Fran... |
| North America |       37 | Martinique,Mexico,Puerto R... |
| Africa        |       58 | Zambia,South Africa,Nigeri... |
| Oceania       |       28 | Tokelau,American Samoa,Sam... |
| Antarctica    |        5 | South Georgia and the Sout... |
| South America |       14 | Suriname,Argentina,Ecuador... |
+---------------+----------+-------------------------------+

Mmmm. The order by which the names are presented does not seem to be of use to me. Didn’t we say we want to see the country with the largest population? It just so happens that GROUP_CONCAT allows for ORDER BY:

SELECT Continent, COUNT(*),
  GROUP_CONCAT(Name ORDER BY Population DESC)
FROM `Country` GROUP BY Continent

+---------------+----------+-------------------------------------------------------+
| Continent     | COUNT(*) | GROUP_CONCAT(Name ORDER BY Population DESC)           |
+---------------+----------+-------------------------------------------------------+
| Asia          |       51 | China,India,Indonesia,Pakistan,Bangladesh,Japan,Vi... |
| Europe        |       46 | Russian Federation,Germany,United Kingdom,France,I... |
| North America |       37 | United States,Mexico,Canada,Guatemala,Cuba,Dominic... |
| Africa        |       58 | Nigeria,Egypt,Ethiopia,Congo, The Democratic Repub... |
| Oceania       |       28 | Australia,Papua New Guinea,New Zealand,Fiji Island... |
| Antarctica    |        5 | Heard Island and McDonald Islands,Antarctica,Bouve... |
| South America |       14 | Brazil,Colombia,Argentina,Peru,Venezuela,Chile,Ecu... |
+---------------+----------+-------------------------------------------------------+

Good progress! We can now see our desired countries first in list. What’s next? We want to isolate the first country for each continent. SUBSTRING_INDEX() comes to the rescue. Finally, we reach the following query:

SELECT Continent, COUNT(*),
  SUBSTRING_INDEX(
    CONCAT(
      GROUP_CONCAT(Name ORDER BY Population DESC),
      ','),
    ',', 1) AS largest_country
FROM `Country` GROUP BY Continent

+---------------+----------+-----------------------------------+
| Continent     | COUNT(*) | largest_country                   |
+---------------+----------+-----------------------------------+
| Asia          |       51 | China                             |
| Europe        |       46 | Russian Federation                |
| North America |       37 | United States                     |
| Africa        |       58 | Nigeria                           |
| Oceania       |       28 | Australia                         |
| Antarctica    |        5 | Heard Island and McDonald Islands |
| South America |       14 | Brazil                            |
+---------------+----------+-----------------------------------+

This query provides us with the right answer. No sub-selects, no joins required. Granted, it’s not too pretty, and it’s not clean SQL (it’s not even standard SQL) – but it just may save some query time: there’s an index to add which will cause this query to run “using index”:

ALTER TABLE Country ADD INDEX `Continent_Name_Population` (`Continent`,`name`,`Population`)

Notes

The above assumes the comma (‘,’) does not appear in the text of the column. Otherwise, you can choose your own delimiter in GROUP_CONCAT().
The concatenated string can grow long if many rows are aggregated per group. Check out group_concat_max_len, which limits the length of the GROUP_CONCAT string. Also see max_allowed_packet, or read the whole deal on the MySQL online documentation.

[UPDATE: Turns out an anonymous comment to Kristian Nielsen’s post suggested the same solution, about a week ago. Moreover, it suggests that the CONCAT() part in the above solution is not required, as SUBSTRING_INDEX does not require the delimiter to be present]

12 thoughts on “Selecting a specific non aggregated column data in GROUP BY”

Giuseppe Maxia says:

December 1, 2008 at 6:51 pm

Cool hack!
Kudos

Giuseppe
Toby says:

December 1, 2008 at 7:06 pm

Why are you afraid to use a JOIN here?
shlomi says:

December 1, 2008 at 7:33 pm

Hi Toby,

This post presents an alternative which is very cheap with the right index, possibly cheaper than a JOIN alternative.

Shlomi
Toby says:

December 1, 2008 at 7:35 pm

‘Possibly’ cheaper? This may be no more than premature obfuscation, then.
shlomi says:

December 1, 2008 at 7:38 pm

Toby,

Not so. It was cheaper on my testing benchmarks. I am careful to say “possible” because I cannot guarantee it will be cheaper to use this index on *any* table in *any* database under *any* circumstance, is all.

Shlomi

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Notes

12 thoughts on “Selecting a specific non aggregated column data in GROUP BY”

Leave a Reply