Selecting a specific non aggregated column data in GROUP BY

In a GROUP BY query, MySQL may allow specifying non aggregated columns. For example, using MySQL’s world database, I wish to get the number of countries per continent, along with a “sample” country:

SELECT Continent, COUNT(*), Name
FROM `Country` GROUP BY Continent

+---------------+----------+----------------+
| Continent     | COUNT(*) | Name           |
+---------------+----------+----------------+
| Asia          |       51 | Afghanistan    |
| Europe        |       46 | Albania        |
| North America |       37 | Aruba          |
| Africa        |       58 | Angola         |
| Oceania       |       28 | American Samoa |
| Antarctica    |        5 | Antarctica     |
| South America |       14 | Argentina      |
+---------------+----------+----------------+

What if I want to choose that “sample” country? For example, for each continent, I wish to show the country with the largest population. To simply see the largest population, I would use MAX(Population). But which country is referred? I wish to provide a solution which does not involve sub-queries, HAVING or JOINs.

We begin by observing MySQL’s GROUP_CONCAT() fcuntion:

SELECT Continent, COUNT(*), GROUP_CONCAT(Name)
FROM `Country` GROUP BY Continent

+---------------+----------+-------------------------------+
| Continent     | COUNT(*) | GROUP_CONCAT(Name)            |
+---------------+----------+-------------------------------+
| Asia          |       51 | Sri Lanka,South Korea,Mald... |
| Europe        |       46 | Iceland,Faroe Islands,Fran... |
| North America |       37 | Martinique,Mexico,Puerto R... |
| Africa        |       58 | Zambia,South Africa,Nigeri... |
| Oceania       |       28 | Tokelau,American Samoa,Sam... |
| Antarctica    |        5 | South Georgia and the Sout... |
| South America |       14 | Suriname,Argentina,Ecuador... |
+---------------+----------+-------------------------------+

Mmmm. The order by which the names are presented does not seem to be of use to me. Didn’t we say we want to see the country with the largest population? It just so happens that GROUP_CONCAT allows for ORDER BY:

SELECT Continent, COUNT(*),
  GROUP_CONCAT(Name ORDER BY Population DESC)
FROM `Country` GROUP BY Continent

+---------------+----------+-------------------------------------------------------+
| Continent     | COUNT(*) | GROUP_CONCAT(Name ORDER BY Population DESC)           |
+---------------+----------+-------------------------------------------------------+
| Asia          |       51 | China,India,Indonesia,Pakistan,Bangladesh,Japan,Vi... |
| Europe        |       46 | Russian Federation,Germany,United Kingdom,France,I... |
| North America |       37 | United States,Mexico,Canada,Guatemala,Cuba,Dominic... |
| Africa        |       58 | Nigeria,Egypt,Ethiopia,Congo, The Democratic Repub... |
| Oceania       |       28 | Australia,Papua New Guinea,New Zealand,Fiji Island... |
| Antarctica    |        5 | Heard Island and McDonald Islands,Antarctica,Bouve... |
| South America |       14 | Brazil,Colombia,Argentina,Peru,Venezuela,Chile,Ecu... |
+---------------+----------+-------------------------------------------------------+

Good progress! We can now see our desired countries first in list. What’s next? We want to isolate the first country for each continent. SUBSTRING_INDEX() comes to the rescue. Finally, we reach the following query:

SELECT Continent, COUNT(*),
  SUBSTRING_INDEX(
    CONCAT(
      GROUP_CONCAT(Name ORDER BY Population DESC),
      ','),
    ',', 1) AS largest_country
FROM `Country` GROUP BY Continent

+---------------+----------+-----------------------------------+
| Continent     | COUNT(*) | largest_country                   |
+---------------+----------+-----------------------------------+
| Asia          |       51 | China                             |
| Europe        |       46 | Russian Federation                |
| North America |       37 | United States                     |
| Africa        |       58 | Nigeria                           |
| Oceania       |       28 | Australia                         |
| Antarctica    |        5 | Heard Island and McDonald Islands |
| South America |       14 | Brazil                            |
+---------------+----------+-----------------------------------+

This query provides us with the right answer. No sub-selects, no joins required. Granted, it’s not too pretty, and it’s not clean SQL (it’s not even standard SQL) – but it just may save some query time: there’s an index to add which will cause this query to run “using index”:

ALTER TABLE Country ADD INDEX `Continent_Name_Population` (`Continent`,`name`,`Population`)

Notes

  • The above assumes the comma (‘,’) does not appear in the text of the column. Otherwise, you can choose your own delimiter in GROUP_CONCAT().
  • The concatenated string can grow long if many rows are aggregated per group. Check out group_concat_max_len, which limits the length of the GROUP_CONCAT string. Also see max_allowed_packet, or read the whole deal on the MySQL online documentation.

[UPDATE: Turns out an anonymous comment to Kristian Nielsen’s post suggested the same solution, about a week ago. Moreover, it suggests that the CONCAT() part in the above solution is not required, as SUBSTRING_INDEX does not require the delimiter to be present]

12
Leave a Reply

avatar
12 Comment threads
0 Thread replies
0 Followers
 
Most reacted comment
Hottest comment thread
5 Comment authors
guangnanRoland BoumanshlomiTobyGiuseppe Maxia Recent comment authors

This site uses Akismet to reduce spam. Learn how your comment data is processed.

  Subscribe  
Notify of
Giuseppe Maxia
Guest

Cool hack!
Kudos

Giuseppe

Toby
Guest

Why are you afraid to use a JOIN here?

Toby
Guest

‘Possibly’ cheaper? This may be no more than premature obfuscation, then.

Roland Bouman
Guest

Hi!

“UPDATE: Turns out an anonymous comment to Kristian Nielsen’s post suggested the same solution, about a week ago.”

Yeah, it’s mine. I try to sign those dreadful blog interfaces that require me to get an account, but apparently I forgot that time.

BTW – I made a mistake in that particular example, as I forgot to include a DESC for the ORDER BY (the problem was selecting the last version – not the first)

Roland Bouman
http://rpbouman.blogspot.com/

Roland Bouman
Guest

Hi!

here’s another trick that uses the same device to calculate the median:

http://rpbouman.blogspot.com/2007/12/calculating-financial-median-in-mysql.html

Roland Bouman
Guest

Shlomi: don’t be discouraged. It’s not about who’s first. I am pretty sure someone else figured this one out before me for that matter. They’d have to I guess, I mean, in the early days MySQL didn’t have subqueries and GROUP_CONCAT is an excellent device to work around subquery problems (in fact even now some queries that i’d rather solve with a subquery are often solved with better performance using these hacks) Personally, I feel the hackishness of GROUP_CONCAT will be fulfilled when it gets a LIMIT clause. It may sound crazy but the idea is that it will make… Read more »

trackback

[…] Selecting a specific non aggregated column data in GROUP BY […]

guangnan
Guest
guangnan

Smart. Thank you.

trackback

[…] See also my old post: Selecting a specific non aggregated column data in GROUP BY […]