In a GROUP BY query, MySQL may allow specifying non aggregated columns. For example, using MySQL’s world database, I wish to get the number of countries per continent, along with a “sample” country:
SELECT Continent, COUNT(*), Name FROM `Country` GROUP BY Continent +---------------+----------+----------------+ | Continent | COUNT(*) | Name | +---------------+----------+----------------+ | Asia | 51 | Afghanistan | | Europe | 46 | Albania | | North America | 37 | Aruba | | Africa | 58 | Angola | | Oceania | 28 | American Samoa | | Antarctica | 5 | Antarctica | | South America | 14 | Argentina | +---------------+----------+----------------+
What if I want to choose that “sample” country? For example, for each continent, I wish to show the country with the largest population. To simply see the largest population, I would use MAX(Population). But which country is referred? I wish to provide a solution which does not involve sub-queries, HAVING or JOINs.
We begin by observing MySQL’s GROUP_CONCAT() fcuntion:
SELECT Continent, COUNT(*), GROUP_CONCAT(Name) FROM `Country` GROUP BY Continent +---------------+----------+-------------------------------+ | Continent | COUNT(*) | GROUP_CONCAT(Name) | +---------------+----------+-------------------------------+ | Asia | 51 | Sri Lanka,South Korea,Mald... | | Europe | 46 | Iceland,Faroe Islands,Fran... | | North America | 37 | Martinique,Mexico,Puerto R... | | Africa | 58 | Zambia,South Africa,Nigeri... | | Oceania | 28 | Tokelau,American Samoa,Sam... | | Antarctica | 5 | South Georgia and the Sout... | | South America | 14 | Suriname,Argentina,Ecuador... | +---------------+----------+-------------------------------+
Mmmm. The order by which the names are presented does not seem to be of use to me. Didn’t we say we want to see the country with the largest population? It just so happens that GROUP_CONCAT allows for ORDER BY:
SELECT Continent, COUNT(*), GROUP_CONCAT(Name ORDER BY Population DESC) FROM `Country` GROUP BY Continent +---------------+----------+-------------------------------------------------------+ | Continent | COUNT(*) | GROUP_CONCAT(Name ORDER BY Population DESC) | +---------------+----------+-------------------------------------------------------+ | Asia | 51 | China,India,Indonesia,Pakistan,Bangladesh,Japan,Vi... | | Europe | 46 | Russian Federation,Germany,United Kingdom,France,I... | | North America | 37 | United States,Mexico,Canada,Guatemala,Cuba,Dominic... | | Africa | 58 | Nigeria,Egypt,Ethiopia,Congo, The Democratic Repub... | | Oceania | 28 | Australia,Papua New Guinea,New Zealand,Fiji Island... | | Antarctica | 5 | Heard Island and McDonald Islands,Antarctica,Bouve... | | South America | 14 | Brazil,Colombia,Argentina,Peru,Venezuela,Chile,Ecu... | +---------------+----------+-------------------------------------------------------+
Good progress! We can now see our desired countries first in list. What’s next? We want to isolate the first country for each continent. SUBSTRING_INDEX() comes to the rescue. Finally, we reach the following query:
SELECT Continent, COUNT(*),
SUBSTRING_INDEX(
CONCAT(
GROUP_CONCAT(Name ORDER BY Population DESC),
','),
',', 1) AS largest_country
FROM `Country` GROUP BY Continent
+---------------+----------+-----------------------------------+
| Continent | COUNT(*) | largest_country |
+---------------+----------+-----------------------------------+
| Asia | 51 | China |
| Europe | 46 | Russian Federation |
| North America | 37 | United States |
| Africa | 58 | Nigeria |
| Oceania | 28 | Australia |
| Antarctica | 5 | Heard Island and McDonald Islands |
| South America | 14 | Brazil |
+---------------+----------+-----------------------------------+
This query provides us with the right answer. No sub-selects, no joins required. Granted, it’s not too pretty, and it’s not clean SQL (it’s not even standard SQL) – but it just may save some query time: there’s an index to add which will cause this query to run “using index”:
ALTER TABLE Country ADD INDEX `Continent_Name_Population` (`Continent`,`name`,`Population`)
Notes
- The above assumes the comma (‘,’) does not appear in the text of the column. Otherwise, you can choose your own delimiter in GROUP_CONCAT().
- The concatenated string can grow long if many rows are aggregated per group. Check out group_concat_max_len, which limits the length of the GROUP_CONCAT string. Also see
max_allowed_packet, or read the whole deal on the MySQL online documentation.
[UPDATE: Turns out an anonymous comment to Kristian Nielsen’s post suggested the same solution, about a week ago. Moreover, it suggests that the CONCAT() part in the above solution is not required, as SUBSTRING_INDEX does not require the delimiter to be present]
Cool hack!
Kudos
Giuseppe
Why are you afraid to use a JOIN here?
Hi Toby,
This post presents an alternative which is very cheap with the right index, possibly cheaper than a JOIN alternative.
Shlomi
‘Possibly’ cheaper? This may be no more than premature obfuscation, then.
Toby,
Not so. It was cheaper on my testing benchmarks. I am careful to say “possible” because I cannot guarantee it will be cheaper to use this index on *any* table in *any* database under *any* circumstance, is all.
Shlomi
Hi!
“UPDATE: Turns out an anonymous comment to Kristian Nielsen’s post suggested the same solution, about a week ago.”
Yeah, it’s mine. I try to sign those dreadful blog interfaces that require me to get an account, but apparently I forgot that time.
BTW – I made a mistake in that particular example, as I forgot to include a DESC for the ORDER BY (the problem was selecting the last version – not the first)
Roland Bouman
http://rpbouman.blogspot.com/
Hi!
here’s another trick that uses the same device to calculate the median:
http://rpbouman.blogspot.com/2007/12/calculating-financial-median-in-mysql.html
Giuseppe – thanks!
Roland – good to see you here again, was a bit discouraged to have found your comment over there… Wasn’t stealing though 🙂
Thanks for the median query link – nice one!
Shlomi: don’t be discouraged. It’s not about who’s first. I am pretty sure someone else figured this one out before me for that matter. They’d have to I guess, I mean, in the early days MySQL didn’t have subqueries and GROUP_CONCAT is an excellent device to work around subquery problems (in fact even now some queries that i’d rather solve with a subquery are often solved with better performance using these hacks) Personally, I feel the hackishness of GROUP_CONCAT will be fulfilled when it gets a LIMIT clause. It may sound crazy but the idea is that it will make… Read more »
[…] Selecting a specific non aggregated column data in GROUP BY […]
Smart. Thank you.
[…] See also my old post: Selecting a specific non aggregated column data in GROUP BY […]