In a GROUP BY query, MySQL may allow specifying non aggregated columns. For example, using MySQL’s world database, I wish to get the number of countries per continent, along with a “sample” country:
SELECT Continent, COUNT(*), Name FROM `Country` GROUP BY Continent +---------------+----------+----------------+ | Continent | COUNT(*) | Name | +---------------+----------+----------------+ | Asia | 51 | Afghanistan | | Europe | 46 | Albania | | North America | 37 | Aruba | | Africa | 58 | Angola | | Oceania | 28 | American Samoa | | Antarctica | 5 | Antarctica | | South America | 14 | Argentina | +---------------+----------+----------------+
What if I want to choose that “sample” country? For example, for each continent, I wish to show the country with the largest population. To simply see the largest population, I would use MAX(Population). But which country is referred? I wish to provide a solution which does not involve sub-queries, HAVING or JOINs.
We begin by observing MySQL’s GROUP_CONCAT() fcuntion:
SELECT Continent, COUNT(*), GROUP_CONCAT(Name) FROM `Country` GROUP BY Continent +---------------+----------+-------------------------------+ | Continent | COUNT(*) | GROUP_CONCAT(Name) | +---------------+----------+-------------------------------+ | Asia | 51 | Sri Lanka,South Korea,Mald... | | Europe | 46 | Iceland,Faroe Islands,Fran... | | North America | 37 | Martinique,Mexico,Puerto R... | | Africa | 58 | Zambia,South Africa,Nigeri... | | Oceania | 28 | Tokelau,American Samoa,Sam... | | Antarctica | 5 | South Georgia and the Sout... | | South America | 14 | Suriname,Argentina,Ecuador... | +---------------+----------+-------------------------------+
Mmmm. The order by which the names are presented does not seem to be of use to me. Didn’t we say we want to see the country with the largest population? It just so happens that GROUP_CONCAT allows for ORDER BY:
SELECT Continent, COUNT(*), GROUP_CONCAT(Name ORDER BY Population DESC) FROM `Country` GROUP BY Continent +---------------+----------+-------------------------------------------------------+ | Continent | COUNT(*) | GROUP_CONCAT(Name ORDER BY Population DESC) | +---------------+----------+-------------------------------------------------------+ | Asia | 51 | China,India,Indonesia,Pakistan,Bangladesh,Japan,Vi... | | Europe | 46 | Russian Federation,Germany,United Kingdom,France,I... | | North America | 37 | United States,Mexico,Canada,Guatemala,Cuba,Dominic... | | Africa | 58 | Nigeria,Egypt,Ethiopia,Congo, The Democratic Repub... | | Oceania | 28 | Australia,Papua New Guinea,New Zealand,Fiji Island... | | Antarctica | 5 | Heard Island and McDonald Islands,Antarctica,Bouve... | | South America | 14 | Brazil,Colombia,Argentina,Peru,Venezuela,Chile,Ecu... | +---------------+----------+-------------------------------------------------------+
Good progress! We can now see our desired countries first in list. What’s next? We want to isolate the first country for each continent. SUBSTRING_INDEX() comes to the rescue. Finally, we reach the following query:
SELECT Continent, COUNT(*), SUBSTRING_INDEX( CONCAT( GROUP_CONCAT(Name ORDER BY Population DESC), ','), ',', 1) AS largest_country FROM `Country` GROUP BY Continent +---------------+----------+-----------------------------------+ | Continent | COUNT(*) | largest_country | +---------------+----------+-----------------------------------+ | Asia | 51 | China | | Europe | 46 | Russian Federation | | North America | 37 | United States | | Africa | 58 | Nigeria | | Oceania | 28 | Australia | | Antarctica | 5 | Heard Island and McDonald Islands | | South America | 14 | Brazil | +---------------+----------+-----------------------------------+
This query provides us with the right answer. No sub-selects, no joins required. Granted, it’s not too pretty, and it’s not clean SQL (it’s not even standard SQL) – but it just may save some query time: there’s an index to add which will cause this query to run “using index”:
ALTER TABLE Country ADD INDEX `Continent_Name_Population` (`Continent`,`name`,`Population`)
Notes
- The above assumes the comma (‘,’) does not appear in the text of the column. Otherwise, you can choose your own delimiter in GROUP_CONCAT().
- The concatenated string can grow long if many rows are aggregated per group. Check out group_concat_max_len, which limits the length of the GROUP_CONCAT string. Also see
max_allowed_packet
, or read the whole deal on the MySQL online documentation.
[UPDATE: Turns out an anonymous comment to Kristian Nielsen’s post suggested the same solution, about a week ago. Moreover, it suggests that the CONCAT() part in the above solution is not required, as SUBSTRING_INDEX does not require the delimiter to be present]
Cool hack!
Kudos
Giuseppe
Why are you afraid to use a JOIN here?
Hi Toby,
This post presents an alternative which is very cheap with the right index, possibly cheaper than a JOIN alternative.
Shlomi
‘Possibly’ cheaper? This may be no more than premature obfuscation, then.
Toby,
Not so. It was cheaper on my testing benchmarks. I am careful to say “possible” because I cannot guarantee it will be cheaper to use this index on *any* table in *any* database under *any* circumstance, is all.
Shlomi