MySQL’s character sets and collations are often considered as a mystery, and many users either completely disregard them and keep with the defaults, or set everything to UTF8.
This post will attempt to shed some light on the mystery, and provide with some best practices for use with text columns with regard to character sets.
Character Sets
A thorough discussion of how the character sets have evolved through history is beyond the scope of this post. While the Unicode standard is gaining recognition, the “older” character sets are still around. Understanding the difference between Unicode and local character sets is crucial.
Consider, for example, MySQL’s latin1
character set. In this character set there are 256 different characters, represented by one byte. The first 128 characters map to ASCII, the standard “ABCabc012 dot comma” set, of which most of this post is composed. The latter 128 characters in latin1
are composed of West European specific characters, such as À, ë, õ, Ñ.
A Name VARCHAR(60) CHARSET latin1
column can describe names with West European characters. But it cannot describe Russian or Hebrew names. To represent a name in Hebrew, you’d need the hebrew
charset (ISO 8859-8), in which the first 128 characters are, as always, mapped to ASCII, and the latter 128 characters describe the Hebrew alphabet and punctuation marks, such as ש,ל,מ,ה. The Cyrillic, Arabic and Turkish charsets follow in a similar manner.
Assume now I’m building a world wide web application, such as a popular social network. I would like to store the first names of my users, in every possible language. None of the above character sets support all languages. I therefore turn to Unicode. In particular, MySQL supports utf8
, a Unicode encoding scheme, which is commonly used due to its economic storage requirements.
In Unicode there is a dedicated number for each letter in the known languages, in ancient languages, and some imaginary or otherwise non existing languages, such as Klingon (yes, I know there are people who actually speak Klingon), may yet find their way into the standard.
UTF8 (or utf8), a Unicode encoding scheme, states the following: for ASCII characters, such as ‘a’, ‘6’, ‘$’, only one byte of storage is required. For Hebrew, Cyrillic or Turkish characters, 2 bytes are required. For Japanese, Chinese – more (MySQL supports up to 3 bytes per character). Again, the exact details of the implementation are beyond the scope of this post, and are well described here and here.
What’s important to me is that I can define Name VARCHAR(30) CHARSET utf8
for my columns, and Voila! Any name can be represented in my database.
So why not define everything as utf8 and get done with it?
Well, it just so happens that Unicode comes with a price. See, for example, the following column definition:
CountryCode CHAR(3) CHARSET utf8
We are asking for a column with 3 characters exactly. The required storage for this column will be such that any 3-letter name must fit in. This means (3 characters) times (3 bytes per character) = 9 bytes of storage. So CHAR
and utf8
together may be less than ideal. VARCHAR
behaves better: it only requires as many bytes per character as described above. So the text “abc” will only require 3 bytes (plus VARCHAR
‘s leading 1 or 2 bytes).
Why don’t we drop the ‘CHAR’ altogether, then, and use only ‘VARCHAR’?
Because some values are simply better represented with CHAR
: consider a “password” column, encoded with MD5. The MD5()
function returns a 32 characters long text. It’s always 32 characters, and, moreover, it’s always in ASCII. The best data type and character set definition would be password CHAR(32) CHARSET ascii
. We thus ensure exactly 32 bytes are allocated to this column. A VARCHAR
will acquire an additional byte or two, depending on its defined length, which will indicate the length of the text.
And why would I care about collations?
Collations deal with text comparison. We observed that the default character set in MySQL is latin1
. The default collation is latin1_swedish_ci
. In this collation the following holds true: 'ABC' = 'abc'
.
Wait. What?
Look at the “ci” in latin1_swedish_ci
. It stands for “case insensitive”. Collations which end with “cs” or “bin” are case sensitive. The utf8
character set comes with utf8_general_ci
collation. This can make sense. Let’s review our web application table (I’m using plain text passwords here, bare with me for this example):
CREATE TABLE my_users ( name VARCHAR(30) CHARSET utf8 COLLATE utf8_general_ci, plainPassword VARCHAR(16) CHARSET ASCII, UNIQUE KEY (name) ); INSERT INTO my_users (name, password) VALUES ('David', 'mypass');
It holds true that the name ‘David’ equals ‘david’. If I were to SELECT * FROM my_users WHERE name='david'
, I would find the desired row. The unique key will also guarantee that no daVID user can be added.
But David certainly wouldn’t want users to login with the password ‘MYPASS’. So we refine our table:
CREATE TABLE my_users ( name VARCHAR(30) CHARSET utf8 COLLATE utf8_general_ci, plainPassword VARCHAR(16) CHARSET ascii COLLATE ascii_bin, UNIQUE KEY (name) );
The ascii_bin
collation is a case sensitive collation for ascii
. The following will not find anything:
SELECT * FROM my_users WHERE name='david' AND plainPassword='MYPASS';
Holding a plain text password in your database is not a best practice, but apparently it’s common.
Collations also deal with text ordering. For any two strings, the collation determines which is larger, or if they are equal. Probably the most common situation you see collations in action is when you ORDER BY a text column.
Also keep in mind
- When you check for length of strings, do you use the
LENGTH()
function, as inSELECT LENGTH(Name) FROM City
? You probably wish to replace this withCHAR_LENGTH()
.LENGTH()
returns the number of bytes required for the text storage.CHAR_LENGTH()
returns the number of characters in the text, and is usually what you are looking for. It may hold true that for a string s,LENGTH(s)=12
andCHAR_LENGTH(s)=8
. Watch out for these glitches. - You can converts texts between character sets with
CONVERT
. For example:CONVERT(s USING utf8)
- Stored routines should not be overlooked. If your stored routine accepts a text argument, or if your stored function returns one, make sure the character sets are properly defined. If not, then your utf8 text may be converted to latin1 during the call to your stored routine. This also applies to local parameters within the stored routines.
- An
ALTER TABLE <some table> CONVERT TO <some charset>
will change the character set not only for the table itself, but also for all existing textual columns.
See the following post: Useful database analysis queries with INFORMATION_SCHEMA for queries which diagnose your databases character sets.
Very interesting article. Any reason why utf8_general_cs or utf8_general_bin does not exist ?
@Vincent
There’s the utf8_bin collation
Hello Shlomi,
First of all sorry for my bad english.
About Collation: What is the deiference betwen latin1_general_ci, latin1_spanish_ci and latin1_swedish_ci??
Thanks in advance,
Milton
Hi,
Since I do not speak Spanish nor Swedish I can’t list the exact differences, if any.
Anyhow, it suggests the order of characters (dictionary-wise) in the latin1 alphabet. It is possible that two characters in latin1 will have different order in two different collations.
I believe the MySQL manual provides such example with German.
Great post thanks.