MySQL's character sets and collations demystified

December 8, 2008

MySQL's character sets and collations are often considered as a mystery, and many users either completely disregard them and keep with the defaults, or set everything to UTF8.

This post will attempt to shed some light on the mystery, and provide with some best practices for use with text columns with regard to character sets.

Character Sets

A thorough discussion of how the character sets have evolved through history is beyond the scope of this post. While the Unicode standard is gaining recognition, the "older" character sets are still around. Understanding the difference between Unicode and local character sets is crucial.

Consider, for example, MySQL's latin1 character set. In this character set there are 256 different characters, represented by one byte. The first 128 characters map to ASCII, the standard "ABCabc012 dot comma" set, of which most of this post is composed. The latter 128 characters in latin1 are composed of West European specific characters, such as À, ë, õ, Ñ.

A Name VARCHAR(60) CHARSET latin1 column can describe names with West European characters. But it cannot describe Russian or Hebrew names. To represent a name in Hebrew, you'd need the hebrew charset (ISO 8859-8), in which the first 128 characters are, as always, mapped to ASCII, and the latter 128 characters describe the Hebrew alphabet and punctuation marks, such as ש,ל,מ,ה. The Cyrillic, Arabic and Turkish charsets follow in a similar manner.

Assume now I'm building a world wide web application, such as a popular social network. I would like to store the first names of my users, in every possible language. None of the above character sets support all languages. I therefore turn to Unicode. In particular, MySQL supports utf8, a Unicode encoding scheme, which is commonly used due to its economic storage requirements.

In Unicode there is a dedicated number for each letter in the known languages, in ancient languages, and some imaginary or otherwise non existing languages, such as Klingon (yes, I know there are people who actually speak Klingon), may yet find their way into the standard.

UTF8 (or utf8), a Unicode encoding scheme, states the following: for ASCII characters, such as 'a', '6', '$', only one byte of storage is required. For Hebrew, Cyrillic or Turkish characters, 2 bytes are required. For Japanese, Chinese - more (MySQL supports up to 3 bytes per character). Again, the exact details of the implementation are beyond the scope of this post, and are well described here and here.

What's important to me is that I can define Name VARCHAR(30) CHARSET utf8 for my columns, and Voila! Any name can be represented in my database.

So why not define everything as utf8 and get done with it?

Well, it just so happens that Unicode comes with a price. See, for example, the following column definition:

CountryCode CHAR(3) CHARSET utf8

We are asking for a column with 3 characters exactly. The required storage for this column will be such that any 3-letter name must fit in. This means (3 characters) times (3 bytes per character) = 9 bytes of storage. So CHAR and utf8 together may be less than ideal. VARCHAR behaves better: it only requires as many bytes per character as described above. So the text "abc" will only require 3 bytes (plus VARCHAR's leading 1 or 2 bytes).

Why don't we drop the 'CHAR' altogether, then, and use only 'VARCHAR'?

Because some values are simply better represented with CHAR: consider a "password" column, encoded with MD5. The MD5() function returns a 32 characters long text. It's always 32 characters, and, moreover, it's always in ASCII. The best data type and character set definition would be password CHAR(32) CHARSET ascii. We thus ensure exactly 32 bytes are allocated to this column. A VARCHAR will acquire an additional byte or two, depending on its defined length, which will indicate the length of the text.

And why would I care about collations?

Collations deal with text comparison. We observed that the default character set in MySQL is latin1. The default collation is latin1_swedish_ci. In this collation the following holds true: 'ABC' = 'abc'.

Wait. What?

Look at the "ci" in latin1_swedish_ci. It stands for "case insensitive". Collations which end with "cs" or "bin" are case sensitive. The utf8 character set comes with utf8_general_ci collation. This can make sense. Let's review our web application table (I'm using plain text passwords here, bare with me for this example):

CREATE TABLE my_users (
  name VARCHAR(30) CHARSET utf8 COLLATE utf8_general_ci,
  plainPassword VARCHAR(16) CHARSET ASCII,
  UNIQUE KEY (name)
);
INSERT INTO my_users (name, password) VALUES ('David', 'mypass');

It holds true that the name 'David' equals 'david'. If I were to SELECT * FROM my_users WHERE name='david', I would find the desired row. The unique key will also guarantee that no daVID user can be added.

But David certainly wouldn't want users to login with the password 'MYPASS'. So we refine our table:

CREATE TABLE my_users (
  name VARCHAR(30) CHARSET utf8 COLLATE utf8_general_ci,
  plainPassword VARCHAR(16) CHARSET ascii COLLATE ascii_bin,
  UNIQUE KEY (name)
);

The ascii_bin collation is a case sensitive collation for ascii. The following will not find anything:

SELECT * FROM my_users WHERE name='david' AND plainPassword='MYPASS';

Holding a plain text password in your database is not a best practice, but apparently it's common.

Collations also deal with text ordering. For any two strings, the collation determines which is larger, or if they are equal. Probably the most common situation you see collations in action is when you ORDER BY a text column.

Also keep in mind

  • When you check for length of strings, do you use the LENGTH() function, as in SELECT LENGTH(Name) FROM City? You probably wish to replace this with CHAR_LENGTH(). LENGTH() returns the number of bytes required for the text storage. CHAR_LENGTH() returns the number of characters in the text, and is usually what you are looking for. It may hold true that for a string s, LENGTH(s)=12 and CHAR_LENGTH(s)=8. Watch out for these glitches.
  • You can converts texts between character sets with CONVERT. For example: CONVERT(s USING utf8)
  • Stored routines should not be overlooked. If your stored routine accepts a text argument, or if your stored function returns one, make sure the character sets are properly defined. If not, then your utf8 text may be converted to latin1 during the call to your stored routine. This also applies to local parameters within the stored routines.
  • An ALTER TABLE <some table> CONVERT TO <some charset> will change the character set not only for the table itself, but also for all existing textual columns.

See the following post: Useful database analysis queries with INFORMATION_SCHEMA for queries which diagnose your databases character sets.

tags: , ,
posted in MySQL by shlomi

« | »

Follow comments via the RSS Feed | Leave a comment | Trackback URL

21 Comments to "MySQL's character sets and collations demystified"

  1. Jason Stubbs wrote:

    It's also worth noting that the difference between CHAR/VARCHAR space savings only occur on disk as fields are always fixed width in memory. This means that even though setting the md5 password column to be a VARCHAR(32) CHARSET utf8 wouldn't be much different to CHAR(32) CHARSET ascii on disk, in memory the usage would still be three times.

  2. shlomi wrote:

    Jason, thanks, good point.

    It even goes beyond that: whenever MySQL creates a temporary table due to non-indexed ORDER BY, or a VIEW query etc., it allocates the full extent for VARCHAR, as if it were a CHAR field. This means that expensive operations which must pay with temporary tables, may become prohibitively more expensive.

  3. The Developer Day » Blog Archive » MySQL character sets and collations wrote:

    [...] great article i found about MySQL character sets and collations that demystifies all the so often found problems in projects made by confused or not so [...]

  4. Chris wrote:

    An md5 of the password is not actually ASCII but hexadecimal. You would save twice the space if you saved it as a binary field and used HEX() and UNHEX() to insert and read the data.

  5. Vladislav Vaintroub wrote:

    Nope, there is no Klingon in Unicode.

  6. Vladislav Vaintroub wrote:

    And there will probably never be Klingon in Unicode. They submitted their proposal for 4.0 and it got rejected at that time and later for 5.0. The language is too weird and unreal to be included

  7. shlomi wrote:

    Vladislav,
    Like I said, it may yet find its way into Unicode... :)
    There's Tolkien's elves language, as well, which is competing for Unicode as well, as far as I know.

    Regards

  8. shlomi wrote:

    Chris,

    Thanks, indeed so.
    MySQL's PASSWORD() function also returns hexadecimal text, but nevertheless the passwords are stored as CHAR(41), not BINARY(20).
    It's common use to store MD5 as text. I'm not sure myself if it should be so.
    See my earlier post: http://code.openark.org/blog/?p=85

    Shlomi

  9. PHP Freaks Forums > SQL / Database Forums > MySQL Help > Topic: MySQL’s character sets and collations demystified wrote:

    This article has been 'sticked' on phpfreaks forum.

  10. MySQL 5.1 riding on UTF8 collation (client AND server) Tutorial wrote:

    [...] So what exactly is collation in regards to the MySQL’s client and server? Well collation is how the MySQL client and server deals with sorting and comparing text in the database. A more formal explanation of collations and character sets can be found here (http://code.openark.org/blog/mysql/mysqls-character-sets-and-collations-demystified). [...]

  11. Vincent wrote:

    Very interesting article. Any reason why utf8_general_cs or utf8_general_bin does not exist ?

  12. shlomi wrote:

    @Vincent

    There's the utf8_bin collation

  13. Milton wrote:

    Hello Shlomi,
    First of all sorry for my bad english.

    About Collation: What is the deiference betwen latin1_general_ci, latin1_spanish_ci and latin1_swedish_ci??

    Thanks in advance,

    Milton

  14. shlomi wrote:

    Hi,

    Since I do not speak Spanish nor Swedish I can't list the exact differences, if any.
    Anyhow, it suggests the order of characters (dictionary-wise) in the latin1 alphabet. It is possible that two characters in latin1 will have different order in two different collations.
    I believe the MySQL manual provides such example with German.

  15. Tali Luvhengo wrote:

    Great post thanks.

  16. MySQL collations demystified :: Kelvin Tan - Lucene Solr Nutch Consultant wrote:

    [...] by Kelvin on 07 Jun 2010 at 06:52 pm | Tagged as: programming Good article here: http://code.openark.org/blog/mysql/mysqls-character-sets-and-collations-demystified MySQL’s character sets and collations are often considered as a mystery, and many users either [...]

  17. Aneeq wrote:

    The following link has an excellent explanation of how to enable Arabic, Urdu and other foreign language support in PHP/MySQL.

    http://addr.pk/a44d

    OR

    http://phphelp.co/2012/03/30/how-to-make-php-and-mysql-support-arabic-urdu-and-charactersets-or-other-foreign-languages/

  18. unreal4u wrote:

    What you are saying about CHAR being allocated as 3 byte per character seems not entirely true in my tests, I executed the following:


    CREATE TABLE a (
    a CHAR(3) CHARSET utf8 COLLATE utf8_general_ci
    ) ENGINE=MyISAM;

    INSERT INTO a VALUES ('abc'),('áéí'),('ábc');

    SELECT LENGTH(a) AS byteLength, CHAR_LENGTH(a) AS charLength FROM a;

    This will deliver the following result:

    3 3
    6 3
    4 3

    The BYTE size varies according to which type of characters are stored, so it seems that MySQL is capable of varying the size of the internal stored data, even if the column is defined as a fixed length field.

    Am I correct or did I miss something?

    Greetings.

  19. unreal4u wrote:

    @Milton: the different collations you mention incorporate or excludes different letters which alters the ordering process of a query. I don't know about swedish, but one of the main differences between spanish_ci and spanish2_ci is that the latter includes the letters "CH" (between C and D) and "LL" (between L and M), which are not valid anymore in modern spanish. I've never used latin1_general_ci though, but I think it must include "CH" but not "LL" or something like that.

  20. Sobre COLLATION y CHARSET en MySQL « unreal4u's Personal Network wrote:

    [...] code.openark.org dev [...]

  21. shlomi wrote:

    @unreal4u,

    Very interesting indeed. I believe the LENGTH() function returns the size in bytes of the given string, not the size of the stored string. That is, it is independent of the underlying storage type.

Leave Your Comment

 
Powered by Wordpress and MySQL. Theme by openark.org