Character sets: latin1 vs. ascii

MySQLConfiguration Data TypesJuly 8, 2009July 8, 2009

Unless specified otherwise, latin1 is the default character set in MySQL.

What I usually find in schemes are columns which are either utf8 or latin1. The utf8 columns being those which need to contain multilingual characters (user names, addresses, articles etc.), and latin1 column being all the rest (passwords, digests, email addresses, hard-coded values etc.)

I find latin1 to be improper for such purposes and suggest that ascii be used instead. The reason being that latin1 implies a European text (with swedish collation). It is unclear for an outsider, when finding a latin1 column, whether it should actually contain West European characters, or is it just being used for ascii text, utilizing the fact that a character in latin1 only requires 1 byte of storage.

Well, this is what the ascii character set is for. When I see an ascii column, I know for sure no West European characters are allowed; just the plain old a-zA-Z0-9 etc. It is clearer from the schema’s definition what the stored values should be.

A note to MySQL

It’s been long since the Swedish roots of the company have dictated defaults. New instances should default to either ascii or utf8 (the latter being the most common and space efficient unicode protocol): character sets that are locale-neutral. Really, how many people realize that when they ORDER BY a text column, rows are sorted according to Swedish dictionary ordering?

8 thoughts on “Character sets: latin1 vs. ascii”

Mchl says:

July 8, 2009 at 10:16 am

Latin1 covers Western European languages. Central Europe is covered by Latin2 CP. 😉

I agree though, utf8 should be introduced as a default encoding, and utf8_general_ci as default collation. AFAIK utf8 stores ASCII characters as single byte values.
shlomi says:

July 8, 2009 at 10:38 am

Thanks for the correction; I’ve updated the text.

I have the opinion that collations should be case sensitive by default; this makes for faster comparisons.

utf8 encodes ASCII as single character – true; by MySQL and its engines do not necessarily follow. For example, if you have CHAR(10) CHARSET utf8, then each such value will take exactly 30 bytes, regardless of content. See also: MySQL’s character sets and collations demystified
hartmut says:

July 8, 2009 at 11:47 am

> For example, if you have CHAR(10) CHARSET utf8, then each such value will take exactly 30 bytes, regardless of content

well, you asked for a fixed size column, so you got a fixed size column, and as it is fixed size it needs to be big enough to store 10 3 byte utf8 sequences up front
shlomi says:

July 8, 2009 at 12:09 pm

hartmut,

Thanks, I think we both agree here.
I saw need to mention that because the misconception that utf8 columns will always require only as much storage as needed – is widespread.
So the notion of “you asked for a fixed size column” is not clear to some.

I hope this clarifies.
Regards
shlomi says:

July 8, 2009 at 12:34 pm

Mchl,

Just as another example, we can define a VARCHAR, utf8 column on a MEMORY table.
I wasn’t asking for fixed width – but MySQL/MEMORY made it so.

Regards

This site uses Akismet to reduce spam. Learn how your comment data is processed.

A note to MySQL

8 thoughts on “Character sets: latin1 vs. ascii”

Leave a Reply