Comments on: Character sets: latin1 vs. ascii

By: shlomi

shlomi — Thu, 09 Jul 2009 04:43:58 +0000

Hi Brian,

Somehow I’m not surprised. You guys take the good stuff and throw away the rest!

Shlomi

By: Brian Aker

Brian Aker — Wed, 08 Jul 2009 15:38:19 +0000

Hi!

In Drizzle we made utf8 the default and optimized around it (the default collation is utf8_general_ci). For anything else? Just use binary.

Cheers,
-Brian

By: Mchl

Mchl — Wed, 08 Jul 2009 12:07:57 +0000

Yeah. I forgot how VARCHAR behaves in MEMORY for a moment.
It gets tricky indeed 😉

Personally, I use case-insensitive collations more often (for user-supplied data at least).

By: shlomi

shlomi — Wed, 08 Jul 2009 10:34:23 +0000

Mchl,

Just as another example, we can define a VARCHAR, utf8 column on a MEMORY table.
I wasn’t asking for fixed width – but MySQL/MEMORY made it so.

Regards

By: shlomi

shlomi — Wed, 08 Jul 2009 10:09:59 +0000

hartmut,

Thanks, I think we both agree here.
I saw a need to mention it because the misconception that utf8 columns always require only as much storage as their content needs is widespread.
So the notion of “you asked for a fixed size column” is not clear to some.

I hope this clarifies.
Regards

By: hartmut

hartmut — Wed, 08 Jul 2009 09:47:10 +0000

> For example, if you have CHAR(10) CHARSET utf8, then each such value will take exactly 30 bytes, regardless of content

Well, you asked for a fixed size column, so you got a fixed size column, and as it is fixed size it needs to be big enough to store ten 3-byte utf8 sequences up front.
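
The arithmetic behind that 30-byte figure can be sketched in a few lines of Python. This is only an illustration, not anything MySQL exposes: the `MAX_BYTES_PER_CHAR` table and the `fixed_width_bytes` helper are hypothetical names, and the sketch assumes the fixed-width case discussed here, where the column reserves the worst-case width for every character.

```python
# Hypothetical helper: worst-case storage for a fixed-width CHAR(n) column.
# Assumes MySQL's utf8 (at most 3 bytes per character) and single-byte
# latin1/ascii, as discussed in this thread.
MAX_BYTES_PER_CHAR = {"ascii": 1, "latin1": 1, "utf8": 3}


def fixed_width_bytes(n_chars, charset):
    """Bytes reserved up front for CHAR(n_chars) in the given charset."""
    return n_chars * MAX_BYTES_PER_CHAR[charset]


if __name__ == "__main__":
    for charset in ("ascii", "latin1", "utf8"):
        print(charset, fixed_width_bytes(10, charset))
```

Under those assumptions, CHAR(10) reserves 10 bytes in latin1 and 30 bytes in utf8, whatever the stored value is.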

By: shlomi

shlomi — Wed, 08 Jul 2009 08:38:44 +0000

Thanks for the correction; I've updated the text. In my opinion, collations should be case sensitive by default; this makes for faster comparisons. It is true that utf8 encodes each ASCII character as a single byte, but MySQL and its storage engines do not necessarily store it that way. For example, if you have CHAR(10) CHARSET utf8, then each such value will take exactly 30 bytes, regardless of content. See also: MySQL’s character sets and collations demystified

By: Mchl

Mchl — Wed, 08 Jul 2009 08:16:05 +0000

Latin1 covers Western European languages. Central Europe is covered by Latin2 CP. 😉

I agree though, utf8 should be introduced as a default encoding, and utf8_general_ci as default collation. AFAIK utf8 stores ASCII characters as single byte values.
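
The single-byte claim is easy to check outside MySQL. The following is a minimal sketch using Python's built-in UTF-8 encoder; the helper name `utf8_byte_lengths` is made up for illustration, and it shows only the encoding itself, not how any storage engine lays out a column.

```python
# Hypothetical helper: byte length of each character when encoded as UTF-8.
def utf8_byte_lengths(text):
    return [len(ch.encode("utf-8")) for ch in text]


if __name__ == "__main__":
    # ASCII letters, then e-acute (U+00E9), then the euro sign (U+20AC).
    for sample in ("hello", "\u00e9", "\u20ac"):
        print(repr(sample), utf8_byte_lengths(sample))
```

ASCII characters come out at one byte each, while the accented letter and the euro sign need two and three bytes respectively.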