Common wrong Data Types compilation

During my work with companies using MySQL, I have encountered many issues with schema design, normalization and indexing. Among the most common errors are incorrect data type definitions. Often the database was designed by programmers or by non-expert DBAs. Some companies do not have the time, and cannot spare the effort, to redesign and refactor their databases, and they eventually run into performance problems.

Here’s a compilation of “the right and the wrong” data types.

INT(1) is not one byte long. INT(10) is no bigger than INT(2). The number in parentheses is misleading: it is only a display width, which affects how the number is padded when displayed in an interactive shell. All the types mentioned are the same INT, with the same storage size and the same range. If you want a one-byte integer, use TINYINT.
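To illustrate the point outside MySQL, here is a minimal Python sketch. It is an analogy, not MySQL's actual storage code: `stored_bytes` and `displayed` are hypothetical helper names, and the little-endian packing is an assumption made only so the example can run. The idea it shows is that the width changes the padding of the printed value, never the four bytes behind it.

```python
import struct


def stored_bytes(value: int) -> bytes:
    """Pack a value into 4 bytes, as a 4-byte signed integer would hold it.

    Assumption: little-endian layout, chosen only for illustration.
    """
    return struct.pack("<i", value)


def displayed(value: int, width: int) -> str:
    """Mimic a display width: left-pad the printed number up to `width`."""
    return str(value).rjust(width)


# The width changes the padding only; the storage size stays at 4 bytes.
for width in (1, 2, 10):
    print(repr(displayed(42, width)), len(stored_bytes(42)))
```

Note that a width smaller than the number of digits does not truncate anything: the value is printed in full.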

An integer PRIMARY KEY is preferable, especially if you’re using the InnoDB storage engine. If possible, avoid using VARCHAR as PRIMARY KEY. In InnoDB, this will make the clustered index deeper, secondary indexes larger (sometimes much larger) and lookups slower.

Do not use VARCHAR to represent timestamps. '2008-11-14 07:59:13' may look like a textual value, but a TIMESTAMP is really just an integer counting the seconds elapsed since 1970-01-01. That’s 4 bytes vs. 19 if you’re using CHAR with an ASCII charset, or more if you’re using UTF8 or VARCHAR.
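The size difference can be checked with a short Python sketch. The helper names `to_epoch_seconds` and `from_epoch_seconds` are hypothetical, and the string is assumed to be in UTC so the example does not depend on the local time zone (MySQL itself converts using the session time zone).

```python
import struct
from datetime import datetime, timezone

FORMAT = "%Y-%m-%d %H:%M:%S"


def to_epoch_seconds(text: str) -> int:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string (assumed UTC) into epoch seconds."""
    moment = datetime.strptime(text, FORMAT).replace(tzinfo=timezone.utc)
    return int(moment.timestamp())


def from_epoch_seconds(seconds: int) -> str:
    """Format epoch seconds back into the textual form (UTC)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(FORMAT)


text = "2008-11-14 07:59:13"
seconds = to_epoch_seconds(text)

# 19 bytes as ASCII text versus 4 bytes as an unsigned 32-bit integer.
print(len(text.encode("ascii")), len(struct.pack("<I", seconds)))
```

The round trip through `from_epoch_seconds` gives back the original string, so nothing is lost by storing the integer.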

Do not use VARCHAR to represent IPv4 addresses. This one is quite common. The IP 192.168.100.255 can be represented with VARCHAR(15), true, but it is better represented with a 4-byte INT UNSIGNED (a signed INT is too small for addresses above 127.255.255.255). That’s what IPv4 is: four bytes. Use the INET_ATON() and INET_NTOA() functions to translate between the integer value and the textual value.
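For readers who want to see the arithmetic, here is a rough Python analogue of the two functions, built on the standard `ipaddress` module. The Python functions `inet_aton` and `inet_ntoa` below are named after the MySQL ones for readability only. They are a sketch and not a drop-in equivalent: Python's parser accepts only full dotted-quad addresses.

```python
import ipaddress


def inet_aton(text: str) -> int:
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(text))


def inet_ntoa(number: int) -> str:
    """Convert a 32-bit integer back to dotted-quad IPv4 text."""
    return str(ipaddress.IPv4Address(number))


value = inet_aton("192.168.100.255")

# 192*2**24 + 168*2**16 + 100*2**8 + 255
print(value)                           # 3232261375
print(len(value.to_bytes(4, "big")))   # 4 bytes, versus up to 15 as text
print(inet_ntoa(value))                # 192.168.100.255
```

The value 3232261375 is larger than 2147483647, the maximum of a signed 4-byte integer, which is why the column needs to be unsigned.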

This one should be obvious, but I’ve seen it in reality, where the schema was auto-generated by some naive generator: do not represent numbers as text. Yes, I have seen integer columns represented by VARCHAR. Don’t ask how the performance was.

MD5() columns shouldn’t be VARCHAR. Use CHAR(32) instead. The value is always 32 characters long, so there is no need for VARCHAR’s additional length-byte overhead. If your tables or database are UTF8 by default, make sure the MD5 column’s charset is ASCII, or it will consume 96 bytes instead of just 32. I also suggest the case-sensitive ascii_bin collation, but that’s a more minor issue.
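A quick Python check of the sizes involved, using the standard `hashlib` module. The input `b"hello"` is an arbitrary example value. The sketch also shows the 16-byte raw digest, which is what a reader suggests storing in BINARY(16) in the comments.

```python
import hashlib

# Arbitrary example input; any bytes would do.
digest = hashlib.md5(b"hello")

hex_form = digest.hexdigest()  # textual form, as returned by MD5() in MySQL
raw_form = digest.digest()     # the underlying 128-bit value

print(len(hex_form), hex_form.isascii())  # 32 True
print(len(raw_form))                      # 16

# The hex text is just an encoding of the raw bytes, two characters per byte.
assert bytes.fromhex(hex_form) == raw_form
```

The hex form uses only the characters 0-9 and a-f, so an ASCII charset is enough for it.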

PASSWORD() columns shouldn’t be VARCHAR, but CHAR. The length depends on whether you’re using the old_passwords variable: 16 characters with old-style hashes, 41 otherwise. (For some strange reason, this variable always appears in the MySQL sample configuration files, though you really don’t want it unless it’s for backward compatibility with older MySQL versions.) As in the MD5 note, use the ASCII charset.

It is better to use TIMESTAMP than INT to count seconds, as MySQL has many supporting functions for this data type.

Use TINYINT, SMALLINT or MEDIUMINT instead of INT when possible. Do you expect to have 4000000000 customers? No? Then an “id SMALLINT” may suffice as PRIMARY KEY.
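To get a feel for the ranges, the following Python sketch computes them from the storage sizes of MySQL's integer types (1, 2, 3, 4 and 8 bytes). The helper `smallest_type` is a hypothetical name introduced for illustration; it picks the narrowest type that can hold a given maximum value.

```python
# Storage size in bytes of each MySQL integer type, narrowest first.
INTEGER_BYTES = {
    "TINYINT": 1,
    "SMALLINT": 2,
    "MEDIUMINT": 3,
    "INT": 4,
    "BIGINT": 8,
}


def unsigned_max(size: int) -> int:
    """Largest value of an unsigned integer stored in `size` bytes."""
    return 2 ** (8 * size) - 1


def signed_range(size: int) -> tuple:
    """(min, max) of a signed integer stored in `size` bytes."""
    half = 2 ** (8 * size - 1)
    return (-half, half - 1)


def smallest_type(max_value: int, unsigned: bool = True) -> str:
    """Name of the narrowest type able to hold values up to `max_value`."""
    for name, size in INTEGER_BYTES.items():
        limit = unsigned_max(size) if unsigned else signed_range(size)[1]
        if max_value <= limit:
            return name
    raise ValueError("value does not fit in any integer type")


for name, size in INTEGER_BYTES.items():
    print(name, size, signed_range(size), unsigned_max(size))
```

For example, `smallest_type(60000)` returns `"SMALLINT"`, since an unsigned SMALLINT goes up to 65535, while 4000000000 needs an unsigned INT.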

Use CHARACTER SETs with care. More on this in future posts.

23 thoughts on “Common wrong Data Types compilation”

x says:

November 18, 2008 at 4:37 pm

4-byte int for IP address is stupid since there are IPv6 addresses out there and sooner or later you will need to support these in your app.

Sarah Sproehnle says:

November 18, 2008 at 5:56 pm

Shlomi,

Great post! I’ve seen many of these mistakes too, especially the misunderstanding of integer display width!

shlomi says:

November 18, 2008 at 7:17 pm

Sarah,
Thank you. Always good to read your mails on the instructors list!

x,
You are right that 4 bytes will not do for IPv6. Nor will the current String representation of IPv4. I have found out that many applications are tightly supporting IPv4 only, anyway. For these applications, using INT is preferable.
I assume (and hope!) that MySQL will provide an INET_NTOA() function for IPv6 represented by BIGINT.
Update: Dennis corrects below that this would be impossible, as an IPv6 address is larger than a BIGINT.

Shlomi

Roland Bouman says:

November 18, 2008 at 9:09 pm

Hi!

Nice post. Just wondering, why the recommendation to store MD5 values as CHAR? Shouldn’t it be a BINARY(16)? I mean, you seem to argue that IP addresses should be stored in their binary (=integer) representation, then why not do the same for these hashes (like PASSWORD too, and SHA1 etc.)?
