Common wrong Data Types compilation

November 18, 2008

During my work with companies using MySQL, I have encountered many issues with regard to schema design, normalization and indexing. Of the most common errors are incorrect data types definition. Many times the database is designed by programmers or otherwise non-expert DBAs. Some companies do not have the time and cannot spare the effort of redesigning and refactoring their databases, and eventually face poor performance issues.

Here's a compilation of "the right and the wrong" data types.

  • INT(1) is not one byte long. INT(10) is no bigger than INT(2). The number in parenthesis is misleading, and only describes the text alignment of the number, when displayed in an interactive shell. All mentioned types are the same INT, have the same storage capacity, and the same range. If you want a one-byte INT, use TINYINT.
  • An integer PRIMARY KEY is preferable, especially if you're using the InnoDB storage engine. If possible, avoid using VARCHAR as PRIMARY KEY. In InnoDB, this will make the clustered index deeper, secondary indexes larger (sometimes much larger) and look ups slower.
  • Do not use VARCHAR to represent timestamps. It may look like '2008-11-14 07:59:13' is a textual field, but in fact it's just an integer counting the seconds elapsed from 1970-01-01. That's 4 bytes vs. 19 if you're using CHAR with ASCII charset, or more if you're using UTF8 or VARCHAR.
  • Do not use VARCHAR to represent IPv4 addresses. This one is quite common. The IP 192.168.100.255 can be represented with VARCHAR(15), true, but could be better represented with a 4-byte int. That's what IPv4 is: four bytes. Use the INET_ATON() and INET_NTOA() functions to translate between the INT value and textual value.
  • This one should be obvious, but I've seen it in reality, where the schema was auto generated by some naive generator: do not represent numbers as text. Yes, I have seen integer columns represented by VARCHAR. Don't ask how the performance was.
  • MD5() columns shouldn't be VARCHAR. Use CHAR(32) instead. It's always 32 bytes long, so no need for VARCHAR's additional byte overhead. If your tables or database are UTF8 by default, make sure the MD5 column's charset is ASCII, or it will consume 96 bytes instead of just 32. I also suggest the case-sensitive ascii_bin collation, but that's a more minor issue.
  • PASSWORD() columns shouldn't be VARCHAR, but CHAR. The length depends on whether you're using old-passwords variable (for some strange reason, this variable always appears in the MySQL sample configuration files - though you really don't want it unless it's for backward compatibility with older MySQL versions). As in the MD5 note, use ASCII charset.
  • Better use TIMESTAMP than INT to count seconds, as MySQL has many supportive functions for this data type.
  • Use TINYINT, SMALLINT, MEDIUMINT instead of INT when possible. Do you expect to have 4000000000 customers? No? Then a "id SMALLINT" may suffice as PRIMARY KEY.
  • Use CHARACTER SETs with care. More on this on future posts.

tags: , ,
posted in MySQL by shlomi

« | »

Follow comments via the RSS Feed | Leave a comment | Trackback URL

23 Comments to "Common wrong Data Types compilation"

  1. x wrote:

    4-byte int for IP address is stupid since there are IPv6 addresses out there and sooner or later you will need to support these in your app.

  2. The Developer Day » Blog Archive » Common wrong Data Types compilation wrote:

    [...] some nice MySQL data types tips how one should create his database schema. This especially counts for intermediate and [...]

  3. Sarah Sproehnle wrote:

    Shlomi,

    Great post! I've seen many of these mistakes too, especially the misunderstanding of integer display width!

  4. shlomi wrote:

    Sarah,
    Thank you. Always good to read your mails on the instructors list!

    x,
    You are right that 4 bytes will not do for IPv6. Nor will the current String representation of IPv4. I have found out that many applications are tightly supporting IPv4 only, anyway. For these applications, using INT is preferable.
    I assume (and hope!) that MySQL will provide an INET_NTOA() function for IPv6 represented by BIGINT.
    -- Update: Dennis corrects below this would be impossible, as IPv6 is larger than BIGINT.

    Shlomi

  5. Roland Bouman wrote:

    Hi!

    Nice post. Just wondering, why the recommendation to store MD5 values as CHAR? Shouldnt it be a BINARY(16)? I mean, you seem to argue that IP addresses should be stored in their binary (=integer) representation, then why not do the same for these hashes (like PASSWORD too, and SHA1 etc.)?

  6. shlomi wrote:

    Hi Roland,
    Good point. I believe that as the common use of MD5 is with its textual form (at least as I experience it), it may be more fitting to represent it as such in the database.
    I agree with your comparison to the IPv4 issue. Perhaps I'm falling in the same trap... I'll need to give it more thought, though it still feels more appropriate for me to use the character representation.

    Shlomi

  7. Dennis Birkholz wrote:

    Hi Shlomi,
    I think there will never be a BIGINT representation of IPv6 addresses as they are 128 bit long and BIGINT is only 64 bit ;-)
    So let's hope for a 128 bit INTEGER type first.

    Greets,
    Dennis

  8. Arjen Lentz wrote:

    I don't recommend MEDIUMINT (which is an odd 3-byter) and internally InnoDB would store it as a 4-byte INT anyway.

  9. shlomi wrote:

    Dennis,
    Ooops ;) Thanks for the correction!

    Arjen,
    Thank you, didn't know that about InnoDB.

  10. Log Buffer #124: a Carnival of the Vanities for DBAs wrote:

    [...] code.openark.org’s Shlomi Noach also was in the business of setting developers’ heads straight (and those of others besides) with his common wrong data types compilation. [...]

  11. Aaron Suggs wrote:

    Roland and Shiomi,

    I store MD5 sums as binary(16). The HEX() and UNHEX() functions allow you to convert between binary data and the application-friendly hexadecimal string.

    HEX() converts binary data to a string.
    UNHEX() converts a string to binary data.

    I recommend storing other fixed-length hexadecimal strings as binary data. E.g., UUIDs are binary(16), and SHA1 hashes are binary(20).

    Here's an example of a table with binary(16) column for MD5 hashes:

    mysql> create table t(md5 binary(16)) engine=innodb;

    To store the MD5 "f7699c9e31b648197a20805989ac0db8", use the UNHEX function:

    mysql> insert into t values (unhex("f7699c9e31b648197a20805989ac0db8"));
    Query OK, 1 row affected (0.00 sec)

    And retrieve it as a string using the HEX function:

    mysql> select hex(md5) from t;
    +----------------------------------+
    | hex(md5) |
    +----------------------------------+
    | F7699C9E31B648197A20805989AC0DB8 |
    +----------------------------------+
    1 row in set (0.00 sec)

    Tada!

    Cheers,
    Aaron

  12. shlomi wrote:

    Aaron - thanks!

    Shlomi

  13. מסד נתונים - הגדרת סוגי שדות נכונים בהתאם לשדה | ואדים גבריאל wrote:

    [...] קראתי פוסט של שלומי נוח לגבי הטעויות הנפוצות בהגדרת סוגי השדות למסד הנתונים. [...]

  14. Garrett W. wrote:

    The only thing about storing the MD5s as binary is that you have to remember to do the HEX()/UNHEX() to use them as text in your application - a minor difficulty for sure, but an invitation for unintentional programming bugs.

    Maybe.

  15. Designing relational databases for PHP in MySQL | The Struggle wrote:

    [...] choosing the wrong data type [...]

  16. annie wrote:

    It seems that all the original arguments from shlomi where
    refuted. When it comes to IT if you have 5 people you most likely end up with 7 opinions!!! However the recommendation are still valid for a starting point the rest is preference.

    Ann

  17. shlomi wrote:

    Hi Ann,

    Not really :)
    I stand corrected on hexadecimal values;
    IPv4 is still *very much* widespread;
    MEDIUMINT is undesired on InnoDB.

    The arguments are still very much valid. Not as starting point, but as start-to-end argument.

    (Though I am very much susceptible to error and will gladly stand corrected on any point I make)

    Regards

  18. annie wrote:

    Hi,

    In my experience having smallint vs int was more to save space but does not necessarily influence performance at least not something noticeable. Many times I come across some white papers where people are trying to prove a point, I run the test and the method they are trying to prove wrong has better performance than the one they are in favor. I have done such test even with the white paper coming out of the company that created the software.

    Most of the time, the approach to design / architect databases or application depends on what you are trying to accomplish and what kind of usage the system will be for. Doing consulting for many many years, I've come across multiple approaches and styles and if sit down and analyze each one of them they all would have valid justification for their approach or design. It is funny but the more experienced I've gotten the less adamant and less critical I have become. This does not mean that I am in favor of carelessness and the obvious.

    Thanks,

    Ann

  19. shlomi wrote:

    Hi Ann,

    Thanks for sharing your experience.
    In two customer cases I've had, and in which we've changed data types throughout the schema, we've reached 50% savings in disk space, just due to proper data types + charsets.
    I personally see this as basic and important part for auditing.

    Obviously, there are other parameters to consider. But I think laying out the basics properly, gives boost to further steps.

    Regards.
    Shlomi

  20. danilo wrote:

    Hi, very interesting blog!
    I would ask: how about BIT(1) to store boolean value, for example for field as 'active', that can be true or false?

    Btw, thanks for this great post!
    danilo

  21. shlomi wrote:

    @Danilo,
    See: Choosing MySQL boolean data types.
    There are good insights in the user comments on that post.
    Basically - BIT(1) is not a good option.

  22. rob wrote:

    Very helpful info...

    In the case of a 32-bit hexadecimal value (UUID), what's the advantage of binary(16) versus char(32)? If you have to use unhex() and hex() calls everytime you read or write the values, would that overhead make char(32) perform better?

  23. shlomi wrote:

    @rob,
    A binary(16) requires less storage. The CPU overhead is nowadays neglectable.

Leave Your Comment

 
Powered by Wordpress and MySQL. Theme by openark.org