During my work with companies using MySQL, I have encountered many issues with regard to schema design, normalization and indexing. Of the most common errors are incorrect data types definition. Many times the database is designed by programmers or otherwise non-expert DBAs. Some companies do not have the time and cannot spare the effort of redesigning and refactoring their databases, and eventually face poor performance issues.
Here’s a compilation of “the right and the wrong” data types.
INT(1)
is not one byte long.INT(10)
is no bigger thanINT(2)
. The number in parenthesis is misleading, and only describes the text alignment of the number, when displayed in an interactive shell. All mentioned types are the same INT, have the same storage capacity, and the same range. If you want a one-byteINT
, useTINYINT
.
- An integer
PRIMARY KEY
is preferable, especially if you’re using the InnoDB storage engine. If possible, avoid usingVARCHAR
asPRIMARY KEY
. In InnoDB, this will make the clustered index deeper, secondary indexes larger (sometimes much larger) and look ups slower.
- Do not use
VARCHAR
to represent timestamps. It may look like'2008-11-14 07:59:13'
is a textual field, but in fact it’s just an integer counting the seconds elapsed from 1970-01-01. That’s 4 bytes vs. 19 if you’re usingCHAR
withASCII
charset, or more if you’re usingUTF8
orVARCHAR
.
- Do not use
VARCHAR
to represent IPv4 addresses. This one is quite common. The IP 192.168.100.255 can be represented withVARCHAR(15)
, true, but could be better represented with a 4-byte int. That’s what IPv4 is: four bytes. Use theINET_ATON()
andINET_NTOA()
functions to translate between the INT value and textual value.
- This one should be obvious, but I’ve seen it in reality, where the schema was auto generated by some naive generator: do not represent numbers as text. Yes, I have seen integer columns represented by
VARCHAR
. Don’t ask how the performance was.
MD5()
columns shouldn’t beVARCHAR
. UseCHAR(32)
instead. It’s always 32 bytes long, so no need forVARCHAR
‘s additional byte overhead. If your tables or database areUTF8
by default, make sure the MD5 column’s charset isASCII
, or it will consume 96 bytes instead of just 32. I also suggest the case-sensitiveascii_bin
collation, but that’s a more minor issue.
PASSWORD()
columns shouldn’t beVARCHAR
, butCHAR
. The length depends on whether you’re usingold-passwords
variable (for some strange reason, this variable always appears in the MySQL sample configuration files – though you really don’t want it unless it’s for backward compatibility with older MySQL versions). As in the MD5 note, useASCII
charset.
- Better use
TIMESTAMP
thanINT
to count seconds, as MySQL has many supportive functions for this data type.
- Use
TINYINT
,SMALLINT
,MEDIUMINT
instead ofINT
when possible. Do you expect to have 4000000000 customers? No? Then a “id SMALLINT
” may suffice asPRIMARY KEY
.
- Use
CHARACTER SET
s with care. More on this on future posts.
It seems that all the original arguments from shlomi where
refuted. When it comes to IT if you have 5 people you most likely end up with 7 opinions!!! However the recommendation are still valid for a starting point the rest is preference.
Ann
Hi Ann,
Not really π
I stand corrected on hexadecimal values;
IPv4 is still *very much* widespread;
MEDIUMINT is undesired on InnoDB.
The arguments are still very much valid. Not as starting point, but as start-to-end argument.
(Though I am very much susceptible to error and will gladly stand corrected on any point I make)
Regards
Hi,
In my experience having smallint vs int was more to save space but does not necessarily influence performance at least not something noticeable. Many times I come across some white papers where people are trying to prove a point, I run the test and the method they are trying to prove wrong has better performance than the one they are in favor. I have done such test even with the white paper coming out of the company that created the software.
Most of the time, the approach to design / architect databases or application depends on what you are trying to accomplish and what kind of usage the system will be for. Doing consulting for many many years, I’ve come across multiple approaches and styles and if sit down and analyze each one of them they all would have valid justification for their approach or design. It is funny but the more experienced I’ve gotten the less adamant and less critical I have become. This does not mean that I am in favor of carelessness and the obvious.
Thanks,
Ann
Hi Ann,
Thanks for sharing your experience.
In two customer cases I’ve had, and in which we’ve changed data types throughout the schema, we’ve reached 50% savings in disk space, just due to proper data types + charsets.
I personally see this as basic and important part for auditing.
Obviously, there are other parameters to consider. But I think laying out the basics properly, gives boost to further steps.
Regards.
Shlomi
Hi, very interesting blog!
I would ask: how about BIT(1) to store boolean value, for example for field as ‘active’, that can be true or false?
Btw, thanks for this great post!
danilo