Risking some flames, I’d like to suggest only two options for choosing PRIMARY KEYs for InnoDB tables. I suggest they should cover 99% (throwing numbers around) of cases.
PRIMARY KEY cases
- An integer (SMALLINT / INT / BIGINT), possibly AUTO_INCREMENT column.
- The combination of two columns on a many-to-many connecting table (e.g. film_actor, which connects films to actors), the two columns being the PRIMARY KEYs of respective data tables. This rule may be extended to 3-way relation tables.
A short recap: an InnoDB must have a PRIMARY KEY. It will pick one if you don’t offer it. It can pick a really bad UNIQUE KEY (e.g. website_url(255)) or make one up using InnoDB internal row ids. If you don’t have a good candidate, an AUTO_INCREMENT PRIMARY KEY is probably the easiest way out.
A 2-column combination for a many-to-many connection table is common and viable. The PRIMARY KEY will not only provide with good join access method, but will also provide with the required UNIQUE constraint.
An integer-based PRIMARY KEY will make for more compact & shallow index tree structures, which leads to less I/O and page reads.
An AUTO_INCREMENT will allow for ascending PRIMARY KEY order of INSERT, which is InnoDB-friendly: index pages will be more utilized, less fragmented.
Exceptions
- You have a partitioned table, e.g. on date range. With partitioned tables, every UNIQUE KEY, including the PRIMARY KEY, must include partitioning columns. In such case you will have to extend the PRIMARY KEY.
- The only key on your table is a unique constraint on some column, e.g. UNIQUE KRY (url). On one hand, it seems wasteful to create another column (e.g. AUTO_INCREMENT) to use as PRIMARY KEY. On the other hand, I’ve seen many cases where this kind of PK didn’t hold up. At some point there was need for another index. Or some method had to be devised for chunking up table data (oak-chunk-update can do that even with non-integer PKs). I’m reluctant to use such keys as PRIMARY.
- I’m sure there are others.
Umm…
I wrote the draft for this post a while ago. And then came Domas and ruined it. Wait for 5.1.52?
There are cases where selecting a non-obvious primary key massively improves performance, by having the rows you’re likely to be interested in clustered.
Take the case where you have some kind of Message ID which is generated in a fairly arbitrary fashion; this on its own is a valid candidate key,
However, if we add the thing that customers are likely to search on (the customer ID) as the first part of the primary key, and make it (customerID, messageID), then the customer’s messages all cluster together, in the same pages, which massively reduces the number of IOPS on these queries.
This is, of course, a special case. Typical OLTP-type apps don’t need to care about it, nor does anyone else whose database fits in memory.
Mark