Two storage engines; different plans, Part II

In Part I of this article, we have seen how the internal structure of the storage engine’s index can affect an execution plan. We’ve seen that some plans are inherent to the way engines are implemented.

We wish to present a second scenario in which execution plans vary for different storage engines. Again, we will consider MyISAM and InnoDB. Again, we will use the world database for testing. This time, we will see how confident the storage engines are in their index search capabilities.

Many newcomers to databases often believe that an index search is always preferable to full table scan. This is not the case. If I were to look for 10 rows in a 1,000,000 rows table, using an indexed column – I could benefit from an index search. However, if I’m looking for 200,000 rows on that table (that’s 20% of the rows) – an index search can actually be much more expensive than a full table scan.

There are several points to consider here: a full table scan is often close to sequential, whereas an index traversal is not. Not only are the index nodes stored non sequentially, but the links from the index to table data may look like a macaroni plate. Also, the index structure itself is a tree-structure, and it can be shown that the number of pages in the index can be larger than the number of pages in the table. Even for partial index scans, it may be worthwhile to simply scan the table.

The threshold above which table scan is preferred is somewhere between 10% and 30% in common DBMS.

We will consider here a scenario where we index a two-valued column, a simple ‘T’ and ‘F’ enum. “That’s a very poor column to index”, you may say. But what if the ratio between the two values is high? Say, 1000:1? Should there be different search plans for the ‘F’ valued rows and for the ‘T’ valued rows?

Let us duplicate the CountryLanguage table, and make it much larger. We will create a table named “cl”, with some 125K rows.

mysql> SHOW CREATE TABLE CountryLanguage \G *************************** 1. row *************************** Table: CountryLanguage Create Table: CREATE TABLE `CountryLanguage` ( `CountryCode` char(3) NOT NULL default '', `Language` char(30) NOT NULL default '', `IsOfficial` enum('T','F') NOT NULL default 'F', `Percentage` float(4,1) NOT NULL default '0.0', PRIMARY KEY (`CountryCode`,`Language`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1 1 row in set (0.00 sec)

mysql> CREATE TABLE cl SELECT * FROM CountryLanguage; Query OK, 984 rows affected (0.02 sec) Records: 984 Duplicates: 0 Warnings: 0

And now make it very large:

mysql> INSERT INTO cl SELECT * FROM cl; Query OK, 984 rows affected (0.02 sec) Records: 984 Duplicates: 0 Warnings: 0

…

mysql> INSERT INTO cl SELECT * FROM cl; Query OK, 62976 rows affected (0.08 sec) Records: 62976 Duplicates: 0 Warnings: 0

mysql> UPDATE cl SET IsOfficial='F'; Query OK, 1265 rows affected (0.23 sec) Rows matched: 125952 Changed: 1265 Warnings: 0

mysql> UPDATE cl SET IsOfficial='T' WHERE RAND()<0.001; Query OK, 148 rows affected (0.20 sec) Rows matched: 148 Changed: 148 Warnings: 0

We have now a large table, where the majority of rows have ‘F’ values for ‘IsOfficial’, and the minority have ‘T’. We shall now add an index on this column, and will then make sure the table is in MyISAM (it may be created with another storage engine, depending on our default engine parameter).

mysql> ALTER TABLE cl ADD INDEX (IsOfficial); Query OK, 125952 rows affected (0.31 sec) Records: 125952 Duplicates: 0 Warnings: 0

mysql> ALTER TABLE cl ENGINE=MyISAM; Query OK, 125952 rows affected (1.21 sec) Records: 125952 Duplicates: 0 Warnings: 0

Now let us compare the search plans for ‘F’ and for ‘T’.

mysql> EXPLAIN SELECT * FROM cl WHERE IsOfficial='F' \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: cl type: ALL possible_keys: IsOfficial key: NULL key_len: NULL ref: NULL rows: 94464 Extra: Using where 1 row in set (0.02 sec)

What MyISAM decided was that an index search on the ‘F’ rows is useless. A table scan was deemed to be preferable. However, for ‘T’ values rows, the index we created was just fine, and would indeed be used.

InnoDB will state differently.

mysql> ALTER TABLE cl ENGINE=InnoDB; Query OK, 125952 rows affected (1.07 sec) Records: 125952 Duplicates: 0 Warnings: 0

mysql> EXPLAIN SELECT * FROM cl WHERE IsOfficial='F' \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: cl type: ref possible_keys: IsOfficial key: IsOfficial key_len: 1 ref: const rows: 61667 Extra: Using where 1 row in set (0.00 sec)

mysql> EXPLAIN SELECT * FROM cl WHERE IsOfficial='T' \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: cl type: ref possible_keys: IsOfficial key: IsOfficial key_len: 1 ref: const rows: 148 Extra: Using where 1 row in set (0.00 sec)
On the ‘T’ search, MyISAM and InnoDB agree. But look at the plan for the ‘F’ rows: InnoDB still prefers an index search to table scan, even though it estimates a lookup on 50% of the rows.

The behavior just exposed is not entirely consistent. InnoDB and MyISAM differ in the way they update the index statistics. While ANALYZE TABLE on MyISAM performs an exaustive search on index values, InnoDB will only do 10 random test dives and return with a rough calculation. In fact, InnDB’s estimations can greatly vary from the real values distribution, and successive calls to ANALYZE table can produce varying results.

What has been presented in this part is not a rule to live by. You shouldn’t base your queries or expected behavior on the index distribution or search plan calculated by the storage engine. These may change in time. What’s instructive here is the freedom MySQL gives the storage engines in decision making, and the different actions taken when dealing with different engines.

3 thoughts on “Two storage engines; different plans, Part II”

Pingback: Log Buffer #123: a Carnival of the Vanities for DBAs
Pingback: What MySQL Execution Plan Depends On | SimonOnSoftware
Pingback: SQL: finding a user’s country/region based on IP | code.openark.org

This site uses Akismet to reduce spam. Learn how your comment data is processed.

3 thoughts on “Two storage engines; different plans, Part II”

Leave a Reply