Two storage engines; different plans, Part II

In Part I of this article, we have seen how the internal structure of the storage engine’s index can affect an execution plan. We’ve seen that some plans are inherent to the way engines are implemented.

We wish to present a second scenario in which execution plans vary for different storage engines. Again, we will consider MyISAM and InnoDB. Again, we will use the world database for testing. This time, we will see how confident the storage engines are in their index search capabilities.

Many newcomers to databases often believe that an index search is always preferable to full table scan. This is not the case. If I were to look for 10 rows in a 1,000,000 rows table, using an indexed column – I could benefit from an index search. However, if I’m looking for 200,000 rows on that table (that’s 20% of the rows) – an index search can actually be much more expensive than a full table scan.

There are several points to consider here: a full table scan is often close to sequential, whereas an index traversal is not. Not only are the index nodes stored non sequentially, but the links from the index to table data may look like a macaroni plate. Also, the index structure itself is a tree-structure, and it can be shown that the number of pages in the index can be larger than the number of pages in the table. Even for partial index scans, it may be worthwhile to simply scan the table.

The threshold above which table scan is preferred is somewhere between 10% and 30% in common DBMS.

We will consider here a scenario where we index a two-valued column, a simple ‘T’ and ‘F’ enum. “That’s a very poor column to index”, you may say. But what if the ratio between the two values is high? Say, 1000:1? Should there be different search plans for the ‘F’ valued rows and for the ‘T’ valued rows?

Let us duplicate the CountryLanguage table, and make it much larger. We will create a table named “cl”, with some 125K rows.

mysql> SHOW CREATE TABLE CountryLanguage \G
*************************** 1. row ***************************
Table: CountryLanguage
Create Table: CREATE TABLE `CountryLanguage` (
`CountryCode` char(3) NOT NULL default '',
`Language` char(30) NOT NULL default '',
`IsOfficial` enum('T','F') NOT NULL default 'F',
`Percentage` float(4,1) NOT NULL default '0.0',
PRIMARY KEY  (`CountryCode`,`Language`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> CREATE TABLE cl SELECT * FROM CountryLanguage;
Query OK, 984 rows affected (0.02 sec)
Records: 984  Duplicates: 0  Warnings: 0

And now make it very large:

mysql> INSERT INTO cl SELECT * FROM cl;
Query OK, 984 rows affected (0.02 sec)
Records: 984  Duplicates: 0  Warnings: 0

mysql> INSERT INTO cl SELECT * FROM cl;
Query OK, 62976 rows affected (0.08 sec)
Records: 62976  Duplicates: 0  Warnings: 0

mysql> UPDATE cl SET IsOfficial='F';
Query OK, 1265 rows affected (0.23 sec)
Rows matched: 125952  Changed: 1265  Warnings: 0

mysql> UPDATE cl SET IsOfficial='T' WHERE RAND()<0.001;
Query OK, 148 rows affected (0.20 sec)
Rows matched: 148  Changed: 148  Warnings: 0

We have now a large table, where the majority of rows have ‘F’ values for ‘IsOfficial’, and the minority have ‘T’. We shall now add an index on this column, and will then make sure the table is in MyISAM (it may be created with another storage engine, depending on our default engine parameter).

mysql> ALTER TABLE cl ADD INDEX (IsOfficial);
Query OK, 125952 rows affected (0.31 sec)
Records: 125952  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE cl ENGINE=MyISAM;
Query OK, 125952 rows affected (1.21 sec)
Records: 125952  Duplicates: 0  Warnings: 0

Now let us compare the search plans for ‘F’ and for ‘T’.

mysql> EXPLAIN SELECT * FROM cl WHERE IsOfficial='F' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: cl
type: ALL
possible_keys: IsOfficial
key: NULL
key_len: NULL
ref: NULL
rows: 94464
Extra: Using where
1 row in set (0.02 sec)

mysql> EXPLAIN SELECT * FROM cl WHERE IsOfficial='T' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: cl
type: ref
possible_keys: IsOfficial
key: IsOfficial
key_len: 1
ref: const
rows: 138
Extra: Using where
1 row in set (0.00 sec)

What MyISAM decided was that an index search on the ‘F’ rows is useless. A table scan was deemed to be preferable. However, for ‘T’ values rows, the index we created was just fine, and would indeed be used.

InnoDB will state differently.

mysql> ALTER TABLE cl ENGINE=InnoDB;
Query OK, 125952 rows affected (1.07 sec)
Records: 125952  Duplicates: 0  Warnings: 0

mysql> EXPLAIN SELECT * FROM cl WHERE IsOfficial='F' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: cl
type: ref
possible_keys: IsOfficial
key: IsOfficial
key_len: 1
ref: const
rows: 61667
Extra: Using where
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT * FROM cl WHERE IsOfficial='T' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: cl
type: ref
possible_keys: IsOfficial
key: IsOfficial
key_len: 1
ref: const
rows: 148
Extra: Using where
1 row in set (0.00 sec)

On the ‘T’ search, MyISAM and InnoDB agree. But look at the plan for the ‘F’ rows: InnoDB still prefers an index search to table scan, even though it estimates a lookup on 50% of the rows.

The behavior just exposed is not entirely consistent. InnoDB and MyISAM differ in the way they update the index statistics. While ANALYZE TABLE on MyISAM performs an exaustive search on index values, InnoDB will only do 10 random test dives and return with a rough calculation. In fact, InnDB’s estimations can greatly vary from the real values distribution, and successive calls to ANALYZE table can produce varying results.

What has been presented in this part is not a rule to live by. You shouldn’t base your queries or expected behavior on the index distribution or search plan calculated by the storage engine. These may change in time. What’s instructive here is the freedom MySQL gives the storage engines in decision making, and the different actions taken when dealing with different engines.

3 thoughts on “Two storage engines; different plans, Part II

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.