Execution plan – code.openark.org
Blog by Shlomi Noach

Speaking on Percona Live, London: “Programmatic Queries: things you can code with SQL”
https://shlomi-noach.github.io/blog/mysql/speaking-on-percona-live-london-programmatic-queries-things-you-can-code-with-sql
Mon, 10 Oct 2011

I’ll be speaking at the Percona Live event, held in London, October 24-25, 2011.

My session is called Programmatic Queries: things you can code with SQL. It’s a short 30 minute talk, in which I present underlying knowledge of the programmatic nature of SQL queries within MySQL, and how to take advantage of such knowledge so as to build faster, shorter, and sometimes unexpected queries.

This is not about stored routine programming, a classic programmatic aspect of MySQL, but rather about expected order of execution: of row evaluation, of control flow statements, of table inference, of time issues.

I have far too many examples, some real-world problem solvers and some less common in daily use, to be able to deliver them all in this session. I will pick those that seem most interesting to me, or those that best present the programmatic nature of the query. As time allows, I may add more examples, or look into interesting future possibilities.

I hope to see you there.

Views: better performance with condition pushdown
https://shlomi-noach.github.io/blog/mysql/views-better-performance-with-condition-pushdown
Thu, 20 May 2010

Justin’s A workaround for the performance problems of TEMPTABLE views post on mysqlperformanceblog.com reminded me of a solution I once saw on a customer’s site.

The customer was using a nested view structure, up to a depth of some 8-9 views. There was a lot of aggregation along the way, and even the simplest query resulted in a LOT of subqueries, temporary tables, and vast amounts of data, even when only a couple of rows were eventually returned.

While we worked to solve this, a developer showed me his own trick. His trick is now impossible to implement, but there’s a hack around this.

Let’s use the world database to illustrate. Look at the following view definition:

CREATE
  ALGORITHM=TEMPTABLE
VIEW country_languages AS
  SELECT
    Country.CODE, Country.Name AS country,
    GROUP_CONCAT(CountryLanguage.Language) AS languages
  FROM
    world.Country
    JOIN world.CountryLanguage ON (Country.CODE = CountryLanguage.CountryCode)
  GROUP BY
    Country.CODE;

The view presents a list of spoken languages per country. The execution plan for querying this view looks like this:

mysql> EXPLAIN SELECT * FROM country_languages;
+----+-------------+-----------------+--------+---------------+---------+---------+-----------------------------------+------+----------------------------------------------+
| id | select_type | table           | type   | possible_keys | key     | key_len | ref                               | rows | Extra                                        |
+----+-------------+-----------------+--------+---------------+---------+---------+-----------------------------------+------+----------------------------------------------+
|  1 | PRIMARY     | <derived2>      | ALL    | NULL          | NULL    | NULL    | NULL                              |  233 |                                              |
|  2 | DERIVED     | CountryLanguage | index  | PRIMARY       | PRIMARY | 33      | NULL                              |  984 | Using index; Using temporary; Using filesort |
|  2 | DERIVED     | Country         | eq_ref | PRIMARY       | PRIMARY | 3       | world.CountryLanguage.CountryCode |    1 |                                              |
+----+-------------+-----------------+--------+---------------+---------+---------+-----------------------------------+------+----------------------------------------------+

And, even if we only want to filter out a single country, we still get the same plan:

mysql> EXPLAIN SELECT * FROM country_languages WHERE Code='USA';
+----+-------------+-----------------+--------+---------------+---------+---------+-----------------------------------+------+----------------------------------------------+
| id | select_type | table           | type   | possible_keys | key     | key_len | ref                               | rows | Extra                                        |
+----+-------------+-----------------+--------+---------------+---------+---------+-----------------------------------+------+----------------------------------------------+
|  1 | PRIMARY     | <derived2>      | ALL    | NULL          | NULL    | NULL    | NULL                              |  233 | Using where                                  |
|  2 | DERIVED     | CountryLanguage | index  | PRIMARY       | PRIMARY | 33      | NULL                              |  984 | Using index; Using temporary; Using filesort |
|  2 | DERIVED     | Country         | eq_ref | PRIMARY       | PRIMARY | 3       | world.CountryLanguage.CountryCode |    1 |                                              |
+----+-------------+-----------------+--------+---------------+---------+---------+-----------------------------------+------+----------------------------------------------+

So, we need to scan the entire CountryLanguage and Country tables in order to return the results for just one country.

A non-working solution

The solution offered by the developer was this:

CREATE
  ALGORITHM=MERGE
  VIEW country_languages_non_working AS
  SELECT
    Country.CODE, Country.Name AS country,
    GROUP_CONCAT(CountryLanguage.Language) AS languages
  FROM
    world.Country
    JOIN world.CountryLanguage ON
      (Country.CODE = CountryLanguage.CountryCode)
  WHERE
    Country.CODE = @country_code
  GROUP BY Country.CODE;

Followed by:

mysql> SET @country_code='USA';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT * FROM country_languages_non_working;
+------+---------------+----------------------------------------------------------------------------------------------------+
| CODE | country       | languages                                                                                          |
+------+---------------+----------------------------------------------------------------------------------------------------+
| USA  | United States | Chinese,English,French,German,Italian,Japanese,Korean,Polish,Portuguese,Spanish,Tagalog,Vietnamese |
+------+---------------+----------------------------------------------------------------------------------------------------+

So: push down a WHERE condition into the view’s definition. The session variable @country_code is used to filter rows. In the simplified code above the value is assumed to be set; tweak it as you see fit (using IFNULL, for example, or an OR condition) to allow for a full scan in case the variable is undefined.

This doesn’t work. It used to work a couple years back; but today you cannot create a view which uses session variables or parameters. It is a restriction imposed by views.
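To illustrate the restriction, here is what such an attempt looks like today (an illustrative session; the view name is mine, and the exact error wording may vary by version):

```sql
mysql> CREATE VIEW v_by_session_code AS
    ->   SELECT * FROM world.Country WHERE Code = @country_code;
ERROR 1351 (HY000): View's SELECT contains a variable or parameter
```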

A workaround

Justin showed a workaround using an additional table. There is another workaround which does not involve tables, but rather stored routines. Now, this is a patch, and an ugly one. It may not work in future versions of MySQL for all I know. But, here it goes:

DELIMITER $$
CREATE DEFINER=`root`@`localhost` FUNCTION `get_session_country`() RETURNS CHAR(3)
    NO SQL
    DETERMINISTIC
BEGIN
  RETURN @country_code;
END $$
DELIMITER ;

CREATE
  ALGORITHM=MERGE
  VIEW country_languages_2 AS
  SELECT
    Country.CODE, Country.Name AS country,
    GROUP_CONCAT(CountryLanguage.Language) AS languages
  FROM
    world.Country
    JOIN world.CountryLanguage ON
      (Country.CODE = CountryLanguage.CountryCode)
  WHERE
    Country.CODE = get_session_country()
  GROUP BY Country.CODE;

And now:

mysql> SET @country_code='USA';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT * FROM country_languages_2;
+------+---------------+----------------------------------------------------------------------------------------------------+
| CODE | country       | languages                                                                                          |
+------+---------------+----------------------------------------------------------------------------------------------------+
| USA  | United States | Chinese,English,French,German,Italian,Japanese,Korean,Polish,Portuguese,Spanish,Tagalog,Vietnamese |
+------+---------------+----------------------------------------------------------------------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> EXPLAIN SELECT * FROM country_languages_2;
+----+-------------+-----------------+--------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table           | type   | possible_keys | key     | key_len | ref  | rows | Extra                    |
+----+-------------+-----------------+--------+---------------+---------+---------+------+------+--------------------------+
|  1 | PRIMARY     | <derived2>      | system | NULL          | NULL    | NULL    | NULL |    1 |                          |
|  2 | DERIVED     | Country         | const  | PRIMARY       | PRIMARY | 3       |      |    1 |                          |
|  2 | DERIVED     | CountryLanguage | ref    | PRIMARY       | PRIMARY | 3       |      |    8 | Using where; Using index |
+----+-------------+-----------------+--------+---------------+---------+---------+------+------+--------------------------+

Since views are allowed to call stored functions (Justin used this to call CONNECTION_ID()), and since stored functions can read session variables, we can take advantage of this and force the view to filter out irrelevant rows before they accumulate into temporary tables and big joins.
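The earlier fallback-to-full-scan tweak combines with this workaround as well. A sketch (the view name and the OR condition are my own; note that the function is then evaluated twice per row):

```sql
CREATE
  ALGORITHM=MERGE
  VIEW country_languages_all_or_one AS
  SELECT
    Country.CODE, Country.Name AS country,
    GROUP_CONCAT(CountryLanguage.Language) AS languages
  FROM
    world.Country
    JOIN world.CountryLanguage ON
      (Country.CODE = CountryLanguage.CountryCode)
  WHERE
    -- When @country_code is unset, the IS NULL branch matches every row,
    -- and the view degrades gracefully to a full scan.
    Country.CODE = get_session_country()
    OR get_session_country() IS NULL
  GROUP BY Country.CODE;
```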

Back in the customer’s office, with their real data and multiple views, we witnessed a reduction in query times from ~30 minutes to a few seconds.

Another kind of use

Eventually we worked to make better view definitions and query splitting, resulting in clearer code and fast queries, but this solution plays nicely into another kind of problem:

Can we make different customers see different parts of a given table? E.g., only those rows that relate to that customer?

There can be many solutions: different tables, multiple views (one per customer), stored procedures, what have you. The above provides a solution, and I’ve seen it in use.
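A sketch of that use, with a hypothetical schema (the orders table, column names, and function name are all my own): each customer’s session sets its own identifier, and a single shared view filters accordingly.

```sql
DELIMITER $$
CREATE FUNCTION get_session_customer_id() RETURNS INT UNSIGNED
    NO SQL
    DETERMINISTIC
BEGIN
  RETURN @customer_id;
END $$
DELIMITER ;

CREATE
  ALGORITHM=MERGE
  VIEW my_orders AS
  SELECT *
  FROM orders
  WHERE customer_id = get_session_customer_id();

-- In each customer's session:
SET @customer_id = 12345;
SELECT * FROM my_orders;
```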

EXPLAIN: missing db info
https://shlomi-noach.github.io/blog/mysql/explain-missing-db-info
Tue, 11 May 2010

I’m further developing a general log hook, which can stream queries from the general log.

A particular direction I’m taking is to filter queries by their type of actions. For example, the tool (oak-hook-general-log) can be instructed to only stream out those queries which involve creation of a temporary table, or those which cause a filesort, a full index scan, etc.

This is done by evaluating query execution plans on the fly. I suspect the MySQL query analyzer roughly does the same (as a small part of what it does).

There’s almost nothing here one couldn’t do with sed/awk. However, I bumped into a couple of problems:

  1. The general log (and the mysql.general_log table, in particular) does not indicate the particular database one is using for the query. Since the slow log does include this data, I filed a bug on this. I currently solve this by crossing the connection id with the process list, where the current database is listed. It’s shaky, but mostly works.
  2. Just realized: there’s no DB info in the EXPLAIN output! It’s weird, since I’ve been EXPLAINing queries for years now. But I’ve always had the advantage of knowing the schema in use: either because I was manually executing the query on a known schema, or because mk-query-digest was kind enough to let me know.
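The crossing mentioned in the first item can be sketched like this (assuming log_output includes TABLE; shaky, as said, since it only works while the connection is still alive, and the thread may have since switched databases):

```sql
SELECT
  gl.event_time, gl.thread_id, pl.DB AS current_db, gl.argument
FROM
  mysql.general_log AS gl
  -- The general log records the thread (connection) id;
  -- the processlist knows that thread's current database.
  JOIN INFORMATION_SCHEMA.PROCESSLIST AS pl ON (gl.thread_id = pl.ID)
WHERE
  gl.command_type = 'Query';
```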

For example, look at the following imaginary query, involving both the world and sakila databases:

mysql> use test;
Database changed
mysql> EXPLAIN SELECT * FROM world.Country JOIN sakila.city WHERE Country.Capital = city.city_id;
+----+-------------+---------+--------+---------------+---------+---------+-----------------------+------+-------------+
| id | select_type | table   | type   | possible_keys | key     | key_len | ref                   | rows | Extra       |
+----+-------------+---------+--------+---------------+---------+---------+-----------------------+------+-------------+
|  1 | SIMPLE      | Country | ALL    | NULL          | NULL    | NULL    | NULL                  |  239 |             |
|  1 | SIMPLE      | city    | eq_ref | PRIMARY       | PRIMARY | 2       | world.Country.Capital |    1 | Using where |
+----+-------------+---------+--------+---------------+---------+---------+-----------------------+------+-------------+
2 rows in set (0.00 sec)

The query is imaginary, since the tables don’t actually have anything in common. But look at the EXPLAIN result: can you tell where city came from? Country can somehow be inferred from the ref column, but there’s no help on city.

Moreover, table aliases show in the EXPLAIN plan (which is good), but with no reference to the original table.

So, is it back to parsing the SQL query? I’m reluctant to do that. It’s error prone, and one needs to implement, or use, a good parser which also accepts the MySQL dialect. I haven’t looked into this yet.

I’m currently at a standstill with regard to automated query execution plan evaluation where database cannot be determined.

Things to monitor on MySQL, the user’s perspective
https://shlomi-noach.github.io/blog/mysql/things-to-monitor-on-mysql-the-users-perspective
Wed, 10 Mar 2010

Working on mycheckpoint, I have the intention of adding custom monitoring. That is, letting the user define things to monitor. I have my own thoughts, but I would be grateful for more input!

What would the user want to monitor?

Monitoring the number of SELECT statements per second, InnoDB locks, slave replication lag etc. is very important, and monitoring utilities provide this information. But what does that tell the end user? Not much.

The experienced DBA may gain a lot from it. The user would be more interested in a completely different kind of information. In between, some information is relevant to both.

Say we were managing an on-line store. We want to monitor the health of the database. But the health of the database is inseparable from the health of the application. I mean, having little to no disk usage is fine, unless… something is wrong with the application, which leads to no new purchases.

And so a user would be interested in monitoring the number of purchases per hour, or the time passed since the last successful purchase. This kind of data can only be generated by the user’s own specific query. Looking at the charts, the user would then feel safer and more confident in the health of the store application.

But let’s dig further. We want the store’s website to respond quickly. In particular, the query which returns the items in a customer’s cart must return quickly. Our user would not only want to see that purchases are coming along, but also that page load times (as in our example) are quick for those critical parts. And so a user should be able to monitor the time it takes to execute a given query.

It can be of further interest to know how many times per second a given query is executed. This part is not easily done on the server side, and requires the user’s cooperation (or else we must analyze the general log, sniff, or set up a proxy). If the user is willing, she can log to some table each time she executes a certain query. Then we’re back to monitoring a regular table, as with the first example.
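For example (a hypothetical bookkeeping table; all names are mine), the application could record each execution of a named query, and the monitoring side then simply counts rows per time window:

```sql
CREATE TABLE query_log (
  query_name  VARCHAR(64) NOT NULL,
  executed_at TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP,
  took_ms     INT UNSIGNED,
  KEY (query_name, executed_at)
);

-- Application side, after running the "cart items" query:
INSERT INTO query_log (query_name, took_ms) VALUES ('cart_items', 12);

-- Monitoring side: executions in the past minute
SELECT COUNT(*)
FROM query_log
WHERE query_name = 'cart_items'
  AND executed_at >= NOW() - INTERVAL 1 MINUTE;
```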

It is also possible to monitor a query’s execution plan. Is it a full scan? How many rows are expected? But given that we can monitor the time it takes to execute a query, I’m not sure this is useful. If everything runs fast enough, who cares how it executes?

Some of the above can be monitored at an altogether higher level: if we’re talking about some web application, then we can use our Apache logs to determine load times for pages, or the number of requests to our “cart items” page. But we don’t always work with web servers, and we may be interested in checking the specific queries behind the scenes.

Summary

Custom monitoring can include:

  • User defined queries (number of concurrent visitors; count of successful operations per second; number of rows per given table or condition; …)
  • Execution time for user defined queries (time it takes to return cart items; find rows matching condition; sort a table; …)
  • Number of executions for a given query, per second.
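For illustration, the store examples above might be monitored with queries along these lines (the purchases table and its completed_at column are hypothetical):

```sql
-- Purchases in the past hour
SELECT COUNT(*) AS purchases_last_hour
FROM purchases
WHERE completed_at >= NOW() - INTERVAL 1 HOUR;

-- Seconds since the last successful purchase
SELECT TIMESTAMPDIFF(SECOND, MAX(completed_at), NOW()) AS seconds_since_last_purchase
FROM purchases;
```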

I intend to incorporate the above into mycheckpoint as part of its standard monitoring scheme.

Please share your thoughts below.

SQL: Ranking without self join
https://shlomi-noach.github.io/blog/mysql/sql-ranking-without-self-join
Mon, 14 Sep 2009

The common way of solving the classic SQL problem of ranking involves a self join. I wish to present a different solution, which only iterates the table once, and provides the same output.

The ranking problem

Given a table with names and scores (e.g. students’ exam scores), add a rank to each row, such that the rank identifies its position among the other rows. Rows with identical scores should receive the same rank (e.g. both contenders got the silver medal).

Consider the following table (download score.sql):

mysql> select * from score;
+----------+--------------+-------+
| score_id | student_name | score |
+----------+--------------+-------+
|        1 | Wallace      |    95 |
|        2 | Gromit       |    97 |
|        3 | Shaun        |    85 |
|        4 | McGraw       |    92 |
|        5 | Preston      |    92 |
+----------+--------------+-------+
5 rows in set (0.00 sec)

We wish to present ranks in some way similar to:

+----------+--------------+-------+------+
| score_id | student_name | score | rank |
+----------+--------------+-------+------+
|        2 | Gromit       |    97 |    1 |
|        1 | Wallace      |    95 |    2 |
|        4 | McGraw       |    92 |    3 |
|        5 | Preston      |    92 |    3 |
|        3 | Shaun        |    85 |    4 |
+----------+--------------+-------+------+

Note that McGraw and Preston got the same score, therefore both share the 3rd rank, whereas Shaun is ranked #4.

The common solution

Following is the SQL for the common solution:

SELECT
  s1.score_id, s1.student_name, s1.score, COUNT(DISTINCT s2.score) AS rank
FROM
  score s1 JOIN score s2 ON (s1.score <= s2.score)
GROUP BY s1.score_id
;
+----------+--------------+-------+------+
| score_id | student_name | score | rank |
+----------+--------------+-------+------+
|        1 | Wallace      |    95 |    2 |
|        2 | Gromit       |    97 |    1 |
|        3 | Shaun        |    85 |    4 |
|        4 | McGraw       |    92 |    3 |
|        5 | Preston      |    92 |    3 |
+----------+--------------+-------+------+

(The above can be ORDERed at will; more on this later.)

What I’m suggesting is that this self join is overkill. It recalculates over and over what we knew a second before: to get Preston’s rank, we need to count how many students got a score >= 92. But when we need to find Shaun’s rank, we re-iterate these, and in addition count those with grades 85..91. We’re reading, re-reading, then re-re-reading (you get the point) the same data all over again. It’s a waste of energy.

Offering a new solution

I propose a simpler solution: do a one-time sorting of the rows according to score (descending). The first row in the sorted set should obviously get the rank “1”. Now we iterate the rows one by one, keeping a rank variable. Whenever the score remains the same, we just keep on iterating. Whenever the score changes (it can only change downwards, since we sorted by score, descending), we increment the rank.

Actually, the above explanation makes it sound as if we do this with multiple steps. This is not so. We do this all in one step:

SELECT
  score_id, student_name, score,
  @prev := @curr,
  @curr := score,
  @rank := IF(@prev = @curr, @rank, @rank+1) AS rank
FROM
  score,
  (SELECT @curr := null, @prev := null, @rank := 0) sel1
ORDER BY score DESC
;
+----------+--------------+-------+----------------+----------------+------+
| score_id | student_name | score | @prev := @curr | @curr := score | rank |
+----------+--------------+-------+----------------+----------------+------+
|        2 | Gromit       |    97 |           NULL |             97 |    1 |
|        1 | Wallace      |    95 |             97 |             95 |    2 |
|        4 | McGraw       |    92 |             95 |             92 |    3 |
|        5 | Preston      |    92 |             92 |             92 |    3 |
|        3 | Shaun        |    85 |             92 |             85 |    4 |
+----------+--------------+-------+----------------+----------------+------+
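(Side note from a later perspective, not available when this was written: MySQL 8.0 introduced window functions, which express this same dense ranking directly; note that RANK became a reserved word in 8.0, hence the backticks.)

```sql
SELECT
  score_id, student_name, score,
  DENSE_RANK() OVER (ORDER BY score DESC) AS `rank`
FROM score;
```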

Execution plan comparison

Do we have an index on the score column?

If not, I clearly win. The self join (when there are more than a mere 5 rows, of course) makes for repeated full table scans, thereby making for O(n²).

+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
|  1 | SIMPLE      | s1    | ALL  | NULL          | NULL | NULL    | NULL |    5 | Using temporary; Using filesort |
|  1 | SIMPLE      | s2    | ALL  | NULL          | NULL | NULL    | NULL |    5 | Using where                     |
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+

Whereas the suggested solution requires one filesort. This still means table data may be re-read, but significantly less so: it only takes O(n*log(n)), where the log(n) part is usually very small (and it all depends on sort_buffer_size).

+----+-------------+------------+--------+---------------+------+---------+------+------+----------------+
| id | select_type | table      | type   | possible_keys | key  | key_len | ref  | rows | Extra          |
+----+-------------+------------+--------+---------------+------+---------+------+------+----------------+
|  1 | PRIMARY     | <derived2> | system | NULL          | NULL | NULL    | NULL |    1 | Using filesort |
|  1 | PRIMARY     | score      | ALL    | NULL          | NULL | NULL    | NULL |    5 |                |
|  2 | DERIVED     | NULL       | NULL   | NULL          | NULL | NULL    | NULL | NULL | No tables used |
+----+-------------+------------+--------+---------------+------+---------+------+------+----------------+

Testing on the somewhat larger sakila.film table (1000 rows is all) on my laptop, it takes 47 seconds for the common query to complete, and 0.01 seconds for the presented solution (repeatedly, so no cache issues).

What happens when we do have an index? Again, I win. Testing on sakila.film now, having added an index on the length column:

+----+-------------+-------+-------+---------------+--------+---------+------+------+----------------------------------------------+
| id | select_type | table | type  | possible_keys | key    | key_len | ref  | rows | Extra                                        |
+----+-------------+-------+-------+---------------+--------+---------+------+------+----------------------------------------------+
|  1 | SIMPLE      | s2    | index | length        | length | 3       | NULL | 1140 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | s1    | ALL   | length        | NULL   | NULL    | NULL | 1140 | Using where                                  |
+----+-------------+-------+-------+---------------+--------+---------+------+------+----------------------------------------------+

The above still needs to do a full index scan. The GROUP BY requires either sorting or utilizing the PRIMARY KEY, and the execution plan with reversed table ordering does not improve matters.

The presented solution utilizes the index for a single pass, with O(n) complexity:

+----+-------------+------------+--------+---------------+--------+---------+------+------+----------------+
| id | select_type | table      | type   | possible_keys | key    | key_len | ref  | rows | Extra          |
+----+-------------+------------+--------+---------------+--------+---------+------+------+----------------+
|  1 | PRIMARY     | <derived2> | system | NULL          | NULL   | NULL    | NULL |    1 |                |
|  1 | PRIMARY     | film       | index  | NULL          | length | 3       | NULL | 1140 |                |
|  2 | DERIVED     | NULL       | NULL   | NULL          | NULL   | NULL    | NULL | NULL | No tables used |
+----+-------------+------------+--------+---------------+--------+---------+------+------+----------------+

This is done with a single index pass. Again, on my laptop, I get ~41 seconds for the first query, 0.01 seconds for the proposed solution.

I also argue that, in addition, the results would often be required in order of rank, in which case the common solution must eventually sort anyhow.

Conclusion

To be honest, I’ve seen the self-join solution in so many places: textbooks, training material, online tutorials… Maybe it’s just a silly exercise, perhaps not your daily real-world task; but it’s one of those classic SQL problems. The so-often-repeated common solution is ANSI SQL, for sure, but at what cost?

7 ways to convince MySQL to use the right index
https://shlomi-noach.github.io/blog/mysql/7-ways-to-convince-mysql-to-use-the-right-index
Thu, 02 Apr 2009

Sometimes MySQL gets it wrong. It doesn’t use the right index.

It happens that MySQL generates a query plan which is really bad (EXPLAIN says it’s going to explore some 10,000,000 rows), when another plan (I’ll soon show how it was generated) says: “Sure, I can do that with 100 rows using a key”.

A true story

A customer had issues with his database. Queries were taking 15 minutes to complete, and the db in general was not responsive. Looking at the slow query log, I found the criminal query. Allow me to bring you up to speed:

A table is defined like this:

CREATE TABLE data (
  id INT UNSIGNED AUTO_INCREMENT,
  type INT UNSIGNED,
  level TINYINT UNSIGNED,
  ...
  PRIMARY KEY(id),
  KEY `type` (type)
) ENGINE=InnoDB;

The offending query was this:

SELECT id FROM data
WHERE type=12345 AND level > 3
ORDER BY id

The facts were:

  • `data` has about 10,000,000 rows.
  • The index on `type` is selective: about 100 rows per value on average.
  • The query took a long time to complete.
  • EXPLAIN has shown that MySQL uses the PRIMARY KEY, hence searches 10,000,000 rows, filtered “using where”.
  • The other EXPLAIN has shown that by using the `type` key, only 110 rows are expected, to be filtered “using where”, then sorted “using filesort”

So MySQL acknowledged it was generating the wrong plan. The other plan was better by its own standards.

Solving the problem

Let’s walk through 7 ways to solve the problem, starting with the more aggressive solutions, refining to achieve desired behavior through subtle changes.

Solution #1: OPTIMIZE

If MySQL got it wrong, it may be because the table was frequently changed. This affects the statistics. If we can spare the time (table is locked during that time), we could help out by rebuilding the table.

Solution #2: ANALYZE

ANALYZE TABLE is less time consuming, in particular on InnoDB, where it is barely noticeable. An ANALYZE will update the index statistics and help generate better query plans.
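For reference, the two statements, applied to the table from this story:

```sql
OPTIMIZE TABLE data;  -- rebuilds the table and its indexes; locks the table meanwhile
ANALYZE TABLE data;   -- refreshes index statistics only; much cheaper
```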

But hold on: the above two solutions are fine, but in the given case, MySQL already acknowledged that better plans were at hand. The fact was, I tried running ANALYZE a few times, to no avail.

Solution #3: USE INDEX

Since the issue was urgent, my first thought went for the ultimate weapon:

SELECT id FROM data USE INDEX(type)
WHERE type=12345 AND level > 3
ORDER BY id

This instructs MySQL to only consider the indexes listed; in our example, I only want MySQL to consider the `type` index. It is this method that generated the other (good) EXPLAIN result. I could have gone even more ruthless and asked for FORCE INDEX.
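The more ruthless variant would be:

```sql
SELECT id FROM data FORCE INDEX (`type`)
WHERE type=12345 AND level > 3
ORDER BY id
```

FORCE INDEX behaves like USE INDEX, with the addition that a full table scan is assumed to be very expensive, so the listed index is used whenever it can possibly serve the query.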

Solution #4: IGNORE INDEX

A similar approach would be to explicitly negate the use of the PRIMARY KEY, like this:

SELECT id FROM data IGNORE INDEX(PRIMARY)
WHERE type=12345 AND level > 3
ORDER BY id

A moment of thinking

The above solutions are “ugly”, in the sense that they are not standard SQL. They’re too MySQL specific.

I asked the programmers to do a quick rewrite, and had a few moments to consider: why did MySQL insist on using the PRIMARY KEY? Was it because I asked for the `id` column only? I rewrote as follows:

SELECT id, type, level FROM data
WHERE type=12345 AND level > 3
ORDER BY id

Nope. EXPLAIN got me the same bad plan. Then it must be the ORDER BY clause:

SELECT id FROM data
WHERE type=12345 AND level > 3

Sure enough, EXPLAIN now indicates using the `type` index, reading only 110 rows. So MySQL preferred to scan 10,000,000 rows just so that the rows would be generated in the right ORDER and no sorting would be required, when it could have read 110 rows (where each row is a mere INT) and sorted them in no time.

Armed with this knowledge, a few more options are at hand.

Solution #5: Move some logic to the application

At about that point I got a message that the programmers were unable to add the USE INDEX part. Why? They were using the EJB framework, which limits your SQL-like queries to something very generic. Well, you can always drop the ORDER BY part and sort on the application side. That isn’t fun, but it’s been done.

Solution #6: Negate use of PRIMARY KEY

Can we force MySQL to use the `type` index, retain the ORDER BY, and do it all with standard SQL? Sure. The following query does this:

SELECT id, type, level FROM data
WHERE type=12345 AND level > 3
ORDER BY id+0

id+0 is a function of the `id` column. This makes MySQL unable to utilize the PRIMARY KEY (or any other index on `id`, had there been one).

In his book “SQL Tuning“, Dan Tow dedicates a chapter on hints and tips like the above. He shows how to control the use or non-use of indexes, the order by which subqueries are calculated, and more.

Unfortunately, the EJB specification said this was not allowed: you cannot ORDER BY a function, only by a plain column.

Solution #7: Make MySQL think the problem is harder than it really is

Almost out of options. Just a moment before settling for sorting on the application side, another idea came up: since MySQL was fooled once, can it be fooled again to make things right? Can we fool it into believing that the PRIMARY KEY would not be worthwhile to use? The following query does this:

SELECT id, type, level FROM data
WHERE type=12345 AND level > 3
ORDER BY id, type, level

Let’s reflect on this one. In what order are the rows returned now? Answer: exactly as before. Since `id` is the PRIMARY KEY, it is also UNIQUE, so no two `id` values are the same. Therefore the secondary sort column is redundant, and so is the one after it. We get exactly the same result as with “ORDER BY id”.

But MySQL didn’t catch this. The query made MySQL say: “Mmmmm. ‘ORDER BY id, type, level’ is not doable with the PRIMARY KEY alone. Well, in this case, I had better use the `type` index”. Is this a weakness of MySQL? I guess so. Maybe it will be fixed in the future. But this was the fix that saved the day.
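The redundancy argument is easy to self-check in miniature: when the leading sort key is unique, the trailing sort keys can never influence the order. A quick Python sketch, with made-up sample rows:

```python
# Rows as (id, type, level) tuples; ids are unique, like a PRIMARY KEY.
rows = [(3, 12345, 4), (1, 12345, 9), (2, 12345, 5)]

by_id = sorted(rows, key=lambda r: r[0])                 # ORDER BY id
by_all = sorted(rows, key=lambda r: (r[0], r[1], r[2]))  # ORDER BY id, type, level

# With a unique leading key, the two orderings are necessarily identical.
assert by_id == by_all
print(by_id)
```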

]]>
https://shlomi-noach.github.io/blog/mysql/7-ways-to-convince-mysql-to-use-the-right-index/feed 39 695
Two storage engines; different plans, Part II https://shlomi-noach.github.io/blog/mysql/two-storage-engines-different-plans-part-ii https://shlomi-noach.github.io/blog/mysql/two-storage-engines-different-plans-part-ii#comments Fri, 07 Nov 2008 17:55:08 +0000 https://shlomi-noach.github.io/blog/?p=35 In Part I of this article, we have seen how the internal structure of the storage engine’s index can affect an execution plan. We’ve seen that some plans are inherent to the way engines are implemented.

We wish to present a second scenario in which execution plans vary for different storage engines. Again, we will consider MyISAM and InnoDB. Again, we will use the world database for testing. This time, we will see how confident the storage engines are in their index search capabilities.

Many newcomers to databases believe that an index search is always preferable to a full table scan. This is not the case. If I were to look for 10 rows in a 1,000,000-row table using an indexed column, I could benefit from an index search. However, if I’m looking for 200,000 rows in that table (that’s 20% of the rows), an index search can actually be much more expensive than a full table scan.

There are several points to consider here: a full table scan is often close to sequential, whereas an index traversal is not. Not only are the index nodes stored non-sequentially, but the links from the index to the table data may look like a plate of macaroni. Also, the index itself is a tree structure, and the number of pages in the index can be larger than the number of pages in the table. Even for partial index scans, it may be worthwhile to simply scan the table.

The threshold above which a table scan is preferred is somewhere between 10% and 30% in common DBMSs.

We will consider here a scenario where we index a two-valued column, a simple ‘T’ and ‘F’ enum. “That’s a very poor column to index”, you may say. But what if the ratio between the two values is high? Say, 1000:1? Should there be different search plans for the ‘F’ valued rows and for the ‘T’ valued rows?

Let us duplicate the CountryLanguage table, and make it much larger. We will create a table named “cl”, with some 125K rows.

mysql> SHOW CREATE TABLE CountryLanguage \G
*************************** 1. row ***************************
Table: CountryLanguage
Create Table: CREATE TABLE `CountryLanguage` (
`CountryCode` char(3) NOT NULL default '',
`Language` char(30) NOT NULL default '',
`IsOfficial` enum('T','F') NOT NULL default 'F',
`Percentage` float(4,1) NOT NULL default '0.0',
PRIMARY KEY  (`CountryCode`,`Language`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> CREATE TABLE cl SELECT * FROM CountryLanguage;
Query OK, 984 rows affected (0.02 sec)
Records: 984  Duplicates: 0  Warnings: 0

And now make it very large, by repeatedly doubling it (intermediate repetitions omitted):

mysql> INSERT INTO cl SELECT * FROM cl;
Query OK, 984 rows affected (0.02 sec)
Records: 984  Duplicates: 0  Warnings: 0

mysql> INSERT INTO cl SELECT * FROM cl;
Query OK, 62976 rows affected (0.08 sec)
Records: 62976  Duplicates: 0  Warnings: 0

mysql> UPDATE cl SET IsOfficial='F';
Query OK, 1265 rows affected (0.23 sec)
Rows matched: 125952  Changed: 1265  Warnings: 0

mysql> UPDATE cl SET IsOfficial='T' WHERE RAND()<0.001;
Query OK, 148 rows affected (0.20 sec)
Rows matched: 148  Changed: 148  Warnings: 0

We now have a large table where the majority of rows have ‘F’ for ‘IsOfficial’ and a minority have ‘T’. We shall now add an index on this column, and then make sure the table is MyISAM (it may have been created with another storage engine, depending on our default engine setting).

mysql> ALTER TABLE cl ADD INDEX (IsOfficial);
Query OK, 125952 rows affected (0.31 sec)
Records: 125952  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE cl ENGINE=MyISAM;
Query OK, 125952 rows affected (1.21 sec)
Records: 125952  Duplicates: 0  Warnings: 0

Now let us compare the search plans for ‘F’ and for ‘T’.

mysql> EXPLAIN SELECT * FROM cl WHERE IsOfficial='F' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: cl
type: ALL
possible_keys: IsOfficial
key: NULL
key_len: NULL
ref: NULL
rows: 94464
Extra: Using where
1 row in set (0.02 sec)

mysql> EXPLAIN SELECT * FROM cl WHERE IsOfficial='T' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: cl
type: ref
possible_keys: IsOfficial
key: IsOfficial
key_len: 1
ref: const
rows: 138
Extra: Using where
1 row in set (0.00 sec)

What MyISAM decided was that an index search on the ‘F’ rows is useless; a table scan was deemed preferable. However, for the ‘T’-valued rows, the index we created was just fine, and would indeed be used.

InnoDB tells a different story.

mysql> ALTER TABLE cl ENGINE=InnoDB;
Query OK, 125952 rows affected (1.07 sec)
Records: 125952  Duplicates: 0  Warnings: 0

mysql> EXPLAIN SELECT * FROM cl WHERE IsOfficial='F' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: cl
type: ref
possible_keys: IsOfficial
key: IsOfficial
key_len: 1
ref: const
rows: 61667
Extra: Using where
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT * FROM cl WHERE IsOfficial='T' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: cl
type: ref
possible_keys: IsOfficial
key: IsOfficial
key_len: 1
ref: const
rows: 148
Extra: Using where
1 row in set (0.00 sec)

On the ‘T’ search, MyISAM and InnoDB agree. But look at the plan for the ‘F’ rows: InnoDB still prefers an index search to table scan, even though it estimates a lookup on 50% of the rows.

The behavior just exposed is not entirely consistent. InnoDB and MyISAM differ in the way they update index statistics. While ANALYZE TABLE on MyISAM performs an exhaustive search on index values, InnoDB will only do 10 random test dives and return with a rough calculation. In fact, InnoDB’s estimates can vary greatly from the real value distribution, and successive calls to ANALYZE TABLE can produce varying results.

What has been presented in this part is not a rule to live by. You shouldn’t base your queries or expected behavior on the index distribution or search plan calculated by the storage engine; these may change over time. What’s instructive here is the freedom MySQL gives its storage engines in decision making, and the different actions taken when dealing with different engines.

]]>
https://shlomi-noach.github.io/blog/mysql/two-storage-engines-different-plans-part-ii/feed 3 35
Two storage engines; different plans, Part I https://shlomi-noach.github.io/blog/mysql/two-storage-engines-different-plans-part-i https://shlomi-noach.github.io/blog/mysql/two-storage-engines-different-plans-part-i#comments Sat, 01 Nov 2008 16:36:29 +0000 https://shlomi-noach.github.io/blog/?p=9 A question that pops up every so often is: “Can an execution plan change for different storage engines?”

The answer is “Yes”. I will present two such cases, where the MySQL optimizer will choose different execution plans, based on our choice of storage engine.

We will consider MyISAM and InnoDB, the two most popular engines. The two differ in many respects, and in particular, in the way they implement indexes and statistics: two major factors from the optimizer’s point of view.

Let’s start with the famous world database, available from dev.mysql.com. All tables in this schema are defined as MyISAM. We will alter them between MyISAM and InnoDB as we go along.

A peek at the Country table reveals:

mysql> SHOW CREATE TABLE Country \G
*************************** 1. row ***************************
Table: Country
Create Table: CREATE TABLE `Country` (
`Code` char(3) NOT NULL default '',
`Name` char(52) NOT NULL default '',
`Continent` enum('Asia','Europe','North America','Africa','Oceania','Antarctica','South America') NOT NULL default 'Asia',
...
PRIMARY KEY (`Code`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

To see the first example of execution plan difference, we will add an index on the Country table:

ALTER TABLE Country ADD INDEX (Continent);

And run the following query to find European country codes:

mysql> SELECT Code FROM Country WHERE Continent = 'Europe';
+------+
| Code |
+------+
| NLD |
| ALB |
| AND |
| BEL |
| BIH |
| GBR |
...

But how is this query executed?

mysql> EXPLAIN SELECT Code FROM Country WHERE Continent = 'Europe'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Country
type: ref
possible_keys: Continent
key: Continent
key_len: 1
ref: const
rows: 37
Extra: Using where
1 row in set (0.00 sec)

Simple enough: we asked for European countries only. MySQL has found the index on Continent to be appropriate. However, to get the actual Code, a table row read was necessary.

InnoDB will provide a different plan, though:

mysql> ALTER TABLE Country ENGINE=InnoDB;
Query OK, 239 rows affected (0.18 sec)
Records: 239 Duplicates: 0 Warnings: 0

mysql> EXPLAIN SELECT Code FROM Country WHERE Continent = 'Europe'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Country
type: ref
possible_keys: Continent
key: Continent
key_len: 1
ref: const
rows: 46
Extra: Using where; Using index
1 row in set (0.00 sec)

Can you spot the difference? The “Extra” column now indicates “Using index” (the estimated row counts also differ, but that’s another issue).

The reason for this change lies in the way MyISAM and InnoDB implement indexes. MyISAM takes the approach where the table data resides in its own space (in fact, in its own file), and all indexes refer to rows in that space: MyISAM uses nonclustered indexes.

InnoDB, however, uses a clustered index on the PRIMARY KEY. That is, for every table there is always a PRIMARY KEY index (even if we never defined one), and the table data is stored within the index’s structure. And so, to access table rows, one must first traverse the PRIMARY KEY index. The Code column is the primary key, and therefore the data is clustered on the Code column.

InnoDB’s secondary indexes behave altogether differently. A secondary index does not refer to table rows directly, but instead refers to the PRIMARY KEY values that identify those rows. A table lookup using a secondary key involves a search on that key, only to obtain a PRIMARY KEY value, followed by a search on the clustered index as well. A side effect is that a secondary index includes the values of the PRIMARY KEY. Each secondary index, like the one we created on Continent, is somewhat of a compound index, as if defined on (Continent, Code). This is why, for our query, a search on the index was enough: there was no need to access table data, since all relevant data could be found within the index.

I say “somewhat”, because in contrast with an index on (Continent, Code), the index does not necessarily store the PRIMARY KEY values in any particular order. To prove this, let’s ask the following:

mysql> EXPLAIN SELECT Code FROM Country WHERE Continent = 'Europe' ORDER BY Code\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Country
type: ref
possible_keys: Continent
key: Continent
key_len: 1
ref: const
rows: 46
Extra: Using where; Using index; Using filesort
1 row in set (0.00 sec)

There’s a “Using filesort” comment in the “Extra” column, which would not have been there had we used a compound index on (Continent, Code).
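The two layouts can be contrasted side by side in SQLite (a sketch; SQLite is used only because it ships with Python, and the table and index names are made up): an ordinary rowid table resembles MyISAM’s arrangement of a table space plus pointing indexes, while a WITHOUT ROWID table is clustered on its PRIMARY KEY much like InnoDB, so its secondary index implicitly carries the PRIMARY KEY and covers our query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "MyISAM-like": a plain rowid table; the secondary index points at rows.
conn.execute("CREATE TABLE c_heap (Code TEXT PRIMARY KEY, Continent TEXT)")
conn.execute("CREATE INDEX heap_cont ON c_heap (Continent)")

# "InnoDB-like": clustered on the PRIMARY KEY; the secondary index
# implicitly carries the PK value (Code) alongside Continent.
conn.execute("CREATE TABLE c_clust (Code TEXT PRIMARY KEY, Continent TEXT)"
             " WITHOUT ROWID")
conn.execute("CREATE INDEX clust_cont ON c_clust (Continent)")

def plan(table):
    rows = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT Code FROM {table} WHERE Continent='Europe'")
    return " ".join(r[3] for r in rows)

print(plan("c_heap"))   # index search, then a table lookup to fetch Code
print(plan("c_clust"))  # "COVERING INDEX": Code is already inside the index
```

The “COVERING INDEX” annotation in the clustered table’s plan is SQLite’s counterpart of MySQL’s “Using index”.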

]]>
https://shlomi-noach.github.io/blog/mysql/two-storage-engines-different-plans-part-i/feed 2 9