The two conservative ways of getting the number of rows in an InnoDB table are:
- SELECT COUNT(*) FROM my_table:
provides an accurate number, but makes for a long running transaction which takes ages on large tables. Long transactions make for locks.
- SELECT TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA='my_schema' AND TABLE_NAME='my_table', or get the same info via SHOW TABLE STATUS:
gives an immediate response, but the value can be way off; it can be twice as large as the real value, or half of it. For query execution plans this may be a "good enough" estimation, but typically you just can't trust it for your own purposes.
Get a good estimate using chunks
You can get a good estimate by calculating the total number of rows in steps. Walk the table 1,000 rows at a time, keeping a counter. Each chunk is its own transaction, so, if the table is modified while counting, the final value is not an accurate account of any single point in time. Typically, though, this will be a far better estimate than TABLE_ROWS.
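The mechanism can be sketched outside MySQL. The following is a minimal Python simulation (function and variable names invented for illustration, and with the real per-chunk SELECT COUNT(*) replaced by an in-memory filter): boundary values are taken from ids that actually exist in the table, and each half-open range (prev, bound] is counted separately, as split() would do in its own transaction:

```python
# Simulate how chunked counting walks an indexed id column: pick a
# boundary value every chunk_size rows, then count each half-open
# range (prev, bound] separately -- one short query per chunk.

def chunked_count(ids, chunk_size=1000):
    ordered = sorted(ids)
    if not ordered:
        return 0
    # Boundary values: the id of every chunk_size-th row, plus the last id.
    bounds = ordered[chunk_size - 1::chunk_size]
    if not bounds or bounds[-1] != ordered[-1]:
        bounds.append(ordered[-1])
    total, prev = 0, None
    for bound in bounds:
        # Stand-in for: SELECT COUNT(*) ... WHERE id > prev AND id <= bound
        total += sum(1 for i in ids if (prev is None or i > prev) and i <= bound)
        prev = bound
    return total

sparse_ids = [i * 7 for i in range(2500)]  # gaps between values
print(chunked_count(sparse_ids))  # 2500
```

Note that the boundaries are actual id values, so gaps in the sequence never matter; only the row counts do.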
QueryScript’s split() construct provides you with the means to work this out. Consider this script:
set @total := 0;
split (SELECT COUNT(*) FROM world.City INTO @chunk)
{
  set @total = @total + @chunk;
}
select @total;
split() breaks the above SELECT COUNT(*) into distinct chunks, like:
SELECT COUNT(*) FROM world.City WHERE ((((`City`.`ID` > '3000'))) AND (((`City`.`ID` < '4000')) OR ((`City`.`ID` = '4000')))) INTO @chunk
You can make this a one-liner, like this:
call common_schema.run("set @total := 0; split(SELECT COUNT(*) FROM world.City INTO @chunk) set @total = @total + @chunk; select @total;");
If you'd like to watch the progress, add some verbosity:
call common_schema.run("set @total := 0;split(SELECT COUNT(*) FROM world.City INTO @chunk) {set @total = @total + @chunk; select $split_step, @total} select @total;");
QueryScript is available via common_schema.
What's the reason for WHERE ((((`City`.`ID` > '3000'))) AND (((`City`.`ID` < '4000')) OR ((`City`.`ID` = '4000'))))?
Why not City.ID BETWEEN 3000 AND 3999? (and then start the next chunk at 4000)
Honest question
@Jannes,
Good question.
There are a couple reasons:
1. BETWEEN is inclusive on both edges (a <= x AND x <= b); except for the very first chunk, I want it only half-inclusive (a < x AND x <= b).
2. More importantly, this is but the simple case of a more complex scenario where the key may be compound (see example in http://code.openark.org/blog/mysql/how-common_schema-splits-tables-internals )
In such case you get something like this:
WHERE ((((`film_actor`.`actor_id` > '110')) OR ((`film_actor`.`actor_id` = '110') AND (`film_actor`.`film_id` > '513'))) AND (((`film_actor`.`actor_id` < '146')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` < '278')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` = '278'))));
3. It is worth mentioning that MySQL does not do a good job at doing a range comparison using a compound key:
http://code.openark.org/blog/mysql/mysql-not-being-able-to-utilize-a-compound-index
Which is the reason I need to break the condition into so many clauses.
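The expanded clause above is just a row-wise (tuple) comparison on (actor_id, film_id) written out using only single-column comparisons. As a sanity check (a small sketch, with the boundary values chosen arbitrarily), the expansion can be verified against Python's native tuple comparison:

```python
# Expanded form of (actor_id, film_id) > (a, f), using only
# single-column comparisons, mirroring the generated WHERE clause:
def gt(actor_id, film_id, a, f):
    return (actor_id > a) or (actor_id == a and film_id > f)

# Expanded form of (actor_id, film_id) <= (a, f):
def le(actor_id, film_id, a, f):
    return (actor_id < a) \
        or (actor_id == a and film_id < f) \
        or (actor_id == a and film_id == f)

# Verify equivalence with native tuple comparison on a small grid:
for actor_id in range(5):
    for film_id in range(5):
        assert gt(actor_id, film_id, 2, 3) == ((actor_id, film_id) > (2, 3))
        assert le(actor_id, film_id, 2, 3) == ((actor_id, film_id) <= (2, 3))
print("expansion matches tuple comparison")
```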
Hmmm, I should also note that getting the value 3999 makes for extra effort, since I would need to also get hold of 4000.
This sounds silly, but only when the numbers are sequential. If the numbers are sparse, then the number before 4000 may well be 2975, for all we know.
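In other words, half-open ranges let the upper bound of one chunk be reused directly as the lower bound of the next, using only values that actually exist in the table. A small sketch (id values invented for illustration) showing that such chunks cover every row exactly once, with no need to compute a "bound minus one":

```python
# Sparse id column: gaps everywhere, as in a real table after deletes.
sparse_ids = [10, 11, 2975, 4000, 4001, 9999]

# Half-open chunks (prev, bound] chain naturally; every bound is an
# id that actually exists in the table.
bounds = [11, 4000, 9999]
counted = []
prev = None
for bound in bounds:
    counted += [i for i in sparse_ids
                if (prev is None or i > prev) and i <= bound]
    prev = bound

# Every row is counted exactly once -- never twice, never skipped --
# without ever needing to know "the id just before 4000".
assert sorted(counted) == sorted(sparse_ids)
print(len(counted))  # 6
```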
Thanks, I knew there would be a good reason.
Point 1 I knew, but I didn't think of 2. Still find it amazing that they haven't fixed 3 after all these years.
And now that I see how you actually find a minimum and maximum value such that the chunk size is 1000, I understand your remark about sparse too.