Comments on: How common_schema split()s tables – internals https://shlomi-noach.github.io/blog/mysql/how-common_schema-splits-tables-internals Blog by Shlomi Noach Fri, 07 Sep 2012 03:42:46 +0000 hourly 1 https://wordpress.org/?v=5.3.3 By: shlomi https://shlomi-noach.github.io/blog/mysql/how-common_schema-splits-tables-internals/comment-page-1#comment-117697 Fri, 07 Sep 2012 03:42:46 +0000 https://shlomi-noach.github.io/blog/?p=5035#comment-117697 @Rick,

Thanks. I didn’t feel like you were saying “openark-kit is wrong”, all is cool and I appreciate your comments 🙂

split() does not work within transaction scope: it breaks a transaction. I have presented the “what it looks like”; but the multiple UPDATE statements do not all run within the same transaction. Thank you for pointing out.

As per OR vs. AND query, I made this one due to inefficiency of MySQL handling compound column comparison, e.g.:
WHERE (a,b) >= (6,13) AND (a,b) < (19,3) For some reason MySQL is unable to use an index for the above. My tests show good evaluation plans and index usage for such query conditions. I will check for handler_%. Thanks for suggesting this.

]]>
By: Rick James https://shlomi-noach.github.io/blog/mysql/how-common_schema-splits-tables-internals/comment-page-1#comment-117673 Thu, 06 Sep 2012 22:33:10 +0000 https://shlomi-noach.github.io/blog/?p=5035#comment-117673 This comment may be more important — It is about the inefficient optimization of OR versus AND (at least in older versions of MySQL.

You have:

WHERE
((actor_id > 1) OR (actor_id = 1 AND film_id > 1) OR (actor_id = 1 AND film_id = 1))
AND
((actor_id < 39) OR (actor_id = 39 AND film_id = 1) AND (film_id >= 1 OR actor_id > 1)
AND
(actor_id <= 39) AND (actor_id < 39 OR film_id <= 293)

By playing with De Morgan's law, I have bubbled ANDs to the top, thereby letting it bound the scan over actor_id BETWEEN 1 and 39. Within 1 and 39, it still has to further filter down by film_id.

Run EXPLAIN. Check the delta for VARIABLES LIKE 'Handler%'.

(My chunking technique avoids the need to skip over unnecessary film_id values in the main query.)

]]>
By: Rick James https://shlomi-noach.github.io/blog/mysql/how-common_schema-splits-tables-internals/comment-page-1#comment-117665 Thu, 06 Sep 2012 20:41:49 +0000 https://shlomi-noach.github.io/blog/?p=5035#comment-117665 > Your solution (1 column) misses the case where there aren’t 1000 rows.
“The devil is in the details.” Also misses the last chunk.

> unbalanced chunks.
I don’t see that as a problem. In ‘real’ cases, I find that the average tends to be over 90% of the requested chunk size (1000).

In extreme cases, some chunks might have only 1 row.

My algorithm does not care how what the ordering is (int, string, etc) done — because ORDER BY, LIMIT, WHERE …>… take care of it.

N columns — the WHERE clause id dynamically created and has a variable number of ANDs in it.

Nothing sacred about 1000. However, it should not be “too big” if the queries will be replicated — else other queries will be delayed in the single-threaded replication stream.

I was merely pointing out a different way to ‘skin the cat’, not saying that openark-kiit is in any way ‘wrong’.

One thing to add to your UPDATE example (if using InnoDB). If the set of UPDATEs in wrapped in BEGIN..COMMIT, much of the savings is lost. Without the transaction, ACID across the set is lost. It is a tradeoff that the user should be aware of.

]]>
By: shlomi https://shlomi-noach.github.io/blog/mysql/how-common_schema-splits-tables-internals/comment-page-1#comment-117653 Thu, 06 Sep 2012 17:41:06 +0000 https://shlomi-noach.github.io/blog/?p=5035#comment-117653 @Rick,

Your solution (1 column) misses the case where there aren’t 1000 rows. The LIMIT 1000,1 will return an empty result set.

In the 2-column type, what you are suggesting leads to unbalanced chunks. You will sometimes get 1000 rows, sometimes less, sometimes get many single ones… you have to test beforehand: “does this value have more than 100 rows? Does that one? Does the next one? If they don’t, we aggregate…”.

This is much more work than the way I present, where I use the same technique to detect “the last of the next 1000 rows by order of key”.

Although using a temporary table, the fact we’re only examining 1000 rows makes it small. I’ve been using this technique for over 3 years on production (and so has anyone using the openark-kit, and in particular oak-chunk-update and oak-online-alter-table), and found it to be satisfying. The index is utilized well for looking up the 1000 rows. These are then sorted in reverse order without an index. This solution makes for a consistent algorithm, which does not care about your density, and does not need to diagnose it (avoiding additional lookup SELECTs overhead).

How would you solve a 2-column case where one of the columns is not an integer? A 3-column case? The algorithm used in common_schema & openark-kit doesn’t care. They all work the same way.

]]>
By: Rick James https://shlomi-noach.github.io/blog/mysql/how-common_schema-splits-tables-internals/comment-page-1#comment-117650 Thu, 06 Sep 2012 17:18:19 +0000 https://shlomi-noach.github.io/blog/?p=5035#comment-117650 This is, in my opinion, a better way to ‘chunk’ an ID that is not necessarily dense:

$idz = SELECT id FROM tbl WHERE id > $ida LIMIT 1000,1;

Then you proceed to do
SELECT … WHERE id >= $ida AND id <= $idz

The first query can (in some situations) operate only in the index, thereby not slogging through all the data. And it does not need to build a temp table.

In the 2-column case, things get messier. I prefer to discover which case exists and act accordingly:
WHERE col1 = 123 AND col2 BETWEEN… — when 123 has more than 1000 rows.
WHERE col1 BETWEEN… — no mention of col2

]]>