InnoDB – code.openark.org

The mystery of MySQL 5.6 excessive buffer pool flushing

shlomi — Sun, 20 Apr 2014 05:16:13 +0000

I’m experimenting with upgrading to MySQL 5.6 and am experiencing an unexplained increase in disk I/O utilization. After discussing this with several people I’m publishing in the hope that someone has an enlightenment on this.

We have a few dozens servers in a normal replication topology. On this particular replication topology we’ve already evaluated that STATEMENT based replication is faster than ROW based replication, and so we use SBR. We have two different workloads on our slaves, applied by two different HAProxy groups, on three different data centres. Hardware-wise, servers of two groups use either Virident SSD cards or normal SAS spindle disks.

Our servers are I/O bound. A common query used by both workloads looks up data that does not necessarily have a hotspot, and is very large in volume. DML is low, and we only have a few hundred statements per second executed on master (and propagated through replication).

We have upgraded 6 servers from all datacenters to 5.6, both on SSD and spindle disks, and are experiencing the following phenomena:

A substantial increase in disk I/O utilization. See a 10 day breakdown (upgrade is visible on 04/14) this goes on like this many days later:

A substantial increase in InnoDB buffer pool pages flush: Mr. Blue is our newly upgraded server; it joins Mr. Green upgraded a couple weeks ago. Mr. Red is still 5.5. This is the only MySQL graph that I could directly relate to the increase in I/O:
No change in workload (it takes some 60 minutes for caches to warm up, so measuring after that time). Same equal share of serving as dictated by HAProxy. Same amount of queries. Same amount of everything.
Faster replication speed, on single thread – that’s the good part! We see 30% and more improvement in replication speed. Tested by stopping SLAVE SQL_THREAD for a number of pre-defined minutes, then measuring time it took for slave to catch up, up to 10 seconds lag. The results vary depending on the time of day and serving workload on slaves, but it is consistently far faster with 5.6.

The faster replication speed motivates us to continue with the experiment, and is of a significant factor in our decision. However we are concerned about the I/O utilization and excessive flushing.

The above graphs depict the 5.6 status without any configuration changes as compared to 5.5. I took some days to reconfigure the following variables, with no change to the rate of flushed pages (though some changes visible in double-wite buffer writes):

innodb_log_file_size=128M/2G
innodb_adaptive_flushing:=0/1
innodb_adaptive_flushing_lwm:=0/70
innodb_max_dirty_pages_pct := 75/90
innodb_flush_neighbors:=0/1
innodb_max_dirty_pages_pct_lwm:=75/90
innodb_old_blocks_time:=0/1000
innodb_io_capacity:=50/100/200
innodb_io_capacity_max:=50/100/1000
relay_log_info_repository:=’table’/’file’
master_info_repository:=’table’/’file’
default_tmp_storage_engine:=’myisam’/’innodb’
eq_range_index_dive_limit:=0/10

And more… Have done patient one-by-one or combinations of the above where it made sense. As you see I began with the usual suspects and moved on to more esoteric stuff. I concentrated on new variables introduced in 5.6, or ones where the defaults have changed, or ones we have explicitly changed the defaults from.

The above is consistent on all upgraded servers. On SSD the disk utilization is lower, but still concerning.

Our use case is very different from the one presented by Yoshinori Matsunobu. and apparently not too many have experienced upgrading to 5.6. I’m hoping someone might shed some light.

TokuDB configuration variables of interest

shlomi — Wed, 23 Oct 2013 17:42:12 +0000

During our experiments I came upon a few TokuDB variables of interest; if you are using TokuDB you might want to look into these:

tokudb_analyze_time

This is a boundary on the number of seconds an ANALYZE TABLE will operate on each index on each partition on a TokuDB table.

That is, if tokudb_analyze_time = 5, and your table has 4 indexes (including PRIMARY) and 7 partitions, then the total runtime is limited to 5*4*7 = 140 seconds.

Default in 7.1.0: 5 seconds

tokudb_cache_size

Similar to innodb_buffer_pool_size, this variable sets the amount of memory allocated by TokuDB for caching pages. Like InnoDB the table is clustered within the index, so the cache includes pages for both indexes and data.

Default: 50% of total memory

tokudb_directio

Boolean, values are 0/1. Setting tokudb_directio = 1 is like specifying innodb_flush_method = O_DIRECT. Which in turn means the OS should not cache pages requested by TokuDB. Default: 0.

Now here’s the interesting part: we are used to tell InnoDB to get the most memory we can provide (because we want it to cache as much as it can) and to avoid OS caching (because that would mean a page would appear both in the buffer pool and in OS memory, which is a waste). So the following setup is common:

innodb_buffer_pool_size = [as much as you can allocate while leaving room for connection memory]G
innodb_flush_method = O_DIRECT

And my first instinct was to do the same for TokuDB. But after speaking to Gerry Narvaja of Tokutek, I realized it was not that simple. The reason TokuDB’s default memory allocation is 50% and not, say, 90%, is that OS cache caches the data in compressed form, while TokuDB cache caches data in uncompressed form. Which means if you limit the TokuDB cache, you allow for more cache to the OS, that is used to cache compressed data, which means more data (hopefully, pending duplicates) in memory.

I did try both options and did not see an obvious difference, but did not test this thoroughly. My current setup is:

#No setup. just keep to the default for both:
#tokudb_cache_size
#tokudb_directio

tokudb_commit_sync

tokudb_fsync_log_period

These two variable are similar in essence to innodb_flush_log_at_trx_commit, but allow for finer tuning. With innodb_flush_log_at_trx_commit you choose between syncing the transaction log to disk upon each commit and once per second. With tokudb_commit_sync = 1 (which is default) you get transaction log sync to disk per commit. When tokudb_commit_sync = 0, then tokudb_fsync_log_period dictates the interval between flushes. So a value of tokudb_fsync_log_period = 1000 means once per second.

Since our original InnoDB installation used innodb_flush_log_at_trx_commit = 2, our TokuDB setup is:

tokudb_commit_sync = 0
tokudb_fsync_log_period = 1000

tokudb_load_save_space

Turned on (value 1) by default as of TokuDB 7.1.0, this parameter decides whether temporary file created on bulk load operations (e.g. ALTER TABLE) are compressed or uncompressed. Do yourself a big favour (why? read here) and keep it on. Our setup is:

tokudb_load_save_space = 1

TokuDB’s general recommendation is: don’t change the variables; the engine should work well right out of the box. I like the approach (by MySQL 5.5 I already lost count of InnoDB variables that can have noticeable impact; with 5.6 I’m all but lost). The complete list of configuration variables is found in TokuDB’s Users Guide.

Converting an OLAP database to TokuDB, part 3: operational stuff

shlomi — Mon, 14 Oct 2013 10:03:43 +0000

This is the third post in a series of posts describing our experience in migrating a large DWH server to TokuDB (see 1st and 2nd parts). This post discusses operations; namely ALTER TABLE operations in TokuDB. We ran into quite a few use cases by this time that we can shed light on.

Quick recap: we’ve altered one of out DWH slaves to TokuDB, with the goal of migrating most of out servers, including the master, to TokuDB.

Adding an index

Shortly after migrating our server to TokuDB we noticed an unreasonably disproportionate slave lag on our TokuDB slave (red line in chart below) as compared to other slaves.

Quick investigation led to the fact that, coincidentally, a manual heavy-duty operation was just taking place, which updated some year’s worth of data retroactively. OK, but why so slow on TokuDB? Another quick investigation led to an apples vs. oranges problem: as depicted in part 1, our original setup included MONTHly partitioning on our larger tables, whereas we could not do the same in TokuDB, where we settled for YEARly partitioning.

The heavy-duty operation included a query that was relying on the MONTHly partitioning to do reasonable pruning: a WHERE condition on a date column did the right partition pruning; but where on InnoDB that would filter 1 month’s worth of data, on TokuDB it would filter 1 year.

Wasn’t it suggested that TokuDB has online table operations? I decided to give it a shot, and add a proper index on our date column (I actually created a compound index, but irrelevant).

It took 13 minutes to add an index on a 1GB TokuDB table (approx. 20GB InnoDB uncompressed equivalent):

The ALTER was non blocking: table was unlocked at that duration
The client issuing the ALTER was blocked (I thought it would happen completely in the background) — but who cares?
I would say 13 minutes is fast

Not surprisingly adding the index eliminated the problem altogether.

Modifying a PRIMARY KEY

It was suggested by our DBA that there was a long time standing need to modify our PRIMARY KEY. It was impossible to achieve with our InnoDB setup (not enough disk space for the operation, would take weeks to complete if we did have the disk space). Would it be possible to modify our TokuDB tables? On some of our medium-sized tables we issued an ALTER of the form:

ALTER TABLE my_table DROP PRIMARY KEY, ADD PRIMARY KEY (c1, c2, c3, ...);

Time-wise the operation completed in good time. We did note, however, that the disk space consumed by the new table was doubled. Was it due to the fact we added two columns to our PK? Did that account for the bloated space? I did not believe so, and decided to rebuild the table:

OPTIMIZE TABLE my_table

Nope. Space not reduced. However we were unconvinced and asked. As usual, we got quick response from the Tokutek team; this was a bug: while our original table used the TOKUDB_SMALL row format (high compression), the table rebuild reset it to TOKUDB_FAST (normal compression), which makes for roughly twice the file size. The bug was filed as: alter table operations that rebuild the table lose the original tokudb compression.

Now, we were altering the PRIMARY KEY. We were not expecting an online operation anyhow, and didn’t mind blocking the table; hence the solution was simple: make sure to spceify the row format:

ALTER TABLE my_table DROP PRIMARY KEY, ADD PRIMARY KEY (c1, c2, c3, ...) ENGINE=TokuDB ROW_FORMAT=TOKUDB_SMALL;

This worked in terms of disk space — but we only later realized it would still make us trouble.

Modifying a PRIMARY KEY on our largest table

We moved on to our largest table: originally 1TB InnoDB COMPRESSED, worth of 2TB uncompressed. With TokuDB it went down to 100GB. Converting this table to TokuDB took about 40 hours, which is just fast. We issued an ALTAR TABLE modifying the PRIMARY KEY as above and waited.

The operation did not complete after 40 hours. Nor after 3 days. By day 4 we thought we might look into this. Fortunately, TokuDB is friendly on SHOW PROCESSLIST and provides you with useful information, such as “Fetched about 1234567890 rows, loading data still remains“. Yikes! We extrapolated the values to realize it would take 2 weeks to complete! Weekend went by and we decided to find a better way. Again, posting on the tokudb-user group, we got a definitive answer: a table rebuild does not utilize the bulk loader (you really want to be friends with the bulk loader, it’s the process that loads your data quickly).

And so we chose to KILL the ALTER process and go another way; again, KILLs are very easy with TokuDB ALTER operations: took 3 minutes to abort this week old operation. The alternative operation was:

CREATE TABLE my_table_New LIKE my_table;
ALTER TABLE my_table_New DROP PRIMARY KEY, ADD PRIMARY KEY (c1, c2, c3, ...) ENGINE=TokuDB ROW_FORMAT=TOKUDB_SMALL;
INSERT INTO my_table_New SELECT * FROM my_table;
RENAME TABLE my_table TO my_table_Old, my_table_New TO my_table;
DROP TABLE my_table_Old;

The INSERT INTO … SELECT operation does use the bulk loader when you do it on an empty table. It completed within merely 30 hours. Hurrah!

DROPping a TABLE

It was an immediate operation to drop our “Old” table — subsecond. Nothing like your InnoDB DROP.

Converting an OLAP database to TokuDB, part 1

shlomi — Tue, 03 Sep 2013 07:04:12 +0000

This is the first in a series of posts describing my impressions of converting a large OLAP server to TokuDB. There’s a lot to tell, and the experiment is not yet complete, so this is an ongoing blogging. In this post I will describe the case at hand and out initial reasons for looking at TokuDB.

Disclosure: I have no personal interests and no company interests; we did get friendly, useful and free advice from Tokutek engineers. TokuDB is open source and free to use, though commercial license is also available.

The case at hand

We have a large and fast growing DWH MySQL setup. This data warehouse is but one component in a larger data setup, which includes Hadoop, Cassandra and more. For online dashboards and most reports, MySQL is our service. We populate this warehouse mainly via Hive/Hadoop. Thus, we have an hourly load of data from Hive, as well as a larger daily load.

There are some updates on the data, but the majority of writes are just mysqlimports of Hive queries.

Usage of this database is OLAP: no concurrency issues here; we have some should-be-fast-running queries issued by our dashboards, as well as ok-to-run-longer queries issued for reports.

Our initial and most burning trouble is with size. Today we use COMPRESSED InnoDB tables (KEY_BLOCK_SIZE is default, i.e. 8). Our data volume sums right now at about 2TB. I happen to know this translates as 4TB of uncompressed data.

However growth of data is accelerating. A year ago we would capture a dozen GB per month. Today it is a 100GB per month, and by the end of this year it may climb to 150GB per month or more.

Our data is not sharded. We have a simple replication topology of some 6 servers. Machines are quite generous as detailed following. And yet, we will be running out of resources shortly: disk space (total 2.7TB) is now running low and is expected to run out in about six months. One of my first tasks in Outbrain is to find a solution to our DWH growth problem. The solution could be sharding; it could be a commercial DWH product; anything that works.

The approach we experiment with

It was at my initial interview that I suggested TokuDB might be a good solution, with the primary reason of being so good with compression. And we decided to experiment with this simple (setup-wise) solution of compression. If we could compress the data even by 50%, that would buy us considerable time. And it’s the simplest approach as we would need to change nothing at the application side, nor add additional frameworks.

Of course, we were already using InnoDB COMPRESSED tables. How about just improving the compression? And here I thought to myself: we can try KEY_BLOCK_SIZE=4, which I know would generally compress by 50% as compared to KEY_BLOCK_SIZE=8 (not always, but in many use cases). We’re already using InnoDB so this isn’t a new beast; it will be “more of the same”. It would work.

I got myself a dedicated machine: a slave in our production topology I am free to play with. I installed TokuDB 7.0.1, later upgraded to 7.0.3, based on MySQL 5.5.30.

The machine is a Dell Inc. PowerEdge R510 machine, with 16 CPUs @ 2.1 GHz and 126 GiB RAM, 16 GiB Swap. OS is CentOS 5.7, kernel 2.6.18. We have RAID 10 over local 10k RPM SAS disks (10x600GB disks)

How to compare InnoDB & TokuDB?

2TB of compressed data (for absolute measurement I consider it to be a 4TB worth of data) is quite a large setup. How do I do the comparison? I don’t even have too much disk space here…

We have tables of various size. Our largest is in itself 1TB (2TB uncompressed) – half of the entire volume. The rest ranging 330GB, 140GB, 120GB, 90GB, 50GB and below. We have MONTHly partitioning schemes on most tables and obviously on our larger tables.

For our smaller tables, we could just CREATE TABLE test_table LIKE small_table, populating it and comparing compression. However, the really interesting question (and perhaps the only interesting question compression-wise) is how well would our larger (and specifically largest) tables would compress.

Indeed, for our smaller tables we saw between 20% to 70% reduction in size when using stronger InnoDB compression: KEY_BLOCK_SIZE=4/2/1. How well would that work on our larger tables? How much slower would it be?

We know MySQL partitions are implemented by actual independent tables. Our testing approach was: let’s build a test_table from a one month worth of data (== one single partition) of our largest table. We tested:

The time it takes to load the entire partition (about 120M rows, 100GB COMPRESSED data as seen on .idb file)
The time it would take to load a single day’s worth of data from Hive/Hadoop (loading real data, as does our nightly import)
The time it would take for various important SELECT query to execute on this data.

InnoDB vs. TokuDB comparison

In this post I will only describe our impressions of compression size. I have a lot to say about TokuDB vs InnoDB partitioning and queries; this will wait till later post.

So here goes:

Engine	Compression	Time to Insert 1 month	Table size (optimized)	Time to import 1 day
InnoDB	8k	10.5h	58GB	32m
InnoDB	4k	48h	33GB	unknown (too long)
TokuDB	quicklz	14h	17GB	40m
TokuDB	lzma (small/aggresive)	15h	7.5GB	42m

Some comments and insights:

Each test was performed 3-4 times. There were no significant differences on the various cycles.
The 1 month insert was done courtesy QueryScript split, 5,000 rows at a time, no throttling.
The 1 day import via mysqlimport. There were multiple files imported. Each file is sorted by PRIMARY KEY ASC.
Isn’t it nice to know that your 100GB InnoDB table actually fits within 58GB when rebuilt?
For InnoDB flush_logs_at_trx_commit=2, flush_method=O_DIRECT.
I used default configuration to TokuDB — touched nothing. More on this in later post.
InnoDB 4k was prohibitively slow to load data. It was so slow so as to be unacceptable. For the 1 day load it took 1 hour for a mere 20% of data to load. 1 hour was already marginal for our requirements; waiting for 5 hours was out of the question. I tested several times, never got to wait for completion. Did I say it would just be “more of the same”? 4k turned to be “not an option”.
I saw almost no difference in load time between the two TokuDB compression formats. Both somewhat (30%) longer than InnoDB to load, but comparable.
TokuDB compression: nothing short of amazing.

With InnoDB 4k being “not an option”, and with both TokuDB compressions being similar in load time yet so different in compression size, we are left with the following conclusion: if we want to compress more than our existing 8k (and we have to) – TokuDB’s agressive compression (aka small, aka lzma) is our only option.

Shameless plug

common_schema turned to be quite the “save the day” tool here. Not only did we use it to extract 100GB of data from a large dataset and load it onto our tables, it also helped out in the ALTER process for TokuDB: at this time (<= 7.0.4) TokuDB still has a bug with KEY_BLOCK_SIZE: when this option is found in table definition, it impacts TokuDB’s indexes by bloating them. This is how sql_alter_table_tokudb was born. Hopefully it will be redundant shortly.

More to come

Was our test fair? Should we have configure TokuDB differently? Is loading via small 5,000 row chunks the right way?

In the next post I will describe the process of migrating our 4TB worth of data to TokuDB, pitfalls, issues, party crushers, sport spoilers, configuration, recovery, cool behaviour and general advice you should probably want to embrace. At later stage I’ll describe how our DWH looks after migration. Finally I’ll share some (ongoing) insights on performance.

You’ll probably want to know “How much is (non compressed) 4TB of data worth in TokuDB?” Let’s keep the suspense

Easy SELECT COUNT(*) with split()

shlomi — Sat, 08 Jun 2013 04:41:13 +0000

The two conservative ways of getting the number of rows in an InnoDB table are:

SELECT COUNT(*) FROM my_table:
provides with an accurate number, but makes for a long running transaction which take ages on large tables. Long transactions make for locks
SELECT TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA=’my_schema’ AND TABLE_NAME=’my_table’, or get same info via SHOW TABLE STATUS.
Gives immediate response, but the value can be way off; it can be two times as large as real value, or half the value. For query execution plans this may be a “good enough” estimation, but typically you just can’t trust it for your own purposes.

Get a good estimate using chunks

You can get a good estimate by calculating the total number of rows in steps. Walk the table 1,000 rows at a time, and keep a counter. Each chunk is its own transaction, so, if the table is modified while counting, the final value does not make for an accurate account at any point in time. Typically this should be a far better estimate than TABLE_ROWS.

QueryScript’s split() construct provides you with the means to work this out. Consider this script:

set @total := 0;

split(SELECT COUNT(*) FROM world.City INTO @chunk) {
  set @total = @total + @chunk;
}

select @total;

split() breaks the above SELECT COUNT(*) into distinct chunks, like:

SELECT COUNT(*) FROM world.City WHERE ((((`City`.`ID` > '3000'))) AND (((`City`.`ID` < '4000')) OR ((`City`.`ID` = '4000')))) INTO @chunk

You can make this a one liner like this:

call common_schema.run(“set @total := 0;split(SELECT COUNT(*) FROM world.City INTO @chunk) set @total = @total + @chunk; select @total;”);

If you like to watch the progress, add some verbose:

call common_schema.run("set @total := 0;split(SELECT COUNT(*) FROM world.City INTO @chunk) {set @total = @total + @chunk; select $split_step, @total} select @total;");

QueryScript is available via common_schema.

Converting compressed InnoDB tables to TokuDB 7.0.1

shlomi — Wed, 05 Jun 2013 07:10:12 +0000

Or: how to make it work in TokuDB version 7.0.1. This is a follow up on a discussion on the tokudb-user group.

Background

I wanted to test TokuDB’s compression. I took a staging machine of mine, with production data, and migrated it from Percona Server 5.5 To MariaDB 5.5+TokuDB 7.0.1. Migration went well, no problems.

To my surprise, when I converted tables from InnoDB to TokuDB, I saw an increase in table file size on disk. As explained by Tim Callaghan, this was due to TokuDB interpreting my compressed table’s “KEY_BLOCK_SIZE=4” as an instruction for TokuDB’s page size. TokuDB should be using 4MB block size, but thinks it’s being instructed to use 4KB. Problem is, you can’t get rid of table options. When one converts a table to InnoDB in ROW_FORMAT=COMPACT, or even to MyISAM, the KEY_BLOCK_SIZE option keeps lurking in the dark.

So until this is hopefully resolved in TokuDB’s next version, here’s a way to go around the problem.

The case at hand

Consider the following table:

 CREATE TABLE `t` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `c1` int(10) unsigned NOT NULL DEFAULT '0',
  `c2` int(10) unsigned NOT NULL DEFAULT '0',
  `c3` int(10) unsigned NOT NULL DEFAULT '0',
  `c4` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `c5` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `c6` smallint(10) unsigned NOT NULL DEFAULT '0',
  `c7` smallint(10) unsigned NOT NULL DEFAULT '0',
  `c8` smallint(10) unsigned NOT NULL DEFAULT '0',
  `c9` smallint(10) unsigned NOT NULL DEFAULT '0',
  `c10` smallint(10) unsigned NOT NULL DEFAULT '0',
  `c11` smallint(10) NOT NULL DEFAULT '0',
  `c12` smallint(10) NOT NULL DEFAULT '0',
  `c13` smallint(10) NOT NULL DEFAULT '0',
  `c14` smallint(10) NOT NULL DEFAULT '0',
  `ct` text NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `c1c4` (`c1`,`c4`),
  KEY `c4` (`c4`)
) ENGINE=InnoDB AUTO_INCREMENT=4688271 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4

Note that it is in COMPRESSED format, with KEY_BLOCK_SIZE=4. It mostly has INT columns, so I don’t expect it to compress by much.

On disk, the .ibd file amounts to 160MB. Table has 3,587,488 rows. Same table in InnoDB COMPACT row format amounts to 412MB on disk.

Converting the table to TokuDB with aggressive compression resulted with:

mysql> alter table t engine=tokudb row_format=tokudb_lzma;
Query OK, 3587488 rows affected (29 min 48.79 sec)
Records: 3587488  Duplicates: 0  Warnings:

And over 873MB of combined files on disk! Also note it took nearly 30 minutes to ALTER. Clearly this is not the expected outcome.

Attempt to make it work

I tried both the following approaches:

alter table t engine=tokudb row_format=tokudb_lzma key_block_size=4096: thought it would fool TokuDB to think it should create a 4M key block size.
alter table t engine=tokudb row_format=tokudb_lzma key_block_size=0: try and reset the key block size.

Both the above attempts resulted with same bloat in resulting table.

The reason? When ALTERing a table with a nother KEY_BLOCK_SIZE, the keys on the table remain with their old KEY_BLOCK_SIZE. They are unaffected by the ALTER. As suggested by Nail Kashapov, indexes must be rebuilt as well.

Making it work

The next ALTER modifies the KEY_BLOCK_SIZE and rebuilds all the indexes on the table:

mysql> alter table t drop primary key, add primary key(id), drop key c1c4, add unique key `c1c4` (c1, c4), drop key c4, add key `c4` (c4), engine=tokudb row_format=tokudb_lzma key_block_size=0;
Query OK, 3587488 rows affected (2 min 7.97 sec)
Records: 3587488  Duplicates: 0  Warnings: 0

Yep! Runtime seems much more agreeable. Total size on disk? Little over 26M. Did I say I wasn’t expecting good reduction in terms of compression?

Have done the same for multiple tables; compression is consistently strong (e.g. 16MB InnoDB compressed -> 3.5MB TokuDB aggressive, 548MB InnoDB non-compressed -> 36MB TokuDB aggressive), on varying table schemata. Very impressive reduction in disk space!

Conclusion

Next version of TokuDB is expected to ignore the KEY_BLOCK_SIZE table option; until then converting compressed tables to TokuDB is a pain in terms of the syntax — but worthwhile in terms of disk space.

State of InnDB Online DDL in MySQL 5.6.9-RC (good news included)

shlomi — Tue, 18 Dec 2012 11:21:12 +0000

5.6.9-RC is out, and I was curious to see how the online DDL has improved since my 5.6.8 review. I also owe James Day this review, since he came up with results inconsistent with my own.

We both agreed the dataset I was using was too small, but I got similar results even on larger scale. Then some time passed, and 5.6.9 was announced.

So for the 5.6.9 test I took one of my real tables on production. It is not extremely large: it’s a ~ 300MB .ibd file, in the following format:

mysql> show create table tbl \G

CREATE TABLE `tbl` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `a` varchar(255) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `w` smallint(11) NOT NULL DEFAULT '0',
  `d` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `icount` smallint(5) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`) KEY_BLOCK_SIZE=8,
  UNIQUE KEY `u_idx` (`a`,`w`,`d`) KEY_BLOCK_SIZE=8,
  KEY `d` (`d`) KEY_BLOCK_SIZE=8
) ENGINE=InnoDB AUTO_INCREMENT=16960441 DEFAULT CHARSET=latin1 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=16

Got some 2.5M rows in the table; desktop machine, 64 bit Linux, mysqlsandbox.

I have crossed several DDL statements with several DML statements. The DDL statements in this test are (ALTER TABLE…):

ROW_FORMAT=COMPACT
AUTO_INCREMENT=16960441
ADD INDEX (w)
DROP INDEX w
ADD COLUMN c CHAR(1) NOT NULL
DROP COLUMN c

The DML statements are:

select max(id) from test.tbl; — this queries the AUTO_INCREMENT value, which is of course a PRIMARY KEY
select min(d) from test.tbl; — there is an index on d, and normal execution plan is to optimize table away and just use the index
select min(icount) from test.tbl; — there is no index on icount, and full table scan is required
update test.tbl set d = d + interval 1 second where id = 8057370; — the UPDATE uses the PRIMARY KEY
update test.tbl set d = d + interval 1 second where icount = 200; — will affect 4 rows, but requires full scan.

The results?

ALTER TABLE…	Time (sec)	General comments	select max PK	select min by index	select min by full scan	update by PK	update by full scan
ROW_FORMAT=COMPACT	183		online	online	online	online	online
AUTO_INCREMENT=16960441	0.24	[Instant operation]	n/a	n/a	n/a	n/a	n/a
ADD INDEX (w)	21		online	online	online	online	online
DROP INDEX w	0.1	[Instant operation]	n/a	n/a	n/a	n/a	n/a
ADD COLUMN c CHAR(1) NOT NULL	103		online	online	online	online	online
DROP COLUMN c	110		online	online	online	online	online

Notes

All operations were online: operations did not wait for ALTER to complete.
I executed all operations multiple times during each ALTER.
In addition, I executed operations from another client.
Some operations were fast, others sometimes took as long as 7.34 seconds to complete. This is no small matter: the time it took for each DML was indeterministic, and longer than what it would usually take it. That’s perfectly understandable. Just note that some operations took exceedingly long time to complete. My understanding is that the ALTER operations happens in chunks. DML statements are allowed in between these chunks. This is the reason why on smaller tables there didn’t seem to be any “online” statement: the chunks were just too large in relation to table size. And so, and this is still my own understanding, your query may get lucky or unlucky depending on the exact moment it has been issued.
I did not try it with FOREIGN KEYs. I previously concluded that foreign keys were a no-go for online DDL. I’m not sure if this is still the case. Another time for this test – but it must take place.

Conclusions

Still RC – but for the first time the online DDL seem to deliver what’s promised. I’m very happy to see this.

I am yet to understand how the ALTER works via replication. With single threaded replication I would assume it’s back to “wait till I’m done” on the slave, in which case the “online” term is not there yet. Even on multi-threaded replication DML on same schema would hang. I’m happy to be corrected on this by an authority.

My predicament is that oak-online-alter-table or pt-online-schema-change are here to stay for the next couple of years at least. Some operations, like partitioning, are not supported by current online InnoDB DDL. Also, these scripts allow you some control over the speed at which the ALTER process works, allowing for pre-defined sleep time in between chunks, so as to let the server – and its slaves – recover their breath.

Nonetheless, big kudos for the InnoDB team at Oracle for pulling this one out!

Killing InnoDB idle transactions

shlomi — Tue, 04 Dec 2012 12:23:12 +0000

The issue of terminating long-time idle open InnoDB transaction has been discussed recently by many. I wish to add my share, by proposing a quick and clean solution via common_schema.

common_schema 1.2 provides with the innodb_transactions view, which relies on INNODB_TRX – one of the InnoDB Plugin views in INFORMATION_SCHEMA – as well as on PROCESSLIST, and so is able to determine with certainty that a transaction has been idle for a long time.

innodb_transactions offers us with a sql_kill_query column, which produces a ‘KILL QUERY 12345’ type of value. So we can:

SELECT sql_kill_query FROM innodb_transactions WHERE trx_idle_seconds >= 10; 
+-------------------+
| sql_kill_query    |
+-------------------+
| KILL QUERY 292509 |
| KILL QUERY 292475 |
+-------------------+

common_schema‘s useful eval() routine allows us to actually invoke those KILL statements, all in a one-liner:

call eval('SELECT sql_kill_query FROM innodb_transactions WHERE trx_idle_seconds >= 10');

Technical details

trx_idle_seconds notes the time, in seconds, the transaction has been idle, or 0 if the transaction is not idle at all.
sql_kill_query is a self-generated SQL query which kills the running query, e.g. ‘KILL QUERY 12345’.
eval() takes a query as text, retrieves the SQL resulting column, and executes it live.

Background details

The connection between INNODB_TRX and PROCESSLIST is not synchronous. It is possible that by the time one is querying INNODB_TRX, PROCESSLIST data may change (e.g. next query is already replacing the one you were considering in INNODB_TRX). But in our case it is of little consequence: we are interested in transactions that have been idle for quite some time. Say, 10 seconds. So we are not troubled by having 200 queries per second changing under our hands.

If the transaction has been asleep for 10 seconds, and we decide to kill it, well, it is possible that just as we kill it it will turn active again. It’s a risk we take no matter what kind of solution we apply, since there’s no atomic “get-status-and-kill” operation on InnoDB transactions.

The above solution is manual: one must invoke the query which kills the idle transactions. This is as opposed to a built-in server feature which does the same. Events can used to semi-automate this: one can call upon this query once every 10 seconds, for example.

See the many related and inspiring solutions below:

State of InnDB Online DDL in MySQL 5.6.8-RC

shlomi — Tue, 20 Nov 2012 09:49:14 +0000

5.6.8-rc is out, and so I’m following up on InnoDB’s online DDL new feature: the ability to SELECT, INSERT, DELETE, UPDATE a table even while an ALTER TABLE is executing on same table.

The brief summary

Not as advertised; many things can’t be done.

The longer review

I’m using 5.6.8-rc 64bit binary distribution for Linux, installed via mysqlsandbox. My hardware is irrelevant, but the fact I’m testing on my laptop assists me in that ALTER TABLE operations take a while, so that I’m able to easily type commands in two terminals and have the time to watch them being executed. Query cache is disabled.

I’m using the sakila sample database, and in particular I’m working with the rental table. Here’s the table definition:

CREATE TABLE `rental` (
  `rental_id` int(11) NOT NULL AUTO_INCREMENT,
  `rental_date` datetime NOT NULL,
  `inventory_id` mediumint(8) unsigned NOT NULL,
  `customer_id` smallint(5) unsigned NOT NULL,
  `return_date` datetime DEFAULT NULL,
  `staff_id` tinyint(3) unsigned NOT NULL,
  `last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`rental_id`),
  UNIQUE KEY `rental_date` (`rental_date`,`inventory_id`,`customer_id`),
  KEY `idx_fk_inventory_id` (`inventory_id`),
  KEY `idx_fk_customer_id` (`customer_id`),
  KEY `idx_fk_staff_id` (`staff_id`),
  CONSTRAINT `fk_rental_staff` FOREIGN KEY (`staff_id`) REFERENCES `staff` (`staff_id`) ON UPDATE CASCADE,
  CONSTRAINT `fk_rental_inventory` FOREIGN KEY (`inventory_id`) REFERENCES `inventory` (`inventory_id`) ON UPDATE CASCADE,
  CONSTRAINT `fk_rental_customer` FOREIGN KEY (`customer_id`) REFERENCES `customer` (`customer_id`) ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=16050 DEFAULT CHARSET=utf8

Highlights for the table: AUTO_INCREMENT PRIMARY KEY, some columns indexed, some not, and Foreign Keys in place. Pretty much a standard table. It contains 16,044 rows. Row format is COMPACT.

What I want to know is: which DDL commands allow for which online DML commands?

So, on terminal #1 I will issue queries like:

node1 5.6.8-rc-log sakila> alter table sakila.rental ROW_FORMAT=COMPACT /* or whatever */;
Query OK, 0 rows affected (10.57 sec)
Records: 0  Duplicates: 0  Warnings: 0

And during the above operation, I will execute the following on terminal #2:

select max(rental_id) from sakila.rental; this queries the AUTO_INCREMENT value, which is of course a PRIMARY KEY
select min(rental_date) from sakila.rental; there is an index on rental_date, and normal execution plan is to optimize table away and just use the index
select min(return_date) from sakila.rental; there is no index on return_date, and full table scan is required
update rental set return_date = return_date + interval 1 second where rental_id=3; the UPDATE uses the PRIMARY KEY
update rental set return_date = return_date + interval 1 second where return_date = NOW(); won’t actually affect anything, but requires full scan.

So here are the results:

+-------------------------------------------------------------+-------+---------------------------+---------------+---------------------+-------------------------+--------------+---------------------+
| ALTER statement                                             | Time  | General comments          | select max PK | select min by index | select min by full scan | update by PK | update by full scan |
+-------------------------------------------------------------+-------+---------------------------+---------------+---------------------+-------------------------+--------------+---------------------+
| ROW_FORMAT=COMPACT                                          | 10.92 |                           | Instant       | Instant             | Instant                 | blocked      | blocked             |
| AUTO_INCREMENT=16051                                        |  0.06 | Instant, no table rebuild | N/A           | N/A                 | N/A                     | N/A          | N/A                 |
| ADD INDEX(last_update)                                      |  2.37 |                           | blocked       | blocked             | blocked                 | blocked      | blocked             |
| ADD INDEX(last_update), ALGORITHM=INPLACE                   |  1.83 |                           | blocked       | blocked             | blocked                 | blocked      | blocked             |
| ADD INDEX(last_update), ALGORITHM=INPLACE, LOCK=NONE        |  0.00 | ERROR 1235 (42000): ...   | N/A           | N/A                 | N/A                     | N/A          | N/A                 |
| ADD COLUMN c CHAR(1) NOT NULL                               | 11.20 |                           | Instant       | Instant             | Instant                 | blocked      | blocked             |
| ADD COLUMN c CHAR(1) NOT NULL, ALGORITHM=INPLACE, LOCK=NONE |  0.00 | ERROR 1235 (42000): .     | N/A           | N/A                 | N/A                     | N/A          | N/A                 |
+-------------------------------------------------------------+-------+---------------------------+---------------+---------------------+-------------------------+--------------+---------------------+

Rather surprising, I would say.

None of my tests resolved with online write (UPDATE). At best I could get online read (SEELCT).
AUTO_INCREMENT is instantaneous. High time for that! It’s just some number in the .frm file, never understood the need for table rebuild.
Apparently ADD COLUMN is more online than ADD INDEX, and I’ve tested this again and again and again to make sure I was doing it right. This is quite weird, even according to the docs.
In none of the above tests (and others, non listed), have I been able to specify LOCK=NONE. It’s always ERROR 1235 (42000): This version of MySQL doesn’t yet support ‘alter table sakila.rental , algorithm=inplace, lock=none’.

So what’s so online about this? Online reads are nice, but most everyone cannot accept blocking writes (for same reason no one would use mysqlhotcopy, also so wrongly named). This leaves us again with oak-online-alter-table and pt-online-schema-change.

The butler did it

Apologies to the butler, the FOREIGN KEYs did it. Let’s try the same again without foreign keys:

node1 5.6.8-rc-log sakila> create table rental2 like rental;
node1 5.6.8-rc-log sakila> insert into rental2 select * from rental;
node1 5.6.8-rc-log sakila> rename table rental to rental_old, rental2 to rental;
Query OK, 0 rows affected (0.31 sec)

Here are the results:

+-------------------------------------------------------------+-------+---------------------------+----------------+---------------------+-------------------------+----------------+---------------------+
| ALTER statement                                             | Time  | General comments          | select max PK  | select min by index | select min by full scan | update by PK   | update by full scan |
+-------------------------------------------------------------+-------+---------------------------+----------------+---------------------+-------------------------+----------------+---------------------+
| ROW_FORMAT=COMPACT                                          | 11.03 |                           | Instant        | Instant             | Instant                 | Instant        | Instant             |
| AUTO_INCREMENT=16051                                        |  0.05 | Instant, no table rebuild | N/A            | N/A                 | N/A                     | N/A            | N/A                 |
| ADD INDEX(last_update)                                      |  2.04 |                           | * Inconsistent | * Inconsistent      | * Inconsistent          | blocked        | blocked             |
| ADD INDEX(last_update), ALGORITHM=INPLACE, LOCK=NONE        |  3.14 |                           | * Inconsistent | * Inconsistent      | * Inconsistent          | blocked        | blocked             |
| ADD COLUMN c CHAR(1) NOT NULL                               |    ** |                           | * Inconsistent | * Inconsistent      | * Inconsistent          | * Inconsistent | * Inconsistent      |
| ADD COLUMN c CHAR(1) NOT NULL, ALGORITHM=INPLACE, LOCK=NONE |    ** |                           | * Inconsistent | * Inconsistent      | * Inconsistent          | * Inconsistent | * Inconsistent      |
+-------------------------------------------------------------+-------+---------------------------+----------------+---------------------+-------------------------+----------------+---------------------+

What’s going on here?

ALGORITHM=INPLACE, LOCK=NONE is accepted! Bad, bad foreign keys!
* ADD INDEX usually allows for concurrent reads, but after repeated tests SELECTs start to block. Then they don’t work concurrently anymore until table is recreated. But even that not always, so I’m not sure what the inconsistency is.
* ADD COLUMN is still more concurrent than ADD INDEX, and actually allows for concurrent writes! Though, inconsistently. Sometimes it does not allow for concurrent writes.
** ADD COLUMN runtime highly affected by concurrent queries. It wents as high as 45 seconds on my laptop. Now, to make things clear, I’m not running an automated benchmark here: I’m copying+pasting the statements from my editor to the mysql CLI. So, maybe 10 or 15SELECT and UPDATE queries executes. How does that justify 35 seconds delay in table rebuild?

Some conclusions:

The documentation does not specify anything about FOREIGN KEYs crashing the party. It should.
The documentation specifically mentions the ADD/DROP INDEX statements to be online. ADD INDEX is less online than ADD COLUMN.
Everything is still shaky. Sometimes things work, sometimes they don’t.
Runtimes are unproportionally affected by concurrent queries.
For the meantime, I keep to my online alter table scripts. Been using them for 3.5 years now.

InnoDB DDL: kudos to quick responders on bugs.mysql.com

shlomi — Thu, 18 Oct 2012 16:55:29 +0000

Continuing my experiments with 5.6 InnoDB online DDL, a bug which I’ve opened, and another which I commented on were quickly answered and explained by the Oracle/MySQL team.

On both accounts I’m happy to acknowledge the issue is resolved; in both cases I failed to produce a real bug scenario. Good lesson. Kudos for quick and informative responses!

What’s left of my experiment, then? Still a lot to check.

I am mainly still confused with which operations exactly can use LOCK=NONE (allowing for updated to table while ALTERing). So far I am only able to produce ALTERs with LOCK=SHARED, meaning table is readable, but cannot be updated.

I will want to test speeds. I’ve so far been content with slow response times for queries over altered tables. How well will that endure under heavy load?