This is a second in a series of posts describing our experience in migrating a large DWH server to TokuDB. This post discusses the process of migration itself.
As a quick recap (read part 1 here), we have a 2TB compressed InnoDB (4TB uncompressed) based DWH server. Space is running low, and we're looking at TokuDB for answers. Early experiments show that TokuDB's compression could make a good impact on disk space usage. I'm still not discussing performance -- keeping this till later post.
Those with weak hearts can skip right to the end, where we finally have a complete conversion. You can also peek at the very end to find out how much 4TB uncompressed InnoDB data is worth in TokuDB. But you might want to read through. The process was not smooth, and not as expected (it's a war story thing). Throughout the migration we got a lot of insight on TokuDB's behaviour, limitations, conveniences, inconveniences and more.
Disclosure: I have no personal interests and no company interests; throughout the process we were in touch with Tokutek engineers, getting free, friendly & professional advice and providing with input of our own. Most of this content has already been presented to Tokutek throughout the process. TokuDB is open source and free to use, though commercial license is also available.
How do you convert 4TB worth of data to TokuDB?
Obviously one table at a time. But we had another restriction: you may recall I took a live slave for the migration process. And we wanted to end the process with a live slave. So the restriction was: keep it replicating!
How easy would that be? Based on our initial tests, I extrapolated over 20 days of conversion from InnoDB to TokuDB. Even with one table at a time, our largest table was expected to convert in some 12-14 days. Can we retain 14 days of binary logs on a server already running low on disk space? If only I knew then what I know today
I recently claimed (as I recall it was in one of the @DBHangops meetings) I was done with ALTER TABLE statements. I would not touch them again: with oak-online-alter-table and pt-online-schema-change I can get better control of my server (and my sleep). Can I use an online migration tool?
Fortunately we are using Statement Based Replication on this MySQL topology. This makes for good news, because triggers are activated on slave as it is replicating its master's statements. You can run an online migration tool on the slave, while it keeps replicating. This is so cool. I don't need to worry too much about binary logs and relay logs. I can take my time with conversion...
I chose to use pt-online-schema-change. Why choose Percona's tool over my own? Percona's tool supports the --check-slave-lag option, which throttles the operation should the server's slaves start lagging. Wait, I'm running the tool on a slave, so what's the point? Well, you can cheat and provide --check-slave-lag=h=127.0.0.1 so that the tool assumes the localhost is the slave (while it is actually the server being altered); which means it will check on its own slave lag to do the throttling. This works well and is fun to watch.
Starting the migration
Some of our tables had the KEY_BLOCK_SIZE explicitly declared. As I mentioned in previous post, for TokuDB <= 7.0.4 this causes problems by bloating the indexes instead of compressing them (and Tim Callaghan of Tokutek notes this is fixed in next version). common_schema to the rescue: the ALTER statement has to include a DROP and recreate of all indexes.
This is the place to mention our tables are heavily partitioned. This will play a crucial role in the next events. We use RANGE partitions over dates. We have a monthly partitioning scheme on all partitioned tables. And we have partitions to spare: starting a few years back from today (depending on the table) and made until Dec. 2025 -- making for some 170 - 200 partitions per table.
Starting from smallest table (a few rows) and increasing in size, we migrated tables one by one to TokuDB.
A few party crashers were made obvious right from the start (they are all being addressed by Tokutek as far as I know):
- @@datadir: all TokuDB files reside in @@datadir. You get a pile of files in the same directory where you would find your ib_logfile*, master.info, etc. files.
- File names: you do not get file names after table names. Instead, you get cryptic names like ./_myschema_sql_7fda_8e73_*.You would suspect that there is some uniqueness to the 7fda_8e73 thing; that it relates to a single table -- it doesn't. Same table get different file names, different tables get similar names -- there's not one regular expression to differentiate tables -- and I do know my regexes.
- INFORMATION_SCHEMA.Tokudb_file_map doesn't make it much easier, either. It is meant to tell you about tables to file names mappings. But the tables are not laid out in easy TABLE_SCHEMA, TABLE_NAME columns, but are denormalized themselves, and can be vague and almost ambiguous, to some extent. common_schema to the rescue, its rewritten tokudb_file_map maps your tables to aggregated list of file names, along with shell commands you would typically want to issue.
But let me emphasize these issues, before you think I'm just spoiled. TokuDB creates multiple files per table: other than the standard .frm file, you get one "main" file for each table, and then another file for each index. So it's quite possible your table will consist of some 7 files.
Well, as you may know, MySQL's implementation of partitioning is that each partition is made of its own standalone table, hidden from the user (but easily viewable on the filesystem). Now this means a single table with 170 partitions and a few indexes can make for over 1,000 files. That's right - for a single table. We have a few dozens like this.
- You may need to accommodate thousands or tens of thousands of files in your @@datadir
- None of which it is easy for you to know who to relate to.
So common_schema's tokudb_file_map gives you this crazy list of 1,000 files which make up your single table. This isn't too friendly, either.
I think Tokutek are missing here on one of the greatest advantages they bring to the table. The one thing a DBA want to know when looking at her MySQL's filesystem is: how much disk space is used by a schema/table. And this information becomes hard to get. Again, common_schema's view will provide you with the script to do it (du -c ... | tail -1) -- but you would have to go into MySQL, out to shell... Bother.
Crashes and failures
The above did not stop at inconveniences. Soon enough, and while still altering my smaller tables, I would get failures from pt-online-schema-change. Checking manually to eliminate the possibility of a bug in Percona's tool, I got:
mysql> alter table my_schema.my_table engine=tokudb row_format=tokudb_small; ERROR 1016 (HY000): Can't open file: './my_schema/#sql-80d_2.frm' (errno: 24)
This would happen again and again and again. What's the deal here?
Let me cut short on this one: we got over 20,000 files in @@datadir. And MySQL was unable to open any more files. Mind you, we had:
Quite the generous numbers (also backed up by ulimit, to be on the safe side; and also note we're using XFS file system). And yet, open files were an issue. To prove my point, it was possible to ALTER a table with a fewer number of partitions. It was then possible to ALTER another smaller table without partitions. It was then impossible to alter any additional partitioned table. Once I dropped partitioning for some very small table some room was made and I was again able to ALTER a partitioned table... And this would happen for completely empty tables -- no data involved. We were on some file limit here.
Another such ALTER and the server crashed. This was quite unceremonious. The error log produced nothing; no stack trace. Zit.
I was fervently querying the Tokudb_file_map to get a picture of what's going on. I would need to do a self join on the table (as common_schema's view does) to get a per-table listing of files. This would occasionally crash the server. I guess I had 3 or 4 such crashes.
On this I feel I'm quite the veteran here :D. TokuDB recovery works well. As with InnoDB, TokuDB recognize there has been a crash, and before allowing MySQL to open connections it restores to a stable state.
ALTER TABLE in TokuDB
Here I found two comforting features (the third and fourth yet to be discovered). As indicated above, I did turn to issue a manual ALTER TABLE. What I found was:
- A TokuDB ALTER TABLE statement lets you know its progress. This is no little thing! Your SHOW PROCESSLIST output shows messages like "Fetched about 312724000 rows, loading data still remains", or "Loading of data about 66.1% done", or "Queried about 33113001 rows, Inserted about 33113000 rows". Cool!
- Even better, the crash I had during the ALTER TABLE? I thought that would be the end of it. If you ever had a crash while ALTERing an InnoDB table, you know how it goes: InnoDB will forever complain about some table existing but not listed (or the other way around). And don't get me started with DISCARD TABLESPACE; when InnoDB decides it is upset about something - you cannot convince it otherwise.
Thankfully, TokuDB completely reverted the ALTER operation. It removed what temporary files were created (further notes following) and forgot all about it. No complaints, no ghosts. Great! Back to consistency!
What do we do now?
Throwing my hands up in the air, having worked on this for many days, I thought to myself: OK, I still have this server all to myself. If TokuDB is not going to work out, I have some time to come up with a sharding/commercial solution. Let's use up this time and learn something about TokuDB. And I decided to re-create all tables without partitions. My colleague argued that she was not ready to give up on partitioning altogether and we decided to try again with YEARly partitioing scheme. This would reduce number of files by factor of 12. Also, 2025 is so far away, we agreed to settle for 2015. So reducing number of files by factor of 25-30.
And this made all the difference in the world. Having reduced number of files made the impact we were hoping for. Suddenly all went well. No crashes, no weird complaints, little proliferation of files in @@datadir.
And I did notice that a manual ALTER TABLE went considerably faster than I would expect. And by far faster than the pt-online-schema-change pace. I tried a couple more -- sure thing. ALTERing a table from InnoDB to TokuDB is fast.
- I converted a 47GB InnoDB COMPRESSED table to TokuDB in 73 minutes. By the way, resulting table size measured 3.4GB.
- A 330GB InnoDB COMPRESSED table converted to TokuDB took little over 9 hours. I dare you alter 600GB worth of uncompressed data into InnoDB (COMPRESSED) in less than a few days. It went down to 31GB.
- And our largest, 1TB COMPRESSED table (2TB worth of uncompressed data)? There's yet another story here.
Altering 1 (2 uncomressed) TB of data
Here's a tip that will save you some exhaustion: SET tokudb_load_save_space := 1.
While ALTERing our largest table, I was concerned to find our disk space was running low. Plenty temporary TokuDB files were created. I assumed these would consume only so much disk space, but to my surprise they accumulated and accumulated... It turns out for ALTERing a table TokuDB creates the equivalent of the table in temporary files, and only then generates the new table. This means you need to have enough room for your own original table, the equivalent in temporary files, and your new table altogether.
With great compression that would be nothing. However you'll be surprised to learn that by default those temporary files are not compressed. Thus, the ALTER operation consumed more than 1.3TB of disk space in temporary files, until I had no choice and (36 hours into the operation) had to KILL it before it consumed the entire 3TB of disk space.
Setting the variable as specified and the next attempt was far more successful: the temporary files were created with same compression algorithm as target table, which left with a lot of free space to work with.
ALTER time took about 40 hours.
Well, what's the resulting size?
And we were finally done! It took the better part of three weeks to work through all the pitfalls, the pt-online-schems-change attempts, the crashes, the tests, the no-partitions, the YEARly partitions... Finally we are with a TokuDB version of our data warehouse.
Suspension is over. We got from 2TB of InnoDB COMPRESSED (KEY_BLOCK_SIZE=8) down to 200GB of TokuDB_SMALL (aka agressive, aka lzma) tables.
I mean, this is beyond expectations. It is ridiculously small. From 80% disk space utilization down to 8% disk space utilization. Absolutely ridiculous!
- TokuDB does not play well with many partitions.
- Crashes encountered. Recovery is fine.
- Good ALTER TABLE experience
- SET tokudb_load_save_space := 1
- Great compression (x20 from uncompressed InnoDB; x10 from KEY_BLOCK_SIZE=8)
In the following post I'll share some observations on how well our newly converted TokuDB slave performs as compared to our equivalent InnoDB slaves; some configuration you might care about; and some things you can do with TokuDB that would be so very painful with InnoDB. Stay tuned!