DELETE, don’t INSERT

I have just read INSERT, Don’t DELETE by Aaron Brown, and have a lengthy response, which is why I am writing this post instead of commenting on said post.

I wish to offer my counter thought and suggest that DELETEs are probably the better choice.

Aaron suggests that, when one wishes to purge rows from some table, a trick can be used: instead of DELETEing unwanted rows, one can INSERT the “good” rows into a new table, then switch over with RENAME (but please read the referenced post for complete details).
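In rough SQL terms, and glossing over the write-blocking caveat discussed below, the trick looks something like this (using the same hypothetical title table and purge condition as the example further down):

-- Copy only the rows we wish to keep into a fresh table,
-- then swap the tables with an atomic RENAME
CREATE TABLE title_new LIKE title;
INSERT INTO title_new SELECT * FROM title WHERE title > 'g';
RENAME TABLE title TO title_old, title_new TO title;
DROP TABLE title_old;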

I respectfully disagree on several points discussed.

Lockdown

The fact that one needs to block writes while the new table is being created is problematic: you essentially need to turn off parts of your application. The post suggests one could use a slave – but that solution is far from trivial either. To switch over, you yet again need to turn off access to the database, even if only for a short while.

A switchover to a slave is quite a big deal, in my opinion, for the mere purpose of deleting rows.

DELETEs are easy

DELETEs are so much easier. The first thing to note is that you don’t actually have to delete all the rows *at once*.

You just need to drop some rows, right? Why waste a huge transaction that takes minutes, when you can delete the rows in chunks, one chunk at a time?
For that, you can use pt-archiver from Percona Toolkit, oak-chunk-update from openark-kit, or write simple QueryScript code with common_schema:

while (DELETE FROM title WHERE title <= 'g' LIMIT 1000)
{
  throttle 1;
}

So: delete 1,000 rows or so at a time, sleep a little, and repeat. The total runtime is longer, but who cares? The impact can be reduced to the point of being unnoticeable.
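If you prefer not to depend on external tooling, a minimal sketch of the same chunked-delete loop as a plain MySQL stored procedure might look like this (the purge_title_chunked name, the 1,000-row chunks and the one-second sleep are just illustrative):

DELIMITER //
CREATE PROCEDURE purge_title_chunked()
BEGIN
  DECLARE rows_deleted INT DEFAULT 1;
  WHILE rows_deleted > 0 DO
    -- Delete one small chunk; ROW_COUNT() reports how many rows were affected
    DELETE FROM title WHERE title <= 'g' LIMIT 1000;
    SET rows_deleted = ROW_COUNT();
    -- Throttle between chunks so the purge stays unnoticeable
    DO SLEEP(1);
  END WHILE;
END //
DELIMITER ;

You can then CALL purge_title_chunked() whenever, and as often as, a purge is due.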

Space reclaim

You can use online table operations to rebuild your table and reclaim the disk space: see either oak-online-alter-table or pt-online-schema-change. Again, both work in small chunks, so there are no long stalls.
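For illustration, a no-op rebuild with pt-online-schema-change could look roughly like this (the mydb schema name is hypothetical):

pt-online-schema-change --alter "ENGINE=InnoDB" D=mydb,t=title --execute

The “null” ALTER forces a full online rebuild of the table, which is what reclaims the space.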

But more on this: my usual purge scenario is a repetitive one. You purge, data fills up again, you purge again, and so on.

Which is why it doesn’t make much sense to rebuild the table and reclaim the disk space: it just grows back to roughly the same dimensions.
For a one-time operation (e.g. after cleanup has been neglected for a long time), yes, absolutely: rebuild and reclaim. For repetitive cleanup, I don’t bother.

Conclusion

Aaron does make note at the end of his post that DELETE operations can be done online, while the INSERT trick requires downtime, and this is a fair assessment.

But just to make a point: none of the DELETE timings are interesting. Since we are not concerned with deleting the rows within a given time frame (there is no “press the red button” moment), we can spread the deletes over time and make the impact negligible. So not only is everything done online, it also goes unnoticed by the user. And this, I believe, is the major thing to consider.

6 thoughts on “DELETE, don’t INSERT”

  1. Because large tables are usually in continual use by a live system, I found that lock delays of a few seconds or more (minutes, hours) are unacceptable. Therefore I took a different approach to this problem.

    I’ve created a script that:
    – creates a new table MY_TABLE_NEW
    – creates a merge-table MY_TABLE_MERGE combining MY_TABLE_NEW and MY_TABLE whereby inserts are done in MY_TABLE_NEW
    – renames MY_TABLE to MY_TABLE_OLD and, at the same time, MY_TABLE_MERGE to MY_TABLE

    This effectively creates a merge table with all the data, still accessible through the name MY_TABLE, where all new records go into MY_TABLE_NEW – which is of course optimized, as it is empty.

    Then the script slowly starts moving records within MY_TABLE: it takes a record (or fifty), deletes them and re-inserts them. That effectively moves them from MY_TABLE_OLD to MY_TABLE_NEW, slowly building an optimized MY_TABLE_NEW while the system continues to work.

    The chances of the system needing a specific record while it’s being deleted and inserted are very, very small with the tables I’ve used. Of course that depends highly on the way your system processes data. But for me, it worked flawlessly.

    I’ve used this script dozens of times on production systems. It copes with interruptions, can be run in parallel on several tables at the same time, and has virtually no downsides.

    Of course you cannot merge merge-tables, so when you already have merge-tables, this solution may not be for you.

    If you’d like to take a look at the script, let me know in a comment below.

    Cheers,
    Eelko de Vos
