Problem: you need to purge old rows from a table. This may be your weekly/monthly cleanup task. The table is large, the amount of rows to be deleted is large, and doing so in one big DELETE is too heavy.
You can use oak-chunk-update or pt-archiver to accomplish the task. You can also use server side scripting with QueryScript, offering a very simple syntax with no external scripting, dependencies and command line options.
I wish to present three cases of row deletion, with three different solutions. In all cases we assume some TIMESTAMP column exists in table, by which we choose to purge the row. In all cases we assume we wish to purge rows older than 1 month.
We assume the naive query is this:
DELETE FROM my_schema.my_table WHERE row_timestamp < CURDATE() - INTERVAL 1 MONTH
Case 1: TIMESTAMP column is indexed
I almost always index a timestamp column, if only for being able to quickly purge data (but usually also to slice data by date). In this case where the column is indexed, it’s very easy to figure out which rows are older than 1 month.
We break the naive query into smaller parts, and execute these in sequence:
while (DELETE FROM my_schema.my_table WHERE row_timestamp < CURDATE() - INTERVAL 1 MONTH ORDER BY row_timestamp LIMIT 1000)
throttle 1;
How does the above work?
QueryScript accepts a DELETE statement as a conditional expression in a while loop. The expression evaluates to TRUE when the DELETE affects rows. Once the DELETE ceases to affect rows (when no more rows match the WHERE condition), the while loop terminates.
The throttle command allows us to play nice: by throttling we increase the total runtime through sleeping in between loop iterations.
Case 2: TIMESTAMP column is not indexed, and there is no heuristic for matching rows
This case is hardest to tackle by means of optimization: there is no index, and we cannot assume or predict anything about the distribution of old rows. We must therefore scan the entire table so as to be able to purge old rows.
This does not mean we have to do one huge full table scan. As long as we have some way to split the table, we are still good. We can utilize the PRIMARY KEY or another UNIQUE KEY so as to break the table into smaller, distinct parts, and work our way on these smaller chunks:
split (DELETE FROM my_schema.my_table WHERE row_timestamp < CURDATE() - INTERVAL 1 MONTH)
throttle 1;
The split statement will automagically calculate the chunks and inject filtering conditions onto the query, such that each execution of the query relates to a distinct set of rows.
Case 3: TIMESTAMP column not indexed, but known to be monotonic
This is true for many tables. Rows with AUTO_INCREMENT columns and TIMESTAMP columns are created with CURRENT_TIMESTAMP values. This makes for a monotonic function: as the AUTO_INCREMENT grows, so does the TIMESTAMP.
This makes for the following observation: it we iterate the table row by row, and reach a point where the current row is not old, then we can stop looking. Timestamps will only increase by value, which means further rows only turn to be newer.
With this special case at hand, we can:
split (DELETE FROM my_schema.my_table WHERE row_timestamp < CURDATE() - INTERVAL 1 MONTH) {
if ($split_rowcount = 0)
break;
throttle 1;
}
split is a looping device, and a break statement works on split just as on a while statement.
split provides with magic variables which describe current chunk status. $split_rowcount relates to the number of rows affected by last chunk query. No more rows affected? This means we’ve hit recent rows, and we do not expect to find old rows any further. We can stop looking.