Merging tables with INSERT…ON DUPLICATE KEY UPDATE

Had a case recently where I had to merge data from two identically structured tables containing nearly identical data.

“Nearly identical” meaning most table data is identical in both; sometimes a row is missing from one of the tables; sometimes same row (according to PK) appears in both, but some columns are NULL is one tables (while others can be NULL in the second).

Otherwise no contradicting data: it was not possible for some data to be “3” in one table and “4” in the other.

How do you create a merge of the tables, such that all missing rows are completed, and NULLs replaced by actual values when possible?

pt-table-sync comes to mind: one can do a bidirectional syncing of two tables, and actually stating how to resolve ambiguities (like “greater value wins”). Very powerful! An example would be:

pt-table-sync --bidirectional --conflict-column=a --conflict-comparison=greatest --tables ...

However I didn’t actually have any problem with the tables themselves. The two tables were just fine as they were; missing or NULL data does not indicate an error on their part. I wanted to get their merge. pt-table-sync is still up for the job: we can duplicate them, merge on the copy… But I prefer a query over an external script when possible.

INSERT…ON DUPLICATE KEY UPDATE

This MySQL-specific syntax is actually quite powerful. It basically says “if the insert fails due to unique constraint, you get a chance to update the row causing the failure”. But it also allows for smart setting of the column via the VALUES() clause. Let’s present some sample data and then see the solution.

Assume the following table definition:

create table t1 (
  pkdt datetime,
  pki int,
  a int,
  b int,
  c int,
  primary key (pkdt, pki)
);

Same structure holds for t1, t2 and tmerge — our target table. Looking at table data we have:

> select * from t1;
+---------------------+-----+------+------+------+
| pkdt                | pki | a    | b    | c    |
+---------------------+-----+------+------+------+
| 2012-01-01 00:00:00 |   1 |    4 |    5 |    6 |
| 2012-01-02 00:00:00 |   2 | NULL |    5 |    6 |
| 2012-01-04 00:00:00 |   4 |    4 | NULL | NULL |
| 2012-01-05 00:00:00 |   5 | NULL |    8 | NULL |
+---------------------+-----+------+------+------+

> select * from t2;
+---------------------+-----+------+------+------+
| pkdt                | pki | a    | b    | c    |
+---------------------+-----+------+------+------+
| 2012-01-01 00:00:00 |   1 |    4 | NULL |    6 |
| 2012-01-03 00:00:00 |   3 |    4 |    5 |    6 |
| 2012-01-04 00:00:00 |   4 |    4 |    5 |    6 |
| 2012-01-05 00:00:00 |   5 |    7 | NULL |    9 |
+---------------------+-----+------+------+------+

We can see a row with pki=2 appears in t1 but not in t2; a row with pki=3 appears in t2 but not in t2, and various NULLs appear throughout the rows that are shared.

To get the shared table, we throw in the data from t1 and t2 into tmerge, in such way that a real value overwrites a NULL, like this:

insert into tmerge select * from t1;

insert into tmerge select * from t2
on duplicate key update
  a = ifnull(tmerge.a, values(a)),
  b = ifnull(tmerge.b, values(b)),
  c = ifnull(tmerge.c, values(c))
;

So even while I’m inserting values to tmerge, I’m able to check for current value, compared to the value I wish to insert, and have time to make a decision. This is really cool! The result:

> select * from tmerge;
+---------------------+-----+------+------+------+
| pkdt                | pki | a    | b    | c    |
+---------------------+-----+------+------+------+
| 2012-01-01 00:00:00 |   1 |    4 |    5 |    6 |
| 2012-01-02 00:00:00 |   2 | NULL |    5 |    6 |
| 2012-01-03 00:00:00 |   3 |    4 |    5 |    6 |
| 2012-01-04 00:00:00 |   4 |    4 |    5 |    6 |
| 2012-01-05 00:00:00 |   5 |    7 |    8 |    9 |
+---------------------+-----+------+------+------+

Just what we wanted: all possible rows inside; real value takes over NULL whenever possible.

3 thoughts on “Merging tables with INSERT…ON DUPLICATE KEY UPDATE”

Rick James says:

February 21, 2013 at 11:10 pm

Are these identical?
x = ifnull(tmerge.x, values(x))
x = coalesce(tmerge.x, values(x))
shlomi says:

February 22, 2013 at 6:27 am

Rick,
Yes
Pingback: Latest data Industry news round up, Log Buffer #308

This site uses Akismet to reduce spam. Learn how your comment data is processed.

How do you create a merge of the tables, such that all missing rows are completed, and NULLs replaced by actual values when possible?

INSERT…ON DUPLICATE KEY UPDATE

3 thoughts on “Merging tables with INSERT…ON DUPLICATE KEY UPDATE”

Leave a Reply