{"id":6120,"date":"2013-02-21T14:12:48","date_gmt":"2013-02-21T12:12:48","guid":{"rendered":"http:\/\/code.openark.org\/blog\/?p=6120"},"modified":"2013-02-21T14:12:48","modified_gmt":"2013-02-21T12:12:48","slug":"merging-tables-with-insert-on-duplicate-key-update","status":"publish","type":"post","link":"https:\/\/code.openark.org\/blog\/mysql\/merging-tables-with-insert-on-duplicate-key-update","title":{"rendered":"Merging tables with INSERT&#8230;ON DUPLICATE KEY UPDATE"},"content":{"rendered":"<p>Had a case recently where I had to merge data from two identically structured tables containing nearly identical data.<\/p>\n<p>&#8220;Nearly identical&#8221; meaning most table data is identical in both; sometimes a row is missing from one of the tables; sometimes same row (according to PK) appears in both, but some columns are NULL is one tables (while others can be NULL in the second).<\/p>\n<p>Otherwise no contradicting data: it was not possible for some data to be &#8220;3&#8221; in one table and &#8220;4&#8221; in the other.<\/p>\n<h4>How do you create a merge of the tables, such that all missing rows are completed, and NULLs replaced by actual values when possible?<\/h4>\n<p><a href=\"http:\/\/www.percona.com\/doc\/percona-toolkit\/2.1\/pt-table-sync.html\">pt-table-sync<\/a> comes to mind: one can do a bidirectional syncing of two tables, and actually stating how to resolve ambiguities (like &#8220;greater value wins&#8221;). Very powerful! An example would be:<\/p>\n<blockquote>\n<pre>pt-table-sync --bidirectional --conflict-column=a --conflict-comparison=greatest --tables ...<\/pre>\n<\/blockquote>\n<p>However I didn&#8217;t actually have any problem with the tables themselves. The two tables were just fine as they were; missing or NULL data does not indicate an error on their part. I wanted to get their merge. <em>pt-table-sync<\/em> is still up for the job: we can duplicate them, merge on the copy&#8230; But I prefer a query over an external script when possible.<\/p>\n<h4>INSERT&#8230;ON DUPLICATE KEY UPDATE<\/h4>\n<p>This <a href=\"http:\/\/dev.mysql.com\/doc\/refman\/5.5\/en\/insert-on-duplicate.html\">MySQL-specific syntax<\/a> is actually quite powerful. It basically says &#8220;if the insert fails due to unique constraint, you get a chance to update the row causing the failure&#8221;. But it also allows for smart setting of the column via the <strong>VALUES()<\/strong> clause. Let&#8217;s present some sample data and then see the solution.<!--more--><\/p>\n<p>Assume the following table definition:<\/p>\n<blockquote>\n<pre>create table t1 (\r\n\u00a0 pkdt datetime,\r\n\u00a0 pki int,\r\n\u00a0 a int,\r\n\u00a0 b int,\r\n\u00a0 c int,\r\n\u00a0 primary key (pkdt, pki)\r\n);<\/pre>\n<\/blockquote>\n<p>Same structure holds for <strong>t1<\/strong>, <strong>t2<\/strong> and <strong>tmerge<\/strong> &#8212; our target table. Looking at table data we have:<\/p>\n<blockquote>\n<pre>&gt; select * from t1;\r\n+---------------------+-----+------+------+------+\r\n| pkdt\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | pki | a\u00a0\u00a0\u00a0 | b\u00a0\u00a0\u00a0 | c\u00a0\u00a0\u00a0 |\r\n+---------------------+-----+------+------+------+\r\n| 2012-01-01 00:00:00 |\u00a0\u00a0 1 |\u00a0\u00a0\u00a0 4 |\u00a0\u00a0\u00a0 5 |\u00a0\u00a0\u00a0 6 |\r\n| 2012-01-02 00:00:00 |\u00a0\u00a0 2 | NULL |\u00a0\u00a0\u00a0 5 |\u00a0\u00a0\u00a0 6 |\r\n| 2012-01-04 00:00:00 |\u00a0\u00a0 4 |\u00a0\u00a0\u00a0 4 | NULL | NULL |\r\n| 2012-01-05 00:00:00 |\u00a0\u00a0 5 | NULL |\u00a0\u00a0\u00a0 8 | NULL |\r\n+---------------------+-----+------+------+------+\r\n\r\n&gt; select * from t2;\r\n+---------------------+-----+------+------+------+\r\n| pkdt\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | pki | a\u00a0\u00a0\u00a0 | b\u00a0\u00a0\u00a0 | c\u00a0\u00a0\u00a0 |\r\n+---------------------+-----+------+------+------+\r\n| 2012-01-01 00:00:00 |\u00a0\u00a0 1 |\u00a0\u00a0\u00a0 4 | NULL |\u00a0\u00a0\u00a0 6 |\r\n| 2012-01-03 00:00:00 |\u00a0\u00a0 3 |\u00a0\u00a0\u00a0 4 |\u00a0\u00a0\u00a0 5 |\u00a0\u00a0\u00a0 6 |\r\n| 2012-01-04 00:00:00 |\u00a0\u00a0 4 |\u00a0\u00a0\u00a0 4 |\u00a0\u00a0\u00a0 5 |\u00a0\u00a0\u00a0 6 |\r\n| 2012-01-05 00:00:00 |\u00a0\u00a0 5 |\u00a0\u00a0\u00a0 7 | NULL |\u00a0\u00a0\u00a0 9 |\r\n+---------------------+-----+------+------+------+<\/pre>\n<\/blockquote>\n<p>We can see a row with <strong>pki=2<\/strong> appears in <strong>t1<\/strong> but not in <strong>t2<\/strong>; a row with <strong>pki=3<\/strong> appears in <strong>t2<\/strong> but not in <strong>t2<\/strong>, and various <strong>NULL<\/strong>s appear throughout the rows that are shared.<\/p>\n<p>To get the shared table, we throw in the data from <strong>t1<\/strong> and <strong>t2<\/strong> into <strong>tmerge<\/strong>, in such way that a real value overwrites a <strong>NULL<\/strong>, like this:<\/p>\n<blockquote>\n<pre>insert into tmerge select * from t1;\r\n\r\ninsert into tmerge select * from t2\r\non duplicate key update\r\n\u00a0 a = <strong>ifnull<\/strong>(tmerge.a, <strong>values<\/strong>(a)),\r\n\u00a0 b = <strong>ifnull<\/strong>(tmerge.b, <strong>values<\/strong>(b)),\r\n\u00a0 c = <strong>ifnull<\/strong>(tmerge.c, <strong>values<\/strong>(c))\r\n;<\/pre>\n<\/blockquote>\n<p>So even while I&#8217;m inserting values to <strong>tmerge<\/strong>, I&#8217;m able to check for current value, compared to the value I wish to insert, and have time to make a decision. This is really cool! The result:<\/p>\n<blockquote>\n<pre>&gt; select * from tmerge;\r\n+---------------------+-----+------+------+------+\r\n| pkdt\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | pki | a\u00a0\u00a0\u00a0 | b\u00a0\u00a0\u00a0 | c\u00a0\u00a0\u00a0 |\r\n+---------------------+-----+------+------+------+\r\n| 2012-01-01 00:00:00 |\u00a0\u00a0 1 |\u00a0\u00a0\u00a0 4 |\u00a0\u00a0\u00a0 5 |\u00a0\u00a0\u00a0 6 |\r\n| 2012-01-02 00:00:00 |\u00a0\u00a0 2 | NULL |\u00a0\u00a0\u00a0 5 |\u00a0\u00a0\u00a0 6 |\r\n| 2012-01-03 00:00:00 |\u00a0\u00a0 3 |\u00a0\u00a0\u00a0 4 |\u00a0\u00a0\u00a0 5 |\u00a0\u00a0\u00a0 6 |\r\n| 2012-01-04 00:00:00 |\u00a0\u00a0 4 |\u00a0\u00a0\u00a0 4 |\u00a0\u00a0\u00a0 5 |\u00a0\u00a0\u00a0 6 |\r\n| 2012-01-05 00:00:00 |\u00a0\u00a0 5 |\u00a0\u00a0\u00a0 7 |\u00a0\u00a0\u00a0 8 |\u00a0\u00a0\u00a0 9 |\r\n+---------------------+-----+------+------+------+<\/pre>\n<\/blockquote>\n<p>Just what we wanted: all possible rows inside; real value takes over <strong>NULL<\/strong> whenever possible.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Had a case recently where I had to merge data from two identically structured tables containing nearly identical data. &#8220;Nearly identical&#8221; meaning most table data is identical in both; sometimes a row is missing from one of the tables; sometimes same row (according to PK) appears in both, but some columns are NULL is one [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"enabled":false},"version":2}},"categories":[5],"tags":[21,20],"class_list":["post-6120","post","type-post","status-publish","format-standard","hentry","category-mysql","tag-sql","tag-syntax"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p2bZZp-1AI","_links":{"self":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/6120","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/comments?post=6120"}],"version-history":[{"count":15,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/6120\/revisions"}],"predecessor-version":[{"id":6135,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/6120\/revisions\/6135"}],"wp:attachment":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/media?parent=6120"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/categories?post=6120"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/tags?post=6120"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}