The REPLACE [INTO] syntax allows us to INSERT a row into a table, except that if a UNIQUE KEY (including PRIMARY KEY) violation occurs, the old row is deleted prior to the new INSERT, hence no violation.
Sounds very attractive, and the syntax is nice as well: the same syntax as a normal INSERT INTO. It’s certainly nicer than INSERT INTO … ON DUPLICATE KEY UPDATE, and certainly shorter than running a SELECT to see whether a row exists, then issuing either an INSERT or an UPDATE.
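To make the comparison concrete, here is a minimal sketch. The counters table, its columns, and the values are all made up for illustration:

    -- A hypothetical counters table, made up for this example:
    CREATE TABLE counters (
      name  VARCHAR(64) NOT NULL PRIMARY KEY,
      value INT NOT NULL DEFAULT 0
    );

    -- Both statements insert the row, or overwrite it if 'hits' already exists:
    REPLACE INTO counters (name, value) VALUES ('hits', 1);

    INSERT INTO counters (name, value) VALUES ('hits', 1)
      ON DUPLICATE KEY UPDATE value = VALUES(value);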
But weak-hearted people like myself should be aware of the following: it is a heavyweight solution. It may be just what you were looking for in terms of ease of use, but the fact is that on duplicate keys a DELETE and an INSERT are performed, and this calls for a closer look.
Whenever a row is deleted, all indexes need to be updated, most importantly the PRIMARY KEY. When a new row is inserted, the same happens. Especially on InnoDB tables (because of their clustered nature), this means much overhead. Restructuring an index is an expensive operation: index nodes may need to be merged upon DELETE, and nodes may need to be split due to INSERT. After many REPLACE INTO executions, your index is most probably more fragmented than it would have been had you used SELECT/UPDATE or INSERT INTO … ON DUPLICATE KEY UPDATE.
Also, there’s the notion of “well, if the row isn’t there, we create it. If it’s there, it simply gets updated”. This is false. The row doesn’t just get updated, it is completely removed. The problem is, if there’s a PRIMARY KEY on that table, and the REPLACE INTO does not specify a value for the PRIMARY KEY (for example, it’s an AUTO_INCREMENT column), the new row gets a new, different AUTO_INCREMENT value, and this may not be what you were looking for in terms of behavior.
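A minimal sketch of that pitfall, assuming a hypothetical users table with an AUTO_INCREMENT primary key (table and column names are invented for illustration):

    -- A made-up users table; id is AUTO_INCREMENT, email is the unique handle:
    CREATE TABLE users (
      id    INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
      email VARCHAR(128) NOT NULL,
      UNIQUE KEY (email)
    );

    INSERT INTO users (email) VALUES ('a@example.com');  -- row gets id = 1
    REPLACE INTO users (email) VALUES ('a@example.com'); -- old row is deleted,
                                                         -- new row gets id = 2

Any other row that referenced the old id = 1 now points at nothing.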
Many uses of REPLACE INTO have no intention of changing PRIMARY KEY (or other UNIQUE KEY) values. In that case, it’s better left alone. On a production system I’ve seen, changing REPLACE INTO to INSERT INTO … ON DUPLICATE KEY UPDATE resulted in a tenfold increase in throughput (measured in queries per second) and a drastic decrease in I/O operations and in load average.
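A sketch of the kind of rewrite involved, reusing the hypothetical users table from above and assuming an added last_seen DATETIME column:

    -- Instead of:
    REPLACE INTO users (email, last_seen) VALUES ('a@example.com', NOW());

    -- prefer this; the existing id and all untouched columns survive:
    INSERT INTO users (email, last_seen) VALUES ('a@example.com', NOW())
      ON DUPLICATE KEY UPDATE last_seen = VALUES(last_seen);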
@tandu,
see Arjen’s comment #1 on REPLACE deleting more than one row when more than one constraint matches.
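For the record, here is a minimal sketch of that behavior, with a made-up accounts table carrying two unique constraints:

    -- A made-up accounts table with two unique constraints (id and email):
    CREATE TABLE accounts (
      id    INT NOT NULL PRIMARY KEY,
      email VARCHAR(128) NOT NULL,
      UNIQUE KEY (email)
    );

    INSERT INTO accounts VALUES (1, 'a@example.com');
    INSERT INTO accounts VALUES (2, 'b@example.com');

    -- Conflicts with row 1 on email AND with row 2 on the PRIMARY KEY,
    -- so REPLACE deletes both rows, then inserts the single new one:
    REPLACE INTO accounts VALUES (2, 'a@example.com');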
Anyway, I wouldn’t go as far as to fear using either REPLACE or ON DUPLICATE KEY. It’s like copy-pasting the entire table for backup before doing a DELETE. You should have a perfect understanding of how both work, and based on that decide whether you want to use them…
I vote for MySQL implementing this UPSERT command… There are a lot of things in MySQL which don’t follow the ANSI SQL standard!
Newly on board as the lead developer for my company, I have already run into this issue BIGTIME.
The previous developer(s) were quite fond of the “REPLACE INTO” syntax; the company’s web app was littered with it. Many problems plagued the system, the most common being the “random” loss of data.
As it turned out, the database had been in use for many years, with some tables holding more than 1M records and keyed on composite keys. Inevitably, someone would complain to the mother company’s I.T. dept about the app being slow. Mother company’s I.T. would simply restart the MySQL instance to get performance back up. This occurred many times per day.
Of course, the sluggishness was due to the fact that the db engine was rebuilding the indexes for large tables. Restarting the server simply resulted in nightmarish situations.
Long story short: a good db programmer SHOULD understand the semantics of “REPLACE INTO” and carefully consider whether it is necessary.
One thing that was not mentioned in this article was the additional overhead of foreign-key relationships and how those tables are impacted.
@Tony,
Thanks for sharing this story. You are right about the overhead of foreign keys. This is to some extent due to the indexes on which the foreign keys rely, and, well, due to the cascaded deletion of rows in child tables, hence the “random loss of data”…
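A minimal sketch of that foreign-key hazard, with made-up parent/child tables and an assumed ON DELETE CASCADE rule:

    -- Made-up parent/child tables; InnoDB enforces the foreign key:
    CREATE TABLE parents (
      id INT NOT NULL PRIMARY KEY
    ) ENGINE=InnoDB;

    CREATE TABLE children (
      id        INT NOT NULL PRIMARY KEY,
      parent_id INT NOT NULL,
      FOREIGN KEY (parent_id) REFERENCES parents (id) ON DELETE CASCADE
    ) ENGINE=InnoDB;

    INSERT INTO parents VALUES (1);
    INSERT INTO children VALUES (10, 1);

    -- The implicit DELETE cascades: the child row (10, 1) silently vanishes
    -- before the parent row is reinserted.
    REPLACE INTO parents VALUES (1);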
One of the most useful features of ON DUPLICATE KEY UPDATE is that not all columns need to be specified in the UPDATE clause.
This allows adding new rows which either are fully qualified or use some or all column defaults while only modifying the desired subset of columns when the key is already present.
REPLACE INTO fails in this respect: once the row is deleted, only the values given in the new statement are populated, so you had better have supplied every column you don’t want reset to NULL or its default for existing rows, in which case you’re basically reinserting entire rows anyway.
This could well be another source of the data loss @Tony happened upon.
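A short sketch of the difference, using a made-up profiles table:

    -- A made-up profiles table; bio has no explicit default (so NULL):
    CREATE TABLE profiles (
      user_id INT NOT NULL PRIMARY KEY,
      bio     TEXT,
      visits  INT NOT NULL DEFAULT 0
    );

    INSERT INTO profiles (user_id, bio) VALUES (1, 'hello');

    -- Touches only visits; bio on the existing row is left intact:
    INSERT INTO profiles (user_id, visits) VALUES (1, 1)
      ON DUPLICATE KEY UPDATE visits = visits + 1;

    -- Deletes the old row first; bio silently reverts to NULL:
    REPLACE INTO profiles (user_id, visits) VALUES (1, 1);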