REPLACE INTO: think twice

December 17, 2008

The REPLACE [INTO] syntax allows us to INSERT a row into a table, except that if a UNIQUE KEY (including PRIMARY KEY) violation occurs, the old row is deleted prior to the new INSERT, hence no violation.

Sounds very attractive, and has a nice syntax as well: the same syntax as a normal INSERT INTO's. It certainly has a nicer syntax than INSERT INTO ... ON DUPLICATE KEY UPDATE, and it's certainly shorter than using a SELECT to see if a row exists, then doing either INSERT or UPDATE.

But weak hearted people as myself should be aware of the following: it is a heavyweight solution. It may be just what you were looking for in terms of ease of use, but the fact is that on duplicate keys, a DELETE and INSERT are performed, and this calls for a closer look.

Whenever a row is deleted, all indexes need to be updated, and most importantly the PRIMARY KEY. When a new row is inserted, the same happens. Especially on InnoDB tables (because of their clustered nature), this means much overhead. The restructuring of an index is an expensive operation. Index nodes may need to be merged upon DELETE. Nodes may need to be split due to INSERT. After many REPLACE INTO executions, it is most probable that your index is more fragmented than it would have been, had you used SELECT/UPDATE or INSERT INTO ... ON DUPLICATE KEY

Also, there's the notion of "well, if the row isn't there, we create it. If it's there, it simply get's updated". This is false. The row doesn't just get updated, it is completely removed. The problem is, if there's a PRIMARY KEY on that table, and the REPLACE INTO does not specify a value for the PRIMARY KEY (for example, it's an AUTO_INCREMENT column), the new row gets a different value, and this may not be what you were looking for in terms of behavior.

Many uses of REPLACE INTO have no intention of changing PRIMARY KEY (or other UNIQUE KEY) values. In that case, it's better left alone. On a production system I've seen, changing REPLACE INTO to INSERT INTO ... ON DPLICATE KEY resulted in a ten fold more throughput (measured in queries per second) and a drastic decrease in IO operations and in load average.

tags:
posted in MySQL by shlomi

« | »

Follow comments via the RSS Feed | Leave a comment | Trackback URL

23 Comments to "REPLACE INTO: think twice"

  1. Arjen Lentz wrote:

    Two things...

    REPLACE is completely different from the INSERT ... ON DUPLICATE KEY UPDATE ... because replace removes and replaces rows complete whereas the no dup key update construct allows you to update values in the existing row.

    If I remember correctly, MySQL will actually re-use the index slot, if possible, so it's not as inefficient as you might think. In any case, the alternative equivalent to REPLACE is DELETE+INSERT, which is definitely no faster and quite likely slower.

    By the way, REPLACE can actually delete more than one row, this can happen if you also have other UNIQUE constraints. Again, this can be a very useful feature, when used appropriately.

  2. shlomi wrote:

    Arjen, thanks,

    With regard to re-using the index slot, consider the following:

    DROP TABLE IF EXISTS `person`;
    CREATE TABLE `person` (
    `person_id` int(11) NOT NULL auto_increment,
    `name` varchar(60) character set utf8 NOT NULL,
    `social_security_number` bigint(20) NOT NULL,
    PRIMARY KEY (`person_id`),
    UNIQUE KEY `social_security_number` (`social_security_number`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;

    TRUNCATE TABLE person;
    INSERT INTO person (name, social_security_number) VALUES ('Alice', 123456);
    INSERT INTO person (name, social_security_number) VALUES ('Bob', 787878);
    REPLACE INTO person (name, social_security_number) VALUES ('Bob', 787878);

    Both on InnoDB and on MyISAM, the results are:

    +-----------+-------+------------------------+
    | person_id | name | social_security_number |
    +-----------+-------+------------------------+
    | 1 | Alice | 123456 |
    | 3 | Bob | 787878 |
    +-----------+-------+------------------------+

    If the primary key value changes, I don't think the slot can be reused. What do you think?

    With regard to having more than one row removed - nice! I've never encountered such a situation. But still it seems to me more a hack. I agree that it must be used appropriately.

  3. mike wrote:

    This has bitten me a couple times, and usually what I really want is the ON DUPLICATE KEY, not the REPLACE INTO. Not for space reusage but for the row with the unique key I wanted to update if it existed.

  4. Steve W wrote:

    I would add that the ON DUPLICATE KEY performs the operation much faster then a REPLACE INTO. It's probably obvious that this true, but it can cut the processing time more then half if you don't have that many updates to run.

    On just a 256 row database, my page execution times went from 1.2 seconds to .43 seconds. This was doing a REPLACE into on all the rows (running the same sql twice). By updating it to do the ON DUPLICATE, the same script is much faster.

    The database is running some triggers and stored procedures to update a random row scheme. By not having the DELETE/INSERT these triggers are never called. YMMV on the actual speed increase.

    It takes a bit more planning with PHP scripts to account for the unique keys and not include them in the ON DUPLICATE portion, but it's worth it.

  5. shlomi wrote:

    Mike - exactly my point. I have the (unconfirmed) feeling most people are just using REPLACE INTO to do an "insert if not exists or update if exists", in which case it is not the optimal solution.

    Steve - thanks for sharing

  6. Brian wrote:

    It is worth noting, however, that performing an INSERT/ON DUPLICATE KEY UPDATE, because of its dual nature, will not give predictable results when requesting the "last insert ID" from the DB engine. The last insert ID is only updated when the engine performs an update.

  7. saviola wrote:

    The two statements have different impact for INNODb table.
    The REPLACE INTO acts as DELETE/INSERT for duplicates.
    The INSERT ON DUPLIACTE UPDATE is true update.

    If you have a child table defined with “on delete CASCADE”, the REPLACE INTO will delete the child record too.

    I will use INSERT ON DUPLICATE rather than the REPLACE INTO for the above reason.

  8. ricardo wrote:

    i think you're missing the point of the replace query. It is not meant to substitute insert on duplicate key update, but it is the only solution if you want to do bulk updates coming from a subquery. insert on duplicate key update works well if you want to update a few rows at a time, not sure how u can use it to update thousands of rows w/ a single subquery.

  9. shlomi wrote:

    @ricardo,

    Agreed on bulk updates.

    Nevertheless it is common to see people using REPLACE INTO when they could have used INSERT INTO ... ON DUPLICATE KEY UPDATE for single rows.

  10. Jeremy wrote:

    I find the syntax for ON DUPLICATE KEY UPDATE to be quite obnoxious. I should be able to write a single SQL statement that either inserts or updates.

    Lets say I have an object with some properties. I want to store those properties in a table. In my application I have to write two queries: One query to INSERT the row to the table, and another query to UPDATE that same row. To make matters worse, the syntax for 'UPDATE' is very arcane.

    The INSERT INTO ... ON DUPLICATE KEY UPDATE really does not help here much because I still have to *write* both the INSERT statement and the UPDATE statement even though they are in the same query now, for example:

    INSERT INTO objects (val1, val2, val3, val4) VALUES (1, 2, 3, 4) ON DUPLICATE KEY UPDATE val1=1, val2=2, val3=3, val4=4;

    I should not *need* to specify val1=1, val2=2, val3=3, val4=4 at the end of the query. I want to update ALL of the columns, why isn't there an option for that?

    REPLACE INTO also does not help me? Why? People seem to overlook the fact that a "relational database" is *supposed* to have foreign keys and the restraints that go with them! If you use REPLACE INTO, it first DELETES the existing record (and all child records), and then re-inserts a new copy of the record. How useless is that? At least give me an option to disable the foreign key constraints when using REPLACE, or give me a usable ON DUPLICATE KEY UPDATE syntax.

  11. shlomi wrote:

    @Jeremy,
    I think you're looking for what's "known" as the UPSERT statement. There isn't one.
    Really, the SQL syntax, as defined by ANSI SQL, is far from being friendly.

  12. Z wrote:

    Jeremy: You can easy reuse values in the UPDATE part from the INSERT.

    INSERT INTO objects (val1, val2, val3, val4) VALUES (1, 2, 3, 4) ON DUPLICATE KEY UPDATE val1=VALUES(val1), val2=VALUES(val2), val3=VALUES(val3), val4=VALUES(val4);

    http://dev.mysql.com/doc/refman/5.1/en/miscellaneous-functions.html#function_values

  13. Some Dude wrote:

    @Z

    How is that any better? It's still a lot of extra typing.

  14. Jeremy wrote:

    @Z,

    I know about the val1=VALUES(val1), val2=VALUES(val2), val3=VALUES(val3), val4=VALUES(val4); option, however this increases the size of the SQL dramatically. It's not uncommon for there to be tables with 100+ columns, updating all of those columns would require my application to generate SQL statements that are many times longer, which slows down the query generation, and increases the time it takes to parse the query on the other end. It's not necessarily extra 'typing', but it's just inefficient all the way around, especially when you need to update many rows at a time.

  15. tandu wrote:

    Nice article. REPLACE INTO is dangerous, as it can actually cause you to lose data unpredictably. This has happened to me before. Didn't know about INSERT INTO .. ON DUPLICATE KEY UPDATE before this article. Seems okay, but I don't know. I like to update and then do an insert if no rows were affected. Silly, but not dangerous.

  16. shlomi wrote:

    @tandu,
    see Arjen's comment #1 on REPLACE deleting more than one row when more than one contraint matches.

    Anyway I wouldn't go as far as to fear using wither REPLACE or ON DUPLICATE KEY. It's like copy_pasting the entire tabel for backup befroe doing a DELETE. You should have perfect understanding on how both work, and based on that decide whether you want to use them...

  17. Renowned Media wrote:

    I vote for MySQL implementing this UPSERT command... There are a lot of things in MySQL which don't follow the ANSI SQL standard!

  18. Tony wrote:

    Newly onboard as the lead developer for my company, I have already run into this issue BIGTIME.

    The previous developer(s) were quite fond of the "REPLACE INTO" syntax; the company's web app was littered with it. Many problems plagued the system, the most common was the "random" loss of data.

    As it turned out, the database has been in use for many years; some tables holding more than 1M records, as well as keyed with composites. Inevitably, someone would complain to the mother company's I.T. dept about the app being slow. Mother company's I.T. would just simply restart the MySQL instance in order to get performance back up. This occurred many times per day.

    Of course, the sluggishness was due to the fact that the db engine was rebuilding the indexes for large tables. Restarting the server simply resulted in nightmarish situations.

    Long story short: a good db programmer SHOULD understand the semantics of "REPLACE INTO" and carefully consider whether it is necessary.

    One thing that was not mentioned in this article was the additional overhead of foreign-key relationships and how those tables are impacted.

  19. shlomi wrote:

    @Tony,
    Thanks for sharing this story. You are right about the overhead of foreign keys. This is to some extent due to the indexes on which the foreign keys rely, and, well, the deletion of child tables with "random loss of data"...

  20. rsandwick3 wrote:

    One of the most useful features of ON DUPLICATE KEY UPDATE is that not all columns need to be specified in the UPDATE clause.

    This allows adding new rows which either are fully qualified or use some or all column defaults while only modifying the desired subset of columns when the key is already present.

    REPLACE INTO fails in this respect -- once the row is deleted, only the values in the new definition will be populated, so you'd better have somehow managed to define everything you don't want unset/defaulted for existing rows, in which case you're basically reinserting the entire table anyway.

    This could well be another source of the data loss @Tony happened upon.

  21. Sunset wrote:

    Thanks guys, yesterday I spent 3 hours trying to get the best method of INSERT INTO/REPLACE INTO working for my site tweet "approver" where we scan for tweets and manually approve the good ones that can be displayed on our site.

    I eventually settled for a bit of a mix of on duplicate col=VALUES(col) and col2=col2 as before we save the tweets we do some preg_replace stuff on it so no need to update it with the raw tweet again:

    $sql = "INSERT INTO tweets (id, tweet, date, who, approved)
    VALUES (\"$twitter_tweet_id\", \"{$this->CI->db->escape($tweet_val->title)}\", \"{$this->CI->db->escape($tweet_val->pubDate)}\", \"{$this->CI->db->escape($tweet_val->author)}\", 'no')
    ON DUPLICATE KEY UPDATE id=VALUES(id), tweet=tweet, date=VALUES(date), who=who, approved=approved";

    I vote for UPSERT, too though; would have been much easier.

    Cheers,

    Kevin

  22. replace into | Brinsmead Data Services wrote:

    [...] had been using mysql’s replace into syntax until I read this article http://code.openark.org/blog/mysql/replace-into-think-twice which makes a lot of [...]

  23. parminder Singh wrote:

    REPLACE INTO is very good i often use this But the problem is that if you have used foreign key constraints On delete cascade than this will be big headache for you , actually i have already faced this problem , That's why i sharing with you guys, coz may be it will be helpful for some body

Leave Your Comment

 

 
Powered by Wordpress and MySQL. Theme by openark.org