The mystery of MySQL 5.6 excessive buffer pool flushing

April 20, 2014

I'm experimenting with upgrading to MySQL 5.6 and am seeing an unexplained increase in disk I/O utilization. After discussing this with several people, I'm publishing in the hope that someone can shed some light on it.

We have a few dozen servers in a normal replication topology. On this particular topology we've already determined that STATEMENT-based replication is faster than ROW-based replication, and so we use SBR. We have two different workloads on our slaves, applied by two different HAProxy groups, across three different data centres. Hardware-wise, the servers of the two groups use either Virident SSD cards or normal SAS spindle disks.

Our servers are I/O bound. A common query used by both workloads looks up data that does not necessarily have a hotspot, and is very large in volume. DML is low, and we only have a few hundred statements per second executed on master (and propagated through replication).

We have upgraded 6 servers, across all data centres, to 5.6, both on SSD and on spindle disks, and are seeing the following phenomena:

  • A substantial increase in disk I/O utilization. See the 10-day breakdown below (the upgrade is visible on 04/14); it goes on like this many days later:

    [Graph: 5.5-to-5.6 disk utilization, 10 days]

  • A substantial increase in InnoDB buffer pool pages flushed: Mr. Blue is our newly upgraded server; it joins Mr. Green, upgraded a couple of weeks ago. Mr. Red is still on 5.5. This is the only MySQL graph that I could directly relate to the increase in I/O:

    [Graph: 5.5-to-5.6 rise in InnoDB buffer pool pages flushed]

  • No change in workload (it takes some 60 minutes for caches to warm up, so we measure after that time). Same equal share of serving as dictated by HAProxy. Same number of queries. Same amount of everything.
  • Faster replication speed, on a single thread - that's the good part! We see a 30% or more improvement in replication speed. This was tested by stopping the slave SQL thread for a pre-defined number of minutes, then measuring the time it took for the slave to catch up, down to 10 seconds of lag (sketched below). The results vary depending on the time of day and the serving workload on the slaves, but it is consistently far faster with 5.6.
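
How one round of that measurement looks, roughly (a sketch only; in practice the waiting and the wall-clock timing would be scripted around the mysql client):

    STOP SLAVE SQL_THREAD;
    -- ...leave the SQL thread stopped for a pre-defined number of minutes...
    START SLAVE SQL_THREAD;
    -- ...then poll the slave and record the wall-clock time it takes until
    -- Seconds_Behind_Master drops below 10:
    SHOW SLAVE STATUS\G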

The faster replication speed motivates us to continue with the experiment, and is a significant factor in our decision. However, we are concerned about the I/O utilization and excessive flushing.

The above graphs depict 5.6 behaviour without any configuration changes as compared to 5.5. Over some days I reconfigured the following variables, with no change to the rate of flushed pages (though some changes were visible in doublewrite buffer writes):

  • innodb_log_file_size = 128M/2G
  • innodb_adaptive_flushing = 0/1
  • innodb_adaptive_flushing_lwm = 0/70
  • innodb_max_dirty_pages_pct = 75/90
  • innodb_flush_neighbors = 0/1
  • innodb_max_dirty_pages_pct_lwm = 75/90
  • innodb_old_blocks_time = 0/1000
  • innodb_io_capacity = 50/100/200
  • innodb_io_capacity_max = 50/100/1000
  • relay_log_info_repository = 'table'/'file'
  • master_info_repository = 'table'/'file'
  • default_tmp_storage_engine = 'myisam'/'innodb'
  • eq_range_index_dive_limit = 0/10

And more... I patiently changed these one by one, or in combinations where it made sense. As you can see, I began with the usual suspects and moved on to more esoteric stuff. I concentrated on variables new in 5.6, ones whose defaults have changed, or ones whose defaults we had explicitly overridden.
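
For illustration, here is one such round-trip as a minimal sketch (the variable choice and values are arbitrary; the flush rate is read off the Innodb_buffer_pool_pages_flushed status counter):

    SET GLOBAL innodb_io_capacity     = 100;
    SET GLOBAL innodb_io_capacity_max = 100;

    -- sample the flush counter twice to estimate pages flushed per second:
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_flushed';
    -- ...wait, say, 60 seconds...
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_flushed';
    -- (second value - first value) / 60 = pages flushed per second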

The above is consistent on all upgraded servers. On SSD the disk utilization is lower, but still concerning.

Our use case is very different from the one presented by Yoshinori Matsunobu, and apparently not too many have experience upgrading to 5.6. I'm hoping someone might shed some light.

posted in MySQL by shlomi


15 Comments to "The mystery of MySQL 5.6 excessive buffer pool flushing"

  1. Justin Swanhart wrote:

    It isn't on the list, but did you try to tweak innodb_io_capacity in combination with innodb_lru_scan_depth?

    The page cleaner might be too zealous for your workload and decreasing innodb_lru_scan_depth might help.
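
    For example, something like this (values purely illustrative; both variables are dynamic in 5.6):

    SET GLOBAL innodb_io_capacity    = 200;
    SET GLOBAL innodb_lru_scan_depth = 100;  -- 100 is the minimum allowed value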

  2. Lalit wrote:

    Hello Justin,
    Thank you for sharing the buffer pool flushing issue.

    We have also experienced a few challenges in the past after upgrading from 5.5 to 5.6.

    1. After the upgrade, CPU usage was reaching up to 1000% (24 cores and 64 GB RAM), whereas it normally reaches up to 150% max in peak hours. Checking the processlist showed one complex query with a join causing the problem; after query optimization it was resolved.

    2. With geographical data replication (GTID enabled), our slave database was lagging. The value of Seconds_Behind_Master was jumping from 0 to a random high value, and then back to 0.
    It got fixed by setting:
    SET GLOBAL slave_compressed_protocol=1;
    SET GLOBAL slave_net_timeout=60;

    Now we are facing a memory leak issue (MySQL 5.6.15): mysqld process memory keeps increasing, and we have to restart the MySQL service to release it.

  3. shlomi wrote:

    @Justin,
    Double checked. With:

    +------------------------+-------+
    | Variable_name          | Value |
    +------------------------+-------+
    | innodb_io_capacity     | 100   |
    | innodb_io_capacity_max | 100   |
    +------------------------+-------+
    

    and

    +-----------------------+-------+
    | Variable_name         | Value |
    +-----------------------+-------+
    | innodb_lru_scan_depth | 100   |
    +-----------------------+-------+
    

    (100 being the minimum allowed value for innodb_lru_scan_depth)

    -- there is no change at all in flushing rate.

  4. Laurynas wrote:

    Do you leave innodb_buffer_pool_instances at its default? If you do, you have 1 instance in 5.5 and 8 instances in 5.6. More instances cause more flushing.

    If the number of instances is the same, or cannot explain the difference by itself, then it is interesting that tuning innodb_log_file_size, innodb_adaptive_flushing*, and innodb_io_capacity_* did not have an effect on the flushing rate. This sort of suggests that it is LRU flushing that is going on. Can you check whether it is LRU or flush list flushing?

    If it is LRU flushing, you will need to tune innodb_lru_scan_depth - but then again, it has a rather low default, and increasing it is likely to result in even more flushing.

    If it is flush list flushing, how does the checkpoint age compare between 5.5 and 5.6?

    We also found, while developing XtraDB 5.6, that 5.6 flushed more than 5.5, which we addressed by tuning the adaptive flushing formula - but that too-aggressive flushing only became visible after several other flushing performance fixes we implemented. It would be interesting to see how much XtraDB flushes in your setup.
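
    A minimal way to eyeball the checkpoint age on both versions, for comparison (a sketch assuming nothing beyond the standard SHOW ENGINE INNODB STATUS output):

    -- checkpoint age (in bytes) = "Log sequence number" - "Last checkpoint at",
    -- both printed in the LOG section of:
    SHOW ENGINE INNODB STATUS\G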

  5. shlomi wrote:

    @Laurynas,
    innodb_buffer_pool_instances=8/22, explicitly set (with 5.5 it was 8; when upgraded, left at 8, later changed to 22; same flushing in both cases)

  6. Dimitri Kravtchuk wrote:

    [posting on behalf of Dimitri; SN]

    What is great is that since MySQL 5.6 there is more and more instrumentation generally present in the code, and yet more in InnoDB via its METRICS table.

    So, let's first try to understand where the suspected flushing is coming from.

    First of all you have to enable the METRICS counters:
    mysql> set global innodb_monitor_enable = '%';

    (or just have innodb_monitor_enable='%' in your my.cnf file on startup)

    Then, using information_schema, start to monitor your flushing activity:

    mysql> select name, count from INNODB_METRICS where name like '%flush%tot%';
    +-------------------------------------+-------+
    | name                                | count |
    +-------------------------------------+-------+
    | buffer_flush_batch_total_pages      |    96 |
    | buffer_flush_neighbor_total_pages   |     0 |
    | buffer_flush_adaptive_total_pages   |     0 |
    | buffer_flush_sync_total_pages       |     0 |
    | buffer_flush_background_total_pages |    96 |
    | buffer_LRU_batch_flush_total_pages  |     0 |
    +-------------------------------------+-------+
    6 rows in set (0.00 sec)
    

    (Of course it will be more helpful to graph the diff/sec of these values, to see the tendency.)

    Once you have an idea which flush processing your I/O activity is related to, you can then take a look at all the %flush% counters, etc.
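
    A minimal sketch of getting such a per-second diff from within the mysql client (in a real setup you would feed the deltas into a graphing system; the table and counter names are the ones from the query above):

    -- take a first sample of the flush counters
    CREATE TEMPORARY TABLE flush_sample
      SELECT name, count FROM information_schema.INNODB_METRICS
      WHERE name LIKE '%flush%tot%';

    SELECT SLEEP(10);   -- sampling interval: 10 seconds

    -- compare against a second sample and compute pages flushed per second
    SELECT m.name, (m.count - s.count) / 10 AS pages_per_sec
    FROM information_schema.INNODB_METRICS m
    JOIN flush_sample s USING (name)
    WHERE m.name LIKE '%flush%tot%';

    DROP TEMPORARY TABLE flush_sample;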

  7. shlomi wrote:

    Laurynas, Dimitri,
    Thanks both - I will take further measurements and report.

  8. shlomi wrote:

    Hair turning grey. After some hunting, I came upon this:

    mysql> select name, count from INNODB_METRICS where name like '%buff%pool%page%';
    +-------------------------+----------+
    | name                    | count    |
    +-------------------------+----------+
    | buffer_pool_pages_total |  3014627 |
    | buffer_pool_pages_misc  | -1808588 |
    | buffer_pool_pages_data  |  4822230 |
    | buffer_pool_pages_dirty |   488383 |
    | buffer_pool_pages_free  |      985 |
    +-------------------------+----------+
    

    At this stage I will stop the 5.6 evaluation and wait for higher forces to put sense into life. I have submitted a bug report: http://bugs.mysql.com/bug.php?id=72423

  9. shlomi wrote:

    For completeness, it seems the weird buffer_pool_pages_* values are actually valid, according to this: http://bugs.mysql.com/bug.php?id=59550 (the way I understand it, compressed pages are counted inconsistently with the total number of pages)

  10. Rolando Edwards wrote:

    It's ironic how the MySQL documentation on "Controlling the Flushing Rate of Dirty Pages from the InnoDB Buffer Pool" reads a little differently of late.

    MySQL 5.5 says : InnoDB performs certain tasks in the background, including flushing of dirty pages (those pages that have been changed but are not yet written to the database files) from the buffer pool, a task performed by the master thread. Currently, InnoDB aggressively flushes buffer pool pages if the percentage of dirty pages in the buffer pool exceeds innodb_max_dirty_pages_pct.

    MySQL 5.6 says : InnoDB performs certain tasks in the background, including flushing of dirty pages (those pages that have been changed but are not yet written to the database files) from the buffer pool. Currently, InnoDB flushes buffer pool pages if the percentage of dirty pages in the buffer pool exceeds innodb_max_dirty_pages_pct.

    Interesting how the MySQL 5.6 doc does not say "aggressively". Maybe something did change.

    Someone should look into whether the introduction of innodb_flush_neighbors may have altered the way flushing is evaluated and subsequently performed.

    The option innodb_flush_neighbors was introduced in MySQL 5.6. Its default behavior is to flush contiguous dirty pages.

    MySQL 5.5 flushing can rightly be called "aggressive" if it flushes anything dirty in the buffer pool, contiguous or not.

    MySQL 5.6 is supposed to be conservative in its default innodb_flush_neighbors setting. If the flushing algorithm has to look for contiguous dirty pages only, there should be reason for excessive flushing.

    If this algorithm is faulty and not properly taking innodb_max_dirty_pages_pct into account, this excessive flushing could possibly be a manifestation of such code.

    Of course, all this is just my conjecture. Notwithstanding, I would like to see what your submitted bug report turns up.

  11. Rolando Edwards wrote:

    Correction on my post: I meant to say "If the flushing algorithm has to look for contiguous dirty pages only, there should be no reason for excessive flushing."

  12. shlomi wrote:

    @Rolando, as per my comment #9, my submitted bug report turned out to be "valid behaviour". I am still without a solution for the excessive flushing.

  13. Vassil Velichkov wrote:

    We have upgraded one of our MySQL servers from Percona MySQL 5.5 to Percona MySQL 5.6, and at the same time we upgraded another server from plain 5.5 to 5.6. After the upgrade, without any changes in the application code, both servers experience constant spikes in flushing activity, well beyond the innodb_io_capacity_max setting. In the last 20 hours I've tried every combination of the same parameters described by Shlomi in his post, plus some more:
    innodb_flush_method = O_DIRECT / O_DIRECT_NO_FSYNC (actually this helped a bit - same IOPS spikes, but causing only 1/3 of the max disk latency)
    innodb_lru_scan_depth = 1024 / 2048 / 4096
    innodb_flushing_avg_loops = 10 / 20 / 30 / 40 / 50 / 60

    I will continue to investigate, but what feels weird is that adaptive flushing never really kicks in until it is too late; then buffer_flush_background_total_pages shows a rapid increase and all 50K dirty pages are flushed at a rate of approx. 3 x innodb_io_capacity_max...
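
    For reference, the dynamic variables among those can be changed at runtime, e.g. (values illustrative; innodb_flush_method itself is not dynamic and needs a my.cnf change plus a restart):

    SET GLOBAL innodb_lru_scan_depth     = 2048;
    SET GLOBAL innodb_flushing_avg_loops = 30;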

  14. Introducing Orchestrator: manage and visualize your MySQL replication topologies and get home for dinner | Outbrain Techblog wrote:

    […] not GTID? We’re using MySQL 5.5. We’ve had issues while evaluating 5.6; and besides, migrating to GTID is a mess (several solutions or proposed […]

  15. Introducing Orchestrator: manage and visualize your MySQL replication topologies and get home for dinner | code.openark.org wrote:

    […] The mystery of MySQL 5.6 excessive buffer pool flushing […]
