Refactoring replication topology with Pseudo GTID

This post describes in detail the method of using Pseudo GTID to achieve unplanned replication topology changes, namely connecting two arbitrary slaves, or recovering from a master failure even when all its slaves are hanging at different positions.

Please read Pseudo GTID and Pseudo GTID, RBR as introduction.

Consider the following case: the master dies unexpectedly, and its three slaves are all hanging, not necessarily at the same binary log file/position (the network broke down, and some slaves managed to salvage more entries into their relay logs than others).

orchestrator-failed-master

(Did you notice the “Candidate for master” message? To be discussed shortly)

GTID

With GTID, each transaction (and entry in the binary log) is associated with a unique mark: the Global Transaction ID. Pick the slave with the most advanced GTID to be the next master, issue CHANGE MASTER TO MASTER_HOST='chosen_slave' on the other slaves, and everything magically works. A slave knows which GTIDs it has already processed, and can look that entry up in its master's binary logs, resuming replication at the one that follows.
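
To make the GTID case concrete, here is a minimal sketch of such a failover, assuming MySQL 5.6+ with gtid_mode enabled; the host name is made up, and in practice one would first compare the slaves' gtid_executed sets (e.g. via GTID_SUBSET()) to find the most advanced one:

-- run on each of the remaining slaves, pointing them at the chosen new master
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'chosen_slave.mydomain.com',
  MASTER_PORT = 3306,
  MASTER_AUTO_POSITION = 1;  -- GTID auto-positioning finds the resume point
START SLAVE;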

How does that work? The master's binary logs are searched for that GTID entry. I'm not sure how brute-force this search is; I'm aware of a subtlety that requires a brute-force scan of all binary logs, but I don't actually know whether that is always the case.

Pseudo GTID

We can mimic the above, though our solution cannot be as fine-grained. By injecting Pseudo GTID we mark the binary log with unique entries. But instead of having a unique identifier for every entry, we have a unique identifier every second, every 10 seconds, or what have you, with otherwise normal, non-unique entries in between our Pseudo GTID entries.
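
What would such an injection look like? The original Pseudo GTID post has the exact mechanism; the following is only a rough sketch of the idea, assuming the event scheduler is enabled and using illustrative table and column names (note the UUID is first placed in a user variable, keeping the statement replication-safe):

create table if not exists test.pseudo_gtid (
  id int unsigned not null primary key,
  ts int unsigned not null,
  gtid varchar(64) charset ascii not null
);

delimiter $$
create event if not exists test.inject_pseudo_gtid
  on schedule every 10 second starts current_timestamp
  on completion preserve
  enable
do
begin
  -- the unique mark for this interval
  set @pseudo_gtid := uuid();
  insert into test.pseudo_gtid (id, ts, gtid)
    values (1, unix_timestamp(), @pseudo_gtid)
    on duplicate key update ts = values(ts), gtid = values(gtid);
end
$$
delimiter ;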

Recognizing which slave is more up to date

Given two slaves, which is more up to date?

  • If both replicate(d) from the same master, a SHOW SLAVE STATUS comparison answers the question (safe method: wait until the SQL thread catches up with the broken IO thread, then compare relay_master_log_file and exec_master_log_pos on both machines; see the sketch following this list). This is how the “Candidate for master” message above is produced.
  • If one is/was a descendant of the other, then obviously it is at most as advanced as its ancestor.
  • Otherwise we’re unsure – still solvable via bi-directional trial & error, as explained later on.
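
For the first case, once both SQL threads have caught up with their broken IO threads, the comparison boils down to the execution coordinates reported by SHOW SLAVE STATUS; a small illustration, with made-up coordinates:

-- run on each of the two slaves and compare the pairs; the slave whose
-- (relay_master_log_file, exec_master_log_pos) sorts higher has executed
-- more of the failed master's binary log
SHOW SLAVE STATUS\G
--   ...
--   Relay_Master_Log_File: mysql-bin.000058
--     Exec_Master_Log_Pos: 1196
--   ...
-- compare the file names first (by numeric suffix), then the positions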

For now, let's assume we know which slave is more up to date (has received and executed more relay log entries). Let's call it S1, and the less up-to-date one S2. This will make our discussion simpler. Continue reading » “Refactoring replication topology with Pseudo GTID”

Pseudo GTID, Row Based Replication

This post continues Pseudo GTID, in a series of posts describing an alternative to using MySQL GTIDs.

The solution offered in the last post does not work too well for row-based replication. The binary log entries for the INSERT statement look like this:

# at 1020
# at 1074
#141020 12:36:21 server id 1  end_log_pos 1074  Table_map: `test`.`pseudo_gtid` mapped to number 33
#141020 12:36:21 server id 1  end_log_pos 1196  Update_rows: table id 33 flags: STMT_END_F

BINLOG '
lddEVBMBAAAANgAAADIEAAAAACEAAAAAAAEABHRlc3QAC3BzZXVkb19ndGlkAAMDBw8CQAAE
lddEVBgBAAAAegAAAKwEAAAAACEAAAAAAAEAA///+AEAAACL10RUJDg2ZmRhMDk1LTU4M2MtMTFl
NC05NzYyLTNjOTcwZWEzMWVhOPgBAAAAlddEVCQ4Y2YzOWMyYy01ODNjLTExZTQtOTc2Mi0zYzk3
MGVhMzFlYTg=
'/*!*/;

Where's our unique value? Encoded within something that cannot be trusted to be unique. Issuing mysqlbinlog --verbose helps out:

BEGIN
/*!*/;
# at 183
# at 237
#141020 12:35:51 server id 1  end_log_pos 237   Table_map: `test`.`pseudo_gtid` mapped to number 33
#141020 12:35:51 server id 1  end_log_pos 359   Update_rows: table id 33 flags: STMT_END_F

BINLOG '
d9dEVBMBAAAANgAAAO0AAAAAACEAAAAAAAEABHRlc3QAC3BzZXVkb19ndGlkAAMDBw8CQAAE
d9dEVBgBAAAAegAAAGcBAAAAACEAAAAAAAEAA///+AEAAABt10RUJDc1MWJkYzEwLTU4M2MtMTFl
NC05NzYyLTNjOTcwZWEzMWVhOPgBAAAAd9dEVCQ3YjExZDQzYy01ODNjLTExZTQtOTc2Mi0zYzk3
MGVhMzFlYTg=
'/*!*/;
### UPDATE `test`.`pseudo_gtid`
### WHERE
###   @1=1
###   @2=1413797741
###   @3='751bdc10-583c-11e4-9762-3c970ea31ea8'
### SET
###   @1=1
###   @2=1413797751
###   @3='7b11d43c-583c-11e4-9762-3c970ea31ea8'

and that’s something we can work with. However, I like to do stuff from within MySQL, and rely as little as possible on external tools. How do the binary log entries look via SHOW BINLOG EVENTS? Not good.

master [localhost] {msandbox} (test) > show binlog events in 'mysql-bin.000058' limit 20;
+------------------+------+-------------+-----------+-------------+---------------------------------------+
| Log_name         | Pos  | Event_type  | Server_id | End_log_pos | Info                                  |
+------------------+------+-------------+-----------+-------------+---------------------------------------+
| mysql-bin.000058 |    4 | Format_desc |         1 |         107 | Server ver: 5.5.32-log, Binlog ver: 4 |
| mysql-bin.000058 |  107 | Query       |         1 |         183 | BEGIN                                 |
| mysql-bin.000058 |  183 | Table_map   |         1 |         237 | table_id: 33 (test.pseudo_gtid)       |
| mysql-bin.000058 |  237 | Update_rows |         1 |         359 | table_id: 33 flags: STMT_END_F        |
| mysql-bin.000058 |  359 | Xid         |         1 |         386 | COMMIT /* xid=5460 */                 |
| mysql-bin.000058 |  386 | Query       |         1 |         462 | BEGIN                                 |
| mysql-bin.000058 |  462 | Table_map   |         1 |         516 | table_id: 33 (test.pseudo_gtid)       |
| mysql-bin.000058 |  516 | Update_rows |         1 |         638 | table_id: 33 flags: STMT_END_F        |
| mysql-bin.000058 |  638 | Xid         |         1 |         665 | COMMIT /* xid=5471 */                 |
| mysql-bin.000058 |  665 | Query       |         1 |         741 | BEGIN                                 |
| mysql-bin.000058 |  741 | Table_map   |         1 |         795 | table_id: 33 (test.pseudo_gtid)       |
| mysql-bin.000058 |  795 | Update_rows |         1 |         917 | table_id: 33 flags: STMT_END_F        |
| mysql-bin.000058 |  917 | Xid         |         1 |         944 | COMMIT /* xid=5474 */                 |
| mysql-bin.000058 |  944 | Query       |         1 |        1020 | BEGIN                                 |
| mysql-bin.000058 | 1020 | Table_map   |         1 |        1074 | table_id: 33 (test.pseudo_gtid)       |
| mysql-bin.000058 | 1074 | Update_rows |         1 |        1196 | table_id: 33 flags: STMT_END_F        |
| mysql-bin.000058 | 1196 | Xid         |         1 |        1223 | COMMIT /* xid=5476 */                 |
| mysql-bin.000058 | 1223 | Query       |         1 |        1299 | BEGIN                                 |
| mysql-bin.000058 | 1299 | Table_map   |         1 |        1353 | table_id: 33 (test.pseudo_gtid)       |
| mysql-bin.000058 | 1353 | Update_rows |         1 |        1475 | table_id: 33 flags: STMT_END_F        |
+------------------+------+-------------+-----------+-------------+---------------------------------------+

The representation of row-format entries in the SHOW BINLOG EVENTS output is really poor. Why, there’s nothing to tell me at all about what’s been done, except that this is some operation on test.pseudo_gtid. Obviously I cannot find anything unique over here.

Not all is lost. How about DDL statements? Those are still written in SBR format (there are no rows to log when creating a table). A solution could be to somehow embed a unique value in a DDL statement. There could be various such solutions; I chose to use a CREATE VIEW statement, dynamically composed with a UUID(): Continue reading » “Pseudo GTID, Row Based Replication”
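
The exact statement is behind the cut; as a rough sketch of the idea (schema and naming here are my own illustration, not the post's), one can embed a UUID in a view's name via a prepared statement and then drop the view. DDL is always logged as a statement, so the unique name lands in the binary log (and in SHOW BINLOG EVENTS) even with binlog_format=ROW:

-- illustrative only: create, then immediately drop, a view whose name embeds a UUID
set @pseudo_gtid_hint := replace(uuid(), '-', '_');
set @ddl := concat('create view test._pseudo_gtid_hint__', @pseudo_gtid_hint, ' as select 1');
prepare st from @ddl; execute st; deallocate prepare st;
set @ddl := concat('drop view if exists test._pseudo_gtid_hint__', @pseudo_gtid_hint);
prepare st from @ddl; execute st; deallocate prepare st;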

Making UUID() and RAND() replication safe

MySQL's UUID() and RAND() functions both produce (pseudo-)nondeterministic results. UUID()'s result is, moreover, bound to the host on which it executes. For this reason, both are unsafe to replicate with the STATEMENT binlog format. As an example, consider:

master> create table test.uuid_test (id int, u varchar(64));

master> insert into test.uuid_test values (1, UUID());
Query OK, 1 row affected, 1 warning (0.03 sec)

master> select * from test.uuid_test;
+------+--------------------------------------+
| id   | u                                    |
+------+--------------------------------------+
|    1 | 7e3596d8-56ac-11e4-b284-3c970ea31ea8 |
+------+--------------------------------------+

The warning we got on the insert directly relates to the following inconsistency on a slave:

slave1> select * from test.uuid_test;
+------+--------------------------------------+
| id   | u                                    |
+------+--------------------------------------+
|    1 | 7e379d63-56ac-11e4-8477-3c970ea31ea8 |
+------+--------------------------------------+

The data on the slave is clearly inconsistent with the master’s. The slave, replicating via STATEMENT binlog format, re-executes the INSERT command and gets a different UUID value.

External

One solution to the above is to generate the UUID value from your application. By the time MySQL gets the INSERT statement, the UUID value is a constant string, as far as MySQL is concerned.
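
In other words, the statement that actually reaches MySQL looks something like the following (the UUID literal is whatever the application generated, and it replicates verbatim; the value below is just the one from the earlier output):

master> insert into test.uuid_test values (2, '7e3596d8-56ac-11e4-b284-3c970ea31ea8');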

Internal

However, there's a way to do it from within MySQL, by decoupling the UUID() function from the INSERT statement. It takes a user variable. Consider: Continue reading » “Making UUID() and RAND() replication safe”
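
A minimal sketch of the idea (the post's full treatment is behind the cut): assign UUID() to a user variable first. The variable's value is written to the binary log as a constant, so the INSERT replays deterministically on the slaves:

master> set @safe_uuid := UUID();
master> insert into test.uuid_test values (3, @safe_uuid);
-- the binary log records the user variable's concrete value ahead of the
-- INSERT, so the slave re-executes it with the exact same UUID string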

orchestrator 1.1.18: new features, support for orchestrator-agent

Outbrain‘s orchestrator 1.1.18 is released:

  • Support for orchestrator-agent (see announcement): agent pages, support for agent actions, initiation of seeds (provisioning of new/corrupted servers), auditing of seeds.
  • Clusters dashboard
  • Support for long query auditing
  • SSL
  • Proxy authentication (e.g. apache2 serving as reverse-proxy with LDAP)
  • User control
  • Better slave moving rules.

orchestrator now allows for seeding/provisioning of servers via orchestrator-agent. It communicates with agents executing on the MySQL hosts and coordinates the transfer of data between them. orchestrator now supports invocation and auditing of seeding operations, and protects you from breaking your seeds. The orchestrator-agent is a solution to Outbrain's specific use case, and may not appeal to the greater crowd. Nonetheless it is extensible and is released as open source. Continue reading » “orchestrator 1.1.18: new features, support for orchestrator-agent”

Announcing orchestrator-agent

orchestrator-agent is a sidekick, complementary project to orchestrator. It implements a daemon service on one's MySQL hosts which communicates with, and accepts commands from, orchestrator, and was originally built to provide an automated solution for provisioning new or corrupted slaves.

It was built by Outbrain, with Outbrain's specific use case in mind. While we release it as open source, only a small part of its functionality will appeal to the public (this is why it's not strictly part of the orchestrator project, which is a general-purpose, wide-audience solution). Nevertheless, it is a simple daemon implementation, one that can easily be extended by the community. The project is open for pull requests!

A quick breakdown of orchestrator-agent is as follows:

  • Executes as a daemon on linux hosts
  • Interacts and invokes OS commands (via bash)
  • Does not directly interact with a MySQL server running on that host (does not connect via mysql credentials)
  • Expects a single MySQL service on host
  • Can control the MySQL service (e.g. stop, start)
  • Is familiar with LVM layer on host
  • Can take LVM snapshots, mount snapshots, remove snapshots
  • Is familiar with the MySQL data directory, disk usage, file system
  • Can send snapshot data from a mounted snapshot on a running MySQL host
  • Can prepare data directory and receive snapshot data from another host
  • Recognizes local/remote datacenters
  • Is controlled by orchestrator; two orchestrator-agents together implement an automated and audited solution for seeding a new/corrupted MySQL host based on a running server.

Continue reading » “Announcing orchestrator-agent”

Orchestrator 1.0.5: refactoring masters, multi-master replication

Outbrain’s orchestrator Version 1.0.5 is released.

Quick links: Orchestrator Manual, FAQ, Downloads

Orchestrator now supports refactoring of masters via master-master topologies. It allows promoting slaves to co-masters and detaching instances from a co-master topology, effectively allowing you to replace an active master.

Like this

Drag a master:

orchestator-cm-simple-drag-master-01

Onto one of its slaves: Continue reading » “Orchestrator 1.0.5: refactoring masters, multi-master replication”

Introducing Orchestrator: manage and visualize your MySQL replication topologies and get home for dinner

I’m happy to announce the availability of Outbrain‘s Orchestrator: MySQL replication management & visualization tool.

orchestrator - simple topology

  • Orchestrator reads your replication topologies (give it one server – be it master or slave – in each topology, and it will reveal the rest).
  • It keeps a state of this topology.
  • It can continuously poll your servers to get an up to date topology map.
  • It visualizes the topology in a clear and slick D3 tree.
  • It allows you to modify your topology; move slaves around. You can use the command line variation, the JSON API, or you can use the web interface.

Quick links: Orchestrator Manual, FAQ, Downloads

Nothing like nice screenshots

To move slaves around the topology (repoint a slave to a different master) through orchestrator‘s web interface, we use Drag and Drop, Continue reading » “Introducing Orchestrator: manage and visualize your MySQL replication topologies and get home for dinner”

Using deep nested replication topologies

If you’re running more than a few slaves in a replication topology, you might choose to use deeply nested replication: slaves replicating from other slaves at 2, 3 or even 4 levels. There are pros and cons to such topologies, discussed below.

A simple, small deep nested topology is depicted below (it is also a real production topology of ours):

topology-sample-02

Two slaves, srv-4 and srv-8, act as local masters to yet other slaves. Why would we want this complexity?

Pros

  • Reduce load on the master: too many slaves replicating from a single master means the master becomes loaded with serving the binary logs. Typically, when all slaves are up to date, this isn't a big deal, since they are all served roughly the same entries and caching works great. If not all are in sync, the master needs to look up different log entries and pays for it with more disk I/O.
  • Reduce network load. We serve from three different data centres. Our master is in one DC, and our slaves are spread across all three. Inter-DC network is naturally slower; it is also more expensive, hence more easily saturated. Reducing cross-DC network traffic is something we do across all our systems, including MySQL. srv-4, for example, could depict a slave that acts as a local master in its own DC, serving srv-5, srv-6 and srv-7, all in the same DC, hence using cross-DC network for one slave instead of four. A somewhat over-simplified example, but a true one.
  • Failover. MHA does a good job at synchronizing slaves of the same master by figuring out the missing binary log entries for each slave. It should do well within a single region, but I do not know that it would do the same cross-region (I'm assuming the binlog entry copying should work, but I haven't tried it cross-region). In case of a disaster such as an entire DC going down (we actually had such a case a couple of weeks ago; power went out for the entire DC), we have, in each other DC, a designated master into which we can fail over, and which has enough slaves (from each remaining DC) under it to keep serving. That is, we're willing to skip the fancy syncing and just point to a newly promoted master, with the benefit that the entire replication topology under it remains intact.
  • Testing & upgrades. For example, I might want to upgrade to 5.6. Upgrading a slave from 5.5 to 5.6 is a good start; we look at replication and see that nothing gets broken. But how will our production master behave with 5.6? Put some more slaves under your newly upgraded 5.6 server and get a clearer picture. At some stage you might just promote this entire subtree as the new topology.

Here’s another topology; DC info is not depicted in this image, but you can guess what designated masters we have: Continue reading » “Using deep nested replication topologies”

From Percona Server to MySQL and back to Percona Server: beware of crash safe replication info

We're migrating some of our “vanilla” MySQL 5.5 servers to Percona Server 5.5. One of the major incentives is the crash-safe replication feature, which allows slaves to die (power failure) and resume replication without losing their position in the relay logs.

Whether or not we will migrate all our servers depends on further benchmarking; so far we’ve noticed unexpected results, but these are still premature to publish.

However, the fact that we are using both MySQL and Percona Server has led us into a peculiar situation which I'd like to share. We reseed our servers via LVM snapshots. If we need a new machine, or have a corrupted slave, we capture an image of a running slave and duplicate it, a process which takes the better part of a day. This duplicates not only the data, of course, but also the relay logs, the relay-log.info file, and the master.info file, implying the position within the topology.

With crash-safe replication this also means the transactional relay log position. Recap: crash-safe replication writes, per transaction, the relay log status into the ibdata1 file, so the relay log info in ibdata1 is in perfect alignment with your committed transactions. Upon server startup, Percona Server reads the info from ibdata1 and overwrites the relay-log.info file (it completely disregards whatever was in that file prior to startup).

Can you guess what could go wrong here? Here's the scenario we had; the same problem can unfold in different scenarios.

Take a look at the following topology:

simple topology

(this image is an actual online visualization of a replication topology; for the purposes of this blog post it's a sandbox topology on my laptop. Please stand by for a very cool open source release announcement shortly)

Continue reading » “From Percona Server to MySQL and back to Percona Server: beware of crash safe replication info”