This post describes in detail the method of using Pseudo GTID to achieve unplanned replication topology changes, i.e. connecting two arbitrary slaves, or recovering from a master failure even as all its slaves are hanging in different positions.
Please read Pseudo GTID and Pseudo GTID, RBR as introduction.
Consider the following case: the master dies unexpectedly, and its three slaves are all hanging, not necessarily at same binary log file/position (network broke down while some slaves managed to salvage more entries into their relay logs than others)
(Did you notice the “Candidate for master” message? To be discussed shortly)
GTID
With GTID each transaction (and entry in the binary log) is associated with a unique mark — the Global Transaction ID. Just pick the slave with the most advanced GTID to be the next master, and just CHANGE MASTER TO MASTER_HOST=’chosen_slave’ on the other slaves, and everything magically works. A slave knows which GTID it has already processed, and can look that entry on its master’s binary logs, resuming replication on the one that follows.
How does that work? The master’s binary logs are searched for that GTID entry. I’m not sure how brute-force this is, since I’m aware of a subtlety which requires brute-force scan of all binary logs; I don’t actually know if it’s always like that.
Pseudo GTID
We can mimick that above, but our solution can’t be as fine grained. With the injection of Pseudo GTID we mark the binary log for unique entries. But instead of having a unique identifier for every entry, we have a unique identifier for every second, 10 seconds, or what have you, with otherwise normal, non-unique entries in between our Pseudo GTID entries.
Recognizing which slave is more up to date
Given two slaves, which is more up to date?
- If both replicate(d) from same master, a SHOW SLAVE STATUS comparison answers (safe method: wait till SQL thread catches up with broken IO thread, compare relay_master_log_file, exec_master_log_pos on both machines). This is the method by which the above “Candidate for master” message is produced.
- If one is/was descendent of the other, then obviously it is less advanced or equals its ancestor.
- Otherwise we’re unsure – still solvable via bi-directional trial & error, as explained later on.
For now, let’s assume we know which slave is more up to date (has received and executed more relay logs). Let’s call it S1, whereas the less up-to-date will be S2. This will make our discussion simpler.
Prerequisites
- We require a Pseudo GTID in place: a periodic injection of a known-to-be-unique query, and which we know how to intercept
- We require log_slave_updates on all slaves. We will need to parse the binary logs on slaves. I have little trust in the availability of relay logs: these are flushed, rotated and auto-erased all too quickly. The proposed solution does not require any daemon running on the MySQL servers themselves. There will be nothing to back up the relay logs, so I can’t trust these to exist. Binary logs, on the other hand, have expiry period in days, and so I can trust them to exist for a duration of a few minutes.
- Normal replication. Not multi threaded. Not multi-source.
The process of rematching slaves
S1 is more up to date, hence we want to make S2 a slave of S1. We expect the statements/entries found in S2‘s binary logs to exist in S1, in the same order, but somewhere back in the past, padded by additional entries (zero or more) that are not found in S2. Steps are:
- Find latest Pseudo-GTID entry in S2‘s logs. This can be done by iterating S2‘s binary logs newest to oldest. The first (time DESC) binary log where such entry is found is to be searched for the last entry (latest). Keep record of the binlog file2:pos2 coordinates.
- Take note of the exact entry made in the above. This is the unique value.
- Search said unique value in S1‘s binary logs. Since it is unique, your method of search is arbirtary, you just need to find it. Brute-force wise you start looking at newest binary log moving back in time. Not found? Unlikely, since this means the lag diff between S1 and S2 is as long as the binlog expiry. We will be handling with failures and with immediate actions; we can expect slave lags in the seconds or in the minutes – we don’t even consider the possibility where the entry is not found.
- Take note of the coordinates file1:pos1 in S1 where we found the unique value.
- We now iterate S2‘s binary logs starting with the Pseudo GTID file2:pos2. We expect to find each entry in S1‘s binary logs, successively, starting file1:pos1. We verify the entries in both servers are identical. Exceptions above could be:
- Meta-entries (start-of-log, end-of-log, shutdown), in which case we skip to the next entry (this is done in both S1 and S2)
- Local statements executed directly on either S1 or S2, such as ANALYZE TABLE or whatever, which make no impact on data — we may skip these
- Local, evil statements executed directly on the slaves,which make for data impact (INSERT, DELETE, …). We choose to fail the operation in such case
- After all entries in S2 (matched by entries in S1) are iterated, our S1 “cursor” now looks at the first statement that never made it to S2. This is file_win:pos_win, into which we will point S2.
- Or, we might find that upon iterating all entries in S2 we have exactly reached the end of binlog entries for S1: this means both S1 and S2 are actually in identical state. We point S2 into S1‘s next-binlog-position.
- This is in fact no different than the previous case, but of particular interest.
- Or, we might run out of entries in S1. No, we can’t, because our assumption was that S1 is more advanced than (or equally advanced as) S2. But this answers the question: “what if didn’t know in advance who’s more advanced?” (no pun intended). In such case we conclude S2 is actually more advanced than S1 and we can try the other way around.
That last bullet is of importance: if you have two slaves whose “family connection” is complex, you can still match one below the other; you may try one way and fail, then try the other way around and succeed.
Comparison of the events following the Pseudo-GTID is a good way of sanity checking (some meta-stuff should be ignored, like transaction IDs, table IDs, these can vary across servers), and builds up confidence in the correctness of the operation.
The codebase is actually complete and pushed; I’ll release a BETA version or orchestrator next week, that supports Pseudo GTID. Let me tell you, doing this kind of crazy stuff with visual feedback (of course command line is available) is very very cool.
5 thoughts on “Refactoring replication topology with Pseudo GTID”