Comments on: What’s so complicated about a master failover?
https://shlomi-noach.github.io/blog/mysql/whats-so-complicated-about-a-master-failover
Blog by Shlomi Noach

By: Shlomi Noach, Tue, 12 Sep 2017 18:29:00 +0000
https://shlomi-noach.github.io/blog/mysql/whats-so-complicated-about-a-master-failover/comment-page-1#comment-401193

@Dane,

Over the course of the past year, we have migrated our MySQL clusters from statement-based replication to row-based replication, and from 5.6 to 5.7.
These migrations were gradual. The migration from 5.6 to 5.7 took months of experiments, measurements and testing.
We slowly added more and more servers, and when we were convinced this would work out, we did a marathon of upgrades for the remaining servers.
orchestrator lets us concentrate on what we want to do, without having to hard-code each upgrade in some config file to indicate whether the server should or should not participate in a failover path.
We had a couple of cases where we downgraded servers from 5.7 back to 5.6 because we were getting inconsistent metrics. Requiring each such upgrade/downgrade to be manually followed by a config change would have been quite the hassle.

The thing is, we’re really not a big shop for MySQL. At a little over 100 servers we’re a busy, but not huge shop.
Previously I worked at Booking.com with X,000 MySQL servers. The mere thought of having anything manual at the single server level was excruciating. We had to have as much automation as possible “or else”.

5.6-5.7, sbr-rbr, special roles (delayed, backup, dev)… those are all real-world scenarios.
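
To make the dynamic approach concrete, here is a toy sketch of choosing a promotion candidate from the current state of the replicas rather than from a config file. It is not orchestrator's actual logic. The `Replica` type, the role labels and both compatibility rules are assumptions made for illustration (a replica should not run an older version than its master, and a ROW master should not feed a STATEMENT replica), and it ignores replication positions and lag entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Hypothetical snapshot of one replica, as observed at failover time."""
    host: str
    version: tuple            # e.g. (5, 6) or (5, 7)
    binlog_format: str        # "STATEMENT" or "ROW"
    role: str = "normal"      # assumed labels: "normal", "delayed", "backup", "dev"


def can_replicate_from(replica, master):
    """Assumed compatibility rules, for illustration only."""
    # A replica should not run an older version than its master.
    if replica.version < master.version:
        return False
    # A ROW master should not feed a replica that logs in STATEMENT format.
    if master.binlog_format == "ROW" and replica.binlog_format == "STATEMENT":
        return False
    return True


def pick_candidate(replicas):
    """Return the normal-role replica that the most other replicas could follow."""
    best, best_count = None, -1
    for candidate in replicas:
        if candidate.role != "normal":
            continue  # delayed/backup/dev servers are never promoted in this sketch
        count = sum(
            1 for r in replicas
            if r is not candidate and can_replicate_from(r, candidate)
        )
        if count > best_count:  # on a tie, the earlier candidate wins
            best, best_count = candidate, count
    return best


if __name__ == "__main__":
    topology = [
        Replica("db-a", (5, 6), "STATEMENT"),
        Replica("db-b", (5, 6), "ROW"),
        Replica("db-c", (5, 7), "ROW"),
        Replica("db-d", (5, 6), "STATEMENT", role="delayed"),
    ]
    print(pick_candidate(topology).host)
```

In this mixed topology the 5.7 server and the ROW server would each orphan most of the others, so the 5.6 STATEMENT server is chosen. Upgrading or downgrading a server changes the outcome without anyone touching a config file, which is the point of the dynamic approach.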

Your preference is to require homogenous mysql clusters by default. “By default” means a time when “nothing special happens” – and that’s exactly where I see the difference. We almost always have something special happening.

The costs associated with building this complexity into orchestrator are development time, code complexity, and behavior complexity.

The failover logic is not “beautiful code”. It is filled with conditions and use cases that make sense in reality. Some scenarios still challenge the orchestrator logic, and we need to adapt.

I can share a public statement by Booking.com that they experience a failover every day, and orchestrator does the right thing every day (and sometimes it does not do the best thing, and I’ve seen some crazy scenarios there).
The assumptions made before orchestrator was in the game were limiting in terms of deployments, hardware, and outage time.

With orchestrator, and this isn’t a sales pitch, you are more free to choose your preferred topology/technology/setup and just assume orchestrator will deal with it.

Anyway, those are my thoughts for the day. Reading the above, it really sounds like marketing. Not everything is great & wonderful & pretty, but I think the dynamic approach (as opposed to a configuration-based approach) is a huge step and a relief to engineers.

Thank you

By: Dane Miller, Tue, 12 Sep 2017 16:21:00 +0000
https://shlomi-noach.github.io/blog/mysql/whats-so-complicated-about-a-master-failover/comment-page-1#comment-401176

I’m skeptical that automation makes sense for some of the mixed replication setups you describe. Mix of 5.6/5.7, mix of statement/rbr. Does it make sense to make failover decisions based on majority/quorum concepts for mysql async replication? I tend to think not, but I suspect I haven’t seen nearly the number of failovers as you.

My preference, and how we operate at my site, is to require homogenous mysql clusters by default. Any mysql version or replication discrepancies trigger alerts. That doesn’t mean you can’t run with a mix of hosts/versions/settings, it just means we must explicitly pick a homogenous set of hosts that will participate in automated failover. The rest will be orphaned after a failover, and require manual cleanup.

Just curious, have you found benefits to automating failover for this genre of complex mixed configuration mysql replication topologies? Could you share some examples? Have you experienced any costs associated with building this complexity into orchestrator?

As always, thanks for all your work and posts on orchestrator, ghost, and mysql. I’m a big fan!
Dane

By: Shlomi Noach, Thu, 29 Jun 2017 17:24:00 +0000
https://shlomi-noach.github.io/blog/mysql/whats-so-complicated-about-a-master-failover/comment-page-1#comment-394717

Hi Rick,
I hinted at split brain with “A DC is network partitioned…”. An orchestrator PR to handle split brain is https://github.com/github/orchestrator/pull/183

By: Rick James, Thu, 29 Jun 2017 15:56:00 +0000
https://shlomi-noach.github.io/blog/mysql/whats-so-complicated-about-a-master-failover/comment-page-1#comment-394712

What about “split brain”?
