orchestrator 3.0.6 is released and includes some exciting improvements and features. It quickly follows up on 3.0.5 released recently, and this post gives a breakdown of some notable changes:
Faster failure detection
Recall that orchestrator uses a holistic approach for failure detection: it reads state not only from the failed server (e.g. master) but also from its replicas. orchestrator now detects failure faster than before:
- A detection cycle has been eliminated, leading to quicker resolution of a failure. On our setup, where we poll servers every 5sec, failure detection time dropped from 7-10sec to 3-5sec, keeping reliability. The reduction in time does not lead to increased false positives.
  Side note: you may see increased not-quite-failure analysis such as “I can’t see the master” (UnreachableMaster).
- Better handling of network scenarios where packets are dropped. Instead of hanging till TCP timeout, orchestrator now observes server discovery asynchronously. We have specialized failover tests that simulate dropped packets. The change reduces detection time by some 5sec.
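To make the holistic idea concrete, here is a minimal sketch of how a verdict could combine the master's own state with what its replicas report. It is an illustration only, not orchestrator's actual code: the ReplicaView type, the analyze_master function and the "NoProblem" and "DeadMaster" labels are assumptions made for this example. Only UnreachableMaster is a name taken from this post.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaView:
    """What one replica reports (hypothetical shape, for illustration)."""
    reachable: bool          # could we connect to the replica itself?
    io_thread_running: bool  # does the replica still receive from the master?


def analyze_master(master_reachable: bool, replicas: List[ReplicaView]) -> str:
    """Combine the master's state with its replicas' view of it."""
    if master_reachable:
        return "NoProblem"
    # Only replicas we can actually talk to get a vote.
    reporting = [r for r in replicas if r.reachable]
    # Master is down from our side AND every reporting replica agrees
    # that replication is broken: treat as a confirmed failure.
    if reporting and all(not r.io_thread_running for r in reporting):
        return "DeadMaster"
    # We cannot see the master, but the replicas do not (all) agree:
    # a not-quite-failure, no recovery should start yet.
    return "UnreachableMaster"
```

The point of the second branch is that a single observer losing sight of the master is not enough to trigger a failover; the replicas have to confirm it.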
Faster master recoveries
Promoting a new master is a complex task which attempts to promote the best replica out of the pool of replicas. It’s not always the most up-to-date replica. The choice varies depending on replica configuration, version, and state.
With recent changes, orchestrator is able to recognize, early on, that the replica it would like to promote as master is ideal. Assuming that is the case, orchestrator is able to immediately promote it (i.e. run hooks, set read_only=0 etc.), and run the rest of the failover logic, i.e. the rewiring of replicas under the newly promoted master, asynchronously.
This allows the promoted server to take writes sooner, even while its replicas are not yet connected. It also means external hooks are executed sooner.
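The ordering can be sketched as follows. This is a rough illustration rather than orchestrator's implementation: the recover function, its parameters and the log entries are hypothetical, and the sequential branch for a non-ideal candidate is my assumption about the slower path.

```python
import threading
from typing import List, Optional


def recover(candidate: str, replicas: List[str], is_ideal: bool,
            log: List[str]) -> Optional[threading.Thread]:
    """Promote `candidate`; return the background rewiring thread, if any."""

    def rewire() -> None:
        # Repoint every other replica under the newly promoted master.
        for replica in replicas:
            if replica != candidate:
                log.append(f"rewire {replica} under {candidate}")

    def promote() -> None:
        # Stands in for running hooks, setting read_only=0, etc.
        log.append(f"promote {candidate}")

    if is_ideal:
        # Fast path: the candidate takes writes right away,
        # and the replicas are rewired in the background.
        promote()
        worker = threading.Thread(target=rewire)
        worker.start()
        return worker

    # Assumed slow path: sort out the replicas first, then promote.
    rewire()
    promote()
    return None
```

A caller that needs the full topology settled can join() the returned thread; a caller that only cares about accepting writes does not have to wait.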
Between faster detection and faster recoveries, we’re looking at some 10sec reduction in overall recovery time: from the moment of the crash to the moment a new master accepts writes. We now stand at < 20sec in almost all cases, and < 15sec in optimal cases. Those times are measured on our failover tests.
We are working on reducing failover time unrelated to orchestrator and hope to update soon.
Automated Pseudo-GTID
As a reminder, Pseudo-GTID is an alternative to GTID, without the kind of commitment you make with GTID. It provides similar “point your replica under any other server” behavior that GTID allows.
There are still many setups out there where GTID is not (yet?) deployed and enabled. However, Pseudo-GTID is often misunderstood, and though I’ve blogged and presented Pseudo-GTID many times in the past, I still find myself explaining to people that the setup is simple and does not involve changes to one’s topologies.
Well, it just got simpler. orchestrator is now able to automatically inject Pseudo-GTID for you.
Say the word: "AutoPseudoGTID": true, grant the necessary privilege, and your non-GTID topology is suddenly supercharged with magical Pseudo-GTID tokens that provide you with:
- Arbitrary relocation of replicas
- Automated or manual failovers (masters and intermediate masters)
- Vendor freedom: runs on Oracle MySQL, Percona Server, MariaDB, or all of the above at the very same time.
- Version freedom (still on 5.5? No problem. Oh, this gets you crash-safe replication as extra bonus, too)
Auto-Pseudo-GTID further simplifies the infrastructure in that you no longer need to take care of injecting Pseudo-GTID onto the master, nor handle master identity changes. No more event_scheduler to enable/disable nor services to start/stop.
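Since the orchestrator configuration is a JSON document, turning the feature on amounts to setting one key. The helper below is a small sketch of that, assuming the config is available as JSON text; the function name enable_auto_pseudo_gtid is made up for this example, and the sample input key is only a placeholder for whatever your config already contains. It does not cover the privilege mentioned above.

```python
import json


def enable_auto_pseudo_gtid(config_text: str) -> str:
    """Return the given JSON config with "AutoPseudoGTID": true set."""
    config = json.loads(config_text)
    # The only key this post asks for; all other settings are left as-is.
    config["AutoPseudoGTID"] = True
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Placeholder input; a real config has many more settings.
    print(enable_auto_pseudo_gtid('{"MySQLTopologyUser": "orchestrator"}'))
```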
More and more setups are moving to GTID. We may, too! But I find it peculiar that Pseudo-GTID was suggested 4 years ago, when 5.6 GTID was already released, and still many setups are not yet running GTID. If you’re not using GTID, please try Pseudo-GTID! Read more.
Semi-sync support
Semi-sync has been internally supported via a specialized patch contributed by Vitess, to flag a server as semi-sync-able and handle enablement of semi-sync upon master failover.
orchestrator now supports semi-sync more generically. You may use orchestrator to enable/disable semi-sync master/replica side, via orchestrator -c enable-semi-sync-master, orchestrator -c enable-semi-sync-replica, orchestrator -c disable-semi-sync-master, orchestrator -c disable-semi-sync-replica commands (or API equivalent).
The API will also tell you whether semi-sync is enabled on instances. Note that configured != enabled: a server can be configured with rpl_semi_sync_master_enabled=ON, but if no semi-sync replicas are found, the Rpl_semi_sync_master_status state is OFF.
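The distinction is easy to get wrong in monitoring scripts, so here is a small sketch of it. The two dictionaries stand in for values you would read from the server's global variables and global status; the function name and the three returned labels are assumptions of this example, not anything orchestrator's API returns.

```python
from typing import Mapping


def semi_sync_master_state(variables: Mapping[str, str],
                           status: Mapping[str, str]) -> str:
    """Tell apart 'configured' from 'actually in effect' on a master."""
    configured = str(
        variables.get("rpl_semi_sync_master_enabled", "OFF")
    ).upper() in ("ON", "1")
    active = str(
        status.get("Rpl_semi_sync_master_status", "OFF")
    ).upper() == "ON"

    if configured and active:
        return "enabled"
    if configured:
        # e.g. the variable is ON but no semi-sync replicas are connected
        return "configured-but-inactive"
    return "disabled"
```

Checking only the variable would report the second case as healthy, which is exactly the trap described above.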
More
UI changes, removal of prepared statements, documentation updates, raft updates…
orchestrator is free and open source and released under the Apache 2 license. It is authored at and used by GitHub.
I’ll be presenting orchestrator/raft at FOSDEM next week, at the MySQL and Friends Room.