orchestrator 3.0.6 is released and includes some exciting improvements and features. It quickly follows up on 3.0.5 released recently, and this post gives a breakdown of some notable changes:
Faster failure detection
Recall that orchestrator uses a holistic approach for failure detection: it reads state not only from the failed server (e.g. master) but also from its replicas. orchestrator now detects failure faster than before:
- A detection cycle has been eliminated, leading to quicker resolution of a failure. On our setup, where we poll servers every 5sec, failure detection time dropped from 7-10sec to 3-5sec with no loss of reliability: the reduction in time does not lead to increased false positives.
Side note: you may see more not-quite-failure analyses, such as “I can’t see the master” (UnreachableMaster).
- Better handling of network scenarios where packets are dropped. Instead of hanging till TCP timeout, orchestrator now observes server discovery asynchronously. We have specialized failover tests that simulate dropped packets. The change reduces detection time by some 5sec.
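The two ideas above, probing servers without blocking on a hung connection and combining the master's state with what its replicas report, can be sketched roughly as follows. This is a minimal illustration in Python, not orchestrator's actual code: `probe_all` and `analyze` are hypothetical names, the rules are simplified, and only the UnreachableMaster label comes from this post (the other two labels are used here for illustration).

```python
import concurrent.futures as cf
import time


def probe_all(probes, timeout):
    """Run every probe concurrently and wait at most `timeout` seconds overall.

    `probes` maps a server name to a zero-argument callable. A probe that
    fails or does not answer in time yields None instead of blocking the caller.
    """
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(probes)))
    futures = {name: pool.submit(fn) for name, fn in probes.items()}
    deadline = time.monotonic() + timeout
    results = {}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results[name] = future.result(timeout=remaining)
        except Exception:  # timed out, or the probe itself raised
            results[name] = None
    pool.shutdown(wait=False)  # do not wait for probes that are still hanging
    return results


def analyze(master_reachable, replicas_see_master):
    """Combine our own view of the master with the replicas' view.

    `replicas_see_master` holds one entry per replica: True if the replica
    reports working replication from the master, False if it reports broken
    replication, None if the replica did not answer (treated here as
    "cannot confirm", a simplification).
    """
    if master_reachable:
        return "NoProblem"
    if not replicas_see_master or any(seen is True for seen in replicas_see_master):
        # We cannot reach the master, but there is no agreement that it is gone.
        return "UnreachableMaster"
    # Neither we nor any replica can see the master.
    return "DeadMaster"
```

In this sketch a failure is only declared when every replica agrees with the observer, which is why a faster detection cycle does not have to mean more false positives.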
Faster master recoveries
Promoting a new master is a complex task which attempts to promote the best replica out of the pool of replicas. It’s not always the most up-to-date replica. The choice varies depending on replica configuration, version, and state.
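To illustrate why the most up-to-date replica is not always the right choice, here is a small sketch in Python. The rules are assumptions picked for illustration, not orchestrator's actual promotion logic: a candidate must have binary logging enabled (otherwise it cannot serve as a master), and must run the oldest version in the pool (replicating from a newer version to an older one is generally not supported). The `Replica` type, its fields, and `pick_candidate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    version: tuple        # e.g. (5, 7); hypothetical representation
    binlog_enabled: bool  # replica configuration
    executed_pos: int     # replica state: higher means more up to date


def pick_candidate(replicas):
    """Return the most up-to-date replica among those eligible, or None."""
    if not replicas:
        return None
    oldest = min(r.version for r in replicas)
    eligible = [r for r in replicas if r.binlog_enabled and r.version == oldest]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r.executed_pos)


pool = [
    Replica("r1", (5, 7), binlog_enabled=False, executed_pos=900),  # most up to date, no binary logs
    Replica("r2", (8, 0), binlog_enabled=True, executed_pos=850),   # newer version than the rest
    Replica("r3", (5, 7), binlog_enabled=True, executed_pos=800),
]
chosen = pick_candidate(pool)
```

Here `chosen` is `r3`, even though both `r1` and `r2` are further ahead.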
With recent changes, orchestrator is able to recognize, early on, that the replica it would like to promote as master is ideal. Assuming that is the case, orchestrator is able to immediately promote it (i.e. run hooks, set read_only=0 etc.), and run the rest of the failover logic, i.e. the rewiring of replicas under the newly promoted master, asynchronously.
This allows the promoted server to take writes sooner, even while its replicas are not yet connected. It also means external hooks are executed sooner.
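The ordering described above can be sketched as follows. This is a simplified Python illustration, not orchestrator's implementation: `recover`, `promote` and `rewire` are hypothetical names, and the non-ideal branch (rewire first, then promote) is an assumption about the earlier behavior.

```python
import threading


def recover(candidate, other_replicas, is_ideal, promote, rewire):
    """Promote `candidate` to master.

    `promote(candidate)` stands for running hooks, setting read_only=0, etc.
    `rewire(candidate, other_replicas)` stands for repointing the remaining
    replicas under the new master.
    Returns the background thread doing the rewiring, or None if everything
    ran synchronously.
    """
    if is_ideal:
        # Ideal candidate: take writes right away, rewire in the background.
        promote(candidate)
        worker = threading.Thread(target=rewire, args=(candidate, other_replicas))
        worker.start()
        return worker
    # Assumed fallback: finish rewiring before promoting.
    rewire(candidate, other_replicas)
    promote(candidate)
    return None
```

A caller that needs the topology fully rewired can `join()` the returned thread; one that only cares about accepting writes can proceed as soon as `recover` returns.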
Between faster detection and faster recoveries, we’re looking at some 10sec reduction in overall recovery time, measured from the moment of crash to the moment a new master accepts writes. We now stand at < 20sec in almost all cases, and < 15sec in optimal cases. Those times are measured on our failover tests.
We are working on reducing failover time unrelated to orchestrator and hope to update soon.
Automated Pseudo-GTID
As a reminder, Pseudo-GTID is an alternative to GTID, without the kind of commitment you make with GTID. It provides similar “point your replica under any other server” behavior to what GTID allows.