orchestrator 3.0.6: faster crash detection & recoveries, auto Pseudo-GTID, semi-sync and more

January 29, 2018

orchestrator 3.0.6 is released and includes some exciting improvements and features. It quickly follows up on 3.0.5 released recently, and this post gives a breakdown of some notable changes:

Faster failure detection

Recall that orchestrator uses a holistic approach for failure detection: it reads state not only from the failed server (e.g. master) but also from its replicas. orchestrator now detects failure faster than before:

  • A detection cycle has been eliminated, leading to quicker resolution of a failure. On our setup, where we poll servers every 5sec, failure detection time dropped from 7-10sec to 3-5sec with no loss of reliability: the shorter detection time does not lead to more false positives.
    Side note: you may see increased not-quite-failure analysis such as "I can't see the master" (UnreachableMaster).
  • Better handling of network scenarios where packets are dropped. Instead of hanging till TCP timeout, orchestrator now observes server discovery asynchronously. We have specialized failover tests that simulate dropped packets. The change reduces detection time by some 5sec.
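
To illustrate the holistic approach, here is a minimal sketch in Python (orchestrator itself is written in Go, and its real analysis covers many more cases). The ReplicaView type and analyze_master function are hypothetical names made up for this example; only the idea of combining the master's state with its replicas' view comes from the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaView:
    """What one replica tells us (hypothetical structure for this sketch)."""
    reachable: bool          # could we reach the replica itself?
    io_thread_running: bool  # does the replica still see its master?


def analyze_master(master_reachable: bool, replicas: List[ReplicaView]) -> str:
    """Combine the master's own state with its replicas' view of it."""
    if master_reachable:
        return "NoProblem"
    reachable = [r for r in replicas if r.reachable]
    # The master is unreachable AND every replica we can reach agrees it is gone.
    if reachable and all(not r.io_thread_running for r in reachable):
        return "DeadMaster"
    # We cannot see the master, but the replicas do not confirm a failure:
    # a not-quite-failure analysis, no failover.
    return "UnreachableMaster"
```

In this sketch a single failed probe from the orchestrator node never triggers a failover by itself; it only yields the "I can't see the master" analysis mentioned in the side note.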

Faster master recoveries

Promoting a new master is a complex task which attempts to promote the best replica out of the pool of replicas. It's not always the most up-to-date replica. The choice varies depending on replica configuration, version, and state.

With recent changes, orchestrator is able to recognize, early on, that the replica it would like to promote as master is ideal. Assuming that is the case, orchestrator is able to immediately promote it (i.e. run hooks, set read_only=0 etc.), and run the rest of the failover logic, i.e. the rewiring of replicas under the newly promoted master, asynchronously.

This allows the promoted server to take writes sooner, even while its replicas are not yet connected. It also means external hooks are executed sooner.
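
The ordering can be sketched as follows. This is a toy Python illustration, not orchestrator's code: promote_then_rewire and the promote/rewire callbacks are hypothetical stand-ins for "run hooks, set read_only=0" and "repoint a replica", and the host names are made up.

```python
import threading
from typing import Callable, List


def promote_then_rewire(
    candidate: str,
    replicas: List[str],
    promote: Callable[[str], None],
    rewire: Callable[[str, str], None],
) -> threading.Thread:
    """Promote an ideal candidate right away; rewire its replicas in the background."""
    # Synchronous part: hooks, read_only=0 etc. Writes can resume once this returns.
    promote(candidate)

    def rewire_all() -> None:
        for replica in replicas:
            rewire(replica, candidate)

    # Asynchronous part: the rest of the failover logic.
    worker = threading.Thread(target=rewire_all, name="rewire-replicas")
    worker.start()
    return worker


# Usage: record the order in which things happen.
events: List[str] = []
worker = promote_then_rewire(
    "db2",
    ["db3", "db4"],
    promote=lambda host: events.append(f"promote {host}"),
    rewire=lambda replica, master: events.append(f"rewire {replica} under {master}"),
)
worker.join()
print(events)
```

The promotion is always recorded first, because the background thread only starts after promote() has returned.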

Between faster detection and faster recoveries, we're looking at some 10sec reduction in overall recovery time: from the moment of the crash to the moment a new master accepts writes. We now stand at < 20sec in almost all cases, and < 15sec in optimal cases. Those times are measured on our failover tests.

We are working on reducing failover time unrelated to orchestrator and hope to update soon.

Automated Pseudo-GTID

As a reminder, Pseudo-GTID is an alternative to GTID, without the kind of commitment you make with GTID. It provides the same kind of "point your replica under any other server" behavior that GTID allows.

There are still many setups out there where GTID is not (yet?) deployed and enabled. However, Pseudo-GTID is often misunderstood, and though I've blogged about and presented Pseudo-GTID many times in the past, I still find myself explaining to people that the setup is simple and does not involve changes to one's topologies.

Well, it just got simpler. orchestrator is now able to automatically inject Pseudo-GTID for you.

Say the word: "AutoPseudoGTID": true, grant the necessary privilege, and your non-GTID topology is suddenly supercharged with magical Pseudo-GTID tokens that provide you with:

  • Arbitrary relocation of replicas
  • Automated or manual failovers (masters and intermediate masters)
  • Vendor freedom: runs on Oracle MySQL, Percona Server, MariaDB, or all of the above at the very same time.
  • Version freedom (still on 5.5? No problem. Oh, this gets you crash-safe replication as an extra bonus, too)

Auto-Pseudo-GTID further simplifies the infrastructure in that you no longer need to take care of injecting Pseudo-GTID onto the master as well as handle master identity changes. No more event_scheduler to enable/disable nor services to start/stop.
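
For readers new to the idea, here is a heavily simplified Python sketch of how injected tokens make "point your replica under any other server" possible. It assumes logs are plain lists of strings and that tokens look like "pseudo-gtid:<something>"; both are assumptions for illustration only. Real binary logs are files with coordinates, and orchestrator's actual matching logic is considerably more involved.

```python
from typing import List, Optional


def is_pseudo_gtid(event: str) -> bool:
    # Hypothetical token format, for this sketch only.
    return event.startswith("pseudo-gtid:")


def find_matching_position(replica_log: List[str], target_log: List[str]) -> Optional[int]:
    """Return the index in target_log from which the replica should resume, or None."""
    # 1. Find the last Pseudo-GTID token the replica has seen.
    token_idx = None
    for i in range(len(replica_log) - 1, -1, -1):
        if is_pseudo_gtid(replica_log[i]):
            token_idx = i
            break
    if token_idx is None:
        return None
    # 2. Locate the same (unique) token in the target server's log.
    try:
        target_idx = target_log.index(replica_log[token_idx])
    except ValueError:
        return None
    # 3. Walk past the events the replica already applied after the token.
    tail = replica_log[token_idx + 1:]
    start = target_idx + 1
    if target_log[start:start + len(tail)] != tail:
        return None  # logs diverge, or the target is behind the replica
    return start + len(tail)


target = ["pseudo-gtid:a", "insert 1", "pseudo-gtid:b", "insert 2", "insert 3", "insert 4"]
replica = ["pseudo-gtid:a", "insert 1", "pseudo-gtid:b", "insert 2"]
position = find_matching_position(replica, target)
print(position, target[position])
```

Because the tokens are unique and appear in every server's logs, no server needs to know in advance which other server it may be pointed at.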

More and more setups are moving to GTID. We may, too! But I find it peculiar that Pseudo-GTID was suggested 4 years ago, when 5.6 GTID was already released, and still many setups are not yet running GTID. If you're not using GTID, please try Pseudo-GTID! Read more.

Semi-sync support

Semi-sync has been internally supported via a specialized patch contributed by Vitess, to flag a server as semi-sync-able and handle enablement of semi-sync upon master failover.

orchestrator now supports semi-sync more generically. You may use orchestrator to enable or disable semi-sync on the master or replica side via the following commands (or their API equivalents):

  • orchestrator -c enable-semi-sync-master
  • orchestrator -c enable-semi-sync-replica
  • orchestrator -c disable-semi-sync-master
  • orchestrator -c disable-semi-sync-replica

The API will also tell you whether semi-sync is enabled on instances. Note that configured != enabled. A server can be configured with rpl_semi_sync_master_enabled=ON, but if no semi-sync replicas are found, the Rpl_semi_sync_master_status state is OFF.
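
The distinction can be sketched in a few lines of Python. The function name and the returned labels are made up for this example; the inputs are assumed to be the results of SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master_enabled' and SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_status' on the master, passed in as plain dictionaries so the sketch needs no database connection.

```python
from typing import Mapping


def semi_sync_master_state(variables: Mapping[str, str], status: Mapping[str, str]) -> str:
    """Tell apart 'configured' from 'actually in effect' on a master."""
    configured = variables.get("rpl_semi_sync_master_enabled", "OFF").upper() == "ON"
    active = status.get("Rpl_semi_sync_master_status", "OFF").upper() == "ON"
    if configured and active:
        return "enabled"
    if configured:
        # Configured, but not in effect, e.g. no semi-sync replicas are connected.
        return "configured-only"
    return "disabled"


state = semi_sync_master_state(
    {"rpl_semi_sync_master_enabled": "ON"},
    {"Rpl_semi_sync_master_status": "OFF"},
)
print(state)
```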

More

UI changes, removal of prepared statements, documentation updates, raft updates...

orchestrator is free and open source and released under the Apache 2 license. It is authored at and used by GitHub.

I'll be presenting orchestrator/raft at FOSDEM next week, in the MySQL and Friends Room.

 