orchestrator 3.0.6: faster crash detection & recoveries, auto Pseudo-GTID, semi-sync and more

orchestrator 3.0.6 is released and includes some exciting improvements and features. It quickly follows 3.0.5, which was released recently, and this post gives a breakdown of some notable changes:

Faster failure detection

Recall that orchestrator uses a holistic approach for failure detection: it reads state not only from the failed server (e.g. master) but also from its replicas. orchestrator now detects failure faster than before:

  • A detection cycle has been eliminated, leading to quicker resolution of a failure. On our setup, where we poll servers every 5sec, failure detection time dropped from 7-10sec to 3-5sec without sacrificing reliability: the reduction in time does not lead to increased false positives.
    Side note: you may see an increase in not-quite-failure analyses such as “I can’t see the master” (UnreachableMaster).
  • Better handling of network scenarios where packets are dropped. Instead of hanging until the TCP timeout, orchestrator now observes server discovery asynchronously (see the sketch after this list). We have specialized failover tests that simulate dropped packets. The change reduces detection time by some 5sec.
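
To illustrate the asynchronous-probe idea (this is not orchestrator's actual code), here is a minimal Go sketch. probeServer, the address, and the timeout value are hypothetical; the point is simply to bound the wait rather than hang until the TCP stack gives up.

package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// probeServer is a hypothetical stand-in for a single discovery/health probe
// (e.g. connecting to the server and reading its status).
func probeServer(ctx context.Context, addr string) error {
    // ... dial addr and read status, honoring ctx cancellation ...
    return nil
}

// probeAsync runs the probe in its own goroutine and gives up after timeout,
// rather than hanging until the operating system's TCP timeout kicks in.
func probeAsync(addr string, timeout time.Duration) error {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    done := make(chan error, 1)
    go func() { done <- probeServer(ctx, addr) }()

    select {
    case err := <-done:
        return err
    case <-ctx.Done():
        return errors.New("probe timed out; treating " + addr + " as unreachable")
    }
}

func main() {
    // hypothetical address and timeout, for illustration only
    if err := probeAsync("db-master:3306", 2*time.Second); err != nil {
        fmt.Println("detection:", err)
    }
}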

Faster master recoveries

Promoting a new master is a complex task that attempts to promote the best replica out of the pool of replicas. That is not always the most up-to-date replica; the choice depends on replica configuration, version, and state.

With recent changes, orchestrator is able to recognize, early on, that the replica it would like to promote as master is ideal. Assuming that is the case, orchestrator immediately promotes it (i.e. runs hooks, sets read_only=0 etc.), and runs the rest of the failover logic, i.e. the rewiring of replicas under the newly promoted master, asynchronously.

This allows the promoted server to take writes sooner, even while its replicas are not yet connected. It also means external hooks are executed sooner.
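
The flow is roughly "promote first, rewire later". Here is a simplified Go sketch of that idea; it is not orchestrator's implementation, and Instance, promote and rewireUnder are hypothetical placeholder names.

package main

// Instance is a hypothetical stand-in for a MySQL server known to the failover logic.
type Instance struct{ Key string }

// promote and rewireUnder are hypothetical placeholders for the real operations.
func promote(master *Instance)              {} // run hooks, set read_only=0
func rewireUnder(master, replica *Instance) {} // repoint replica under the new master

// failover sketches the "promote first, rewire later" flow: the ideal candidate is
// promoted immediately so it can take writes, and the remaining replicas are
// repointed asynchronously.
func failover(candidate *Instance, replicas []*Instance) {
    promote(candidate) // the new master can accept writes from this point on
    go func() {
        for _, replica := range replicas {
            rewireUnder(candidate, replica)
        }
    }()
}

func main() {
    failover(&Instance{Key: "replica-1"}, []*Instance{{Key: "replica-2"}, {Key: "replica-3"}})
}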

Between faster detection and faster recoveries, we’re looking at roughly a 10sec reduction in overall recovery time: from the moment of crash to the moment a new master accepts writes. We now stand at < 20sec in almost all cases, and < 15sec in optimal cases. These times are measured on our failover tests.

We are also working on reducing the parts of failover time that are unrelated to orchestrator, and hope to update soon.

Automated Pseudo-GTID

As a reminder, Pseudo-GTID is an alternative to GTID, without the kind of commitment you make with GTID. It provides the same “point your replica under any other server” behavior that GTID allows.

Implementing non re-entrant functions in Golang

A non re-entrant function is a function that can only be executing once at any point in time, regardless of how many times it is invoked and by how many goroutines.

This post illustrates implementations of blocking and yielding non re-entrant functions in Go.

A use case

A service is polling for some conditions, monitoring some statuses once per second. We want each status to be checked independently of others without blocking. An implementation might look like:

package main

import "time"

// CheckSomeStatus and CheckAnotherStatus are placeholder status checks so the example compiles.
func CheckSomeStatus()    {}
func CheckAnotherStatus() {}

func main() {
    tick := time.Tick(time.Second)
    go func() {
        for range tick {
            go CheckSomeStatus()
            go CheckAnotherStatus()
        }
    }()
    select {} // block main so the polling goroutine keeps running
}

We choose to run each status check in its own goroutine so that CheckAnotherStatus() doesn’t have to wait for CheckSomeStatus() to complete.

Each of these checks typically takes a very short amount of time, much less than a second. What happens, though, if CheckAnotherStatus() itself takes more than one second to run? Perhaps there’s an unexpected network or disk latency affecting the execution time of the check.

Does it make sense for the function to be executed twice at the same time? If not, we want it to be non re-entrant.
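
The full post goes on to show the implementations. As a taste of the “yielding” flavor, here is a minimal sketch using sync/atomic: a second invocation that finds a check already in flight simply returns. The flag name and the simulated work are illustrative only, not necessarily the post's exact code.

package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

// checkInProgress is a hypothetical flag guarding CheckAnotherStatus:
// 0 means idle, 1 means a check is currently executing.
var checkInProgress int32

// CheckAnotherStatus is a "yielding" non re-entrant function: if another
// invocation is already running, it returns immediately instead of blocking.
func CheckAnotherStatus() {
    if !atomic.CompareAndSwapInt32(&checkInProgress, 0, 1) {
        return // another invocation is in flight; yield
    }
    defer atomic.StoreInt32(&checkInProgress, 0)

    // the actual (possibly slow) status check goes here
    time.Sleep(2 * time.Second)
    fmt.Println("status checked")
}

func main() {
    for i := 0; i < 3; i++ {
        go CheckAnotherStatus()
    }
    time.Sleep(3 * time.Second) // only one "status checked" line is printed
}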