Orchestrator progress

This comes mostly to reassure, having moved into GitHub: orchestrator development continues.

I will have the privilege of working on this open source solution in GitHub. There are a few directions we can take orchestrator to, and we will be looking into the possibilities. We will continue to strengthen the crash recovery process, and in fact I’ve got a couple ideas on drastically shortening Pseudo-GTID recovery time as well as other debts. We will look into yet other directions, which we will share. My new and distinguished team will co-work on/with orchestrator and will no doubt provide useful and actionable input.

Orchestrator continues to be open for pull requests, with a temporal latency in response time (it’s the Holidays, mostly).

Some Go(lang) limitations (namely the import path, I’ll blog more about it) will most probably imply some changes to the code, which will be well communicated to existing collaborators.

Most of all, we will keep orchestrator a generic solution, while keeping focus on what we think is most important – and there’s some interesting vision here. Time will reveal as we make progress.

 

Joining GitHub

Today was my last day at Booking.com, and shortly I will be joining the team at GitHub.

I’d like to thank the many kind, friendly & smart people I’ve worked with at Booking.com!

The challenges at Booking.com are big. There is such a diversity within the technology stack; even within the database range. A solution that works on all the various Booking.com production environments is something to value. Indeed, the Booking.com Production environment it is an amazing playground for developers, offering high volume, large numbers, and differing workloads to tackle. Your code just gets hammered down and you get very quick feedback on whether you did it right or wrong.

I was happy to have worked on serious reliability and operational topics, and to have made a meaningful contribution.

Joining GitHub, I’m to be a systems engineer in a great team (friends included), building great products, in and around the database zone, delivering open source, pretty much expecting to do awesome stuff! That, and the swag.

Forking Golang repositories on GitHub and managing the import path

Problem: there’s an awesome Golang project on GitHub which you want to fork. You want to develop & collaborate on that fork, but the golang import path, in your source code, still references the original path, breaking everything.

A couple solutions offered below. First, though, let’s get some names.

A sample case, the problem at hand

There’s an awesome tool on http://github.com/awsome-org/tool. You successfully fork it onto http://github.com/awesome-you/tool.

You want to collaborate on http://github.com/awesome-you/tool; you wish to pull, commit & push. Maybe you want to send pull requests to the origin.

The following is commonly found throughout .go files in the repository:

import (
    "github.com/awesome-org/tool/config"
    "github.com/awesome-org/tool/driver"
    "github.com/awesome-org/tool/net"
    "github.com/awesome-org/tool/util"
)

If you:

go get http://github.com/awesome-you/tool

golang creates your $GOPATH/src/github.com/awesome-you/tool/, which is awesome. However, as you resolve dependencies via

cd $GOPATH/src/github.com/awesome-you/tool/ ; go get ./...

golang digs into the source code, finds references to github.com/awesome-org/tool/configgithub.com/awesome-org/tool/driver etc, and fetches those from http://github.com/awsome-org/tool and onto $GOPATH/src/github.com/awesome-org/tool/, which is not awesome. You actually have two copies of the code, one from your fork, one from the origin, and your own fork will be largely ignored as it mostly points back to the origin.

A bad solution

The dirty, bad solution would be for you to go over the source code and replace “github.com/awesome-org/tool” entries with “github.com/awesome-you/tool”. It is bad for two reasons:

  • You will not be able to further pull changes from upstream
  • You will not be able to pull-request and push your own changes upstream

Continue reading » “Forking Golang repositories on GitHub and managing the import path”

State of automated recovery via Pseudo-GTID & Orchestrator @ Booking.com

This post sums up some of my work on MySQL resilience and high availability at Booking.com by presenting the current state of automated master and intermediate master recoveries via Pseudo-GTID & Orchestrator.

Booking.com uses many different MySQL topologies, of varying vendors, configurations and workloads: Oracle MySQL, MariaDB, statement based replication, row based replication, hybrid, OLTP, OLAP, GTID (few), no GTID (most), Binlog Servers, filters, hybrid of all the above.

Topologies size varies from a single server to many-many-many. Our typical topology has a master in one datacenter, a bunch of slaves in same DC, a slave in another DC acting as an intermediate master to further bunch of slaves in the other DC. Something like this, give or take:

booking-topology-sample

However as we are building our third data center (with MySQL deployments mostly completed) the graph turns more complex.

Two high availability questions are:

  • What happens when an intermediate master dies? What of all its slaves?
  • What happens when the master dies? What of the entire topology?

This is not a technical drill down into the solution, but rather on overview of the state. For more, please refer to recent presentations in September and April.

At this time we have:

  • Pseudo-GTID deployed on all chains
  • Pseudo-GTID based automated failover for intermediate masters on all chains
  • Pseudo-GTID based automated failover for masters on roughly 30% of the chains.
    • The rest of 70% of chains are set for manual failover using Pseudo-GTID.

Pseudo-GTID is in particular used for:

  • Salvaging slaves of a dead intermediate master
  • Correctly grouping and connecting slaves of a dead master
  • Routine refactoring of topologies. This includes:
    • Manual repointing of slaves for various operations (e.g. offloading slaves from a busy box)
    • Automated refactoring (for example, used by our automated upgrading script, which consults with orchestrator, upgrades, shuffles slaves around, updates intermediate master, suffles back…)
  • (In the works), failing over binlog reader apps that audit our binary logs.

Continue reading » “State of automated recovery via Pseudo-GTID & Orchestrator @ Booking.com”

Orchestrator & Pseudo-GTID for binlog reader failover

One of our internal apps at Booking.com audits changes to our tables on various clusters. We used to use tungsten replicator, but have since migrated onto our own solution.

We have a binlog reader (uses open-replicator) running on a slave. It expects Row Based Replication, hence our slave runs with log-slave-updates, binlog-format=’ROW’, to translate from the master’s Statement Based Replication. The binlog reader reads what it needs to read, audits what it needs to audit, and we’re happy.

However what happens if that slave dies?

In such case we need to be able to point our binlog reader to another slave, and it needs to be able to pick up auditing from the same point.

This sounds an awful lot like slave repointing in case of master/intermediate master failure, and indeed the solutions are similar. However our binlog reader is not a real MySQL server and does not understands replication. It does not really replicate, it just parses binary logs.

We’re also not using GTID. But we are using Pseudo-GTID. As it turns out, the failover solution is already built in by orchestrator, and this is how it goes:

Normal execution

Our binlog app reads entries from the binary log. Some are of interest for auditing purposes, some are not. An occasional Pseudo-GTID entry is found, and is being stored to ZooKeeper tagged as  “last seen and processed Pseudo-GTID”.

Upon slave failure

We recognize the death of a slave; we have other slaves in the pool; we pick another. Now we need to find the coordinates from which to carry on.

We read our “last seen and processed Pseudo-GTID”. Say it reads:

drop view if exists `meta`.`_pseudo_gtid_hint__asc:56373F17:00000000012B1C8B:50EC77A1`

. We now issue:

$ orchestrator -c find-binlog-entry -i new.slave.fqdn.com --pattern='drop view if exists `meta`.`_pseudo_gtid_hint__asc:56373F17:00000000012B1C8B:50EC77A1`'

The output of such command are the binlog coordinates of that same entry as found in the new slave’s binlogs:

binlog.000148:43664433

Pseudo-GTID entries are only injected once every few seconds (5 in our case). Either: Continue reading » “Orchestrator & Pseudo-GTID for binlog reader failover”

Thoughts on MaxScale automated failover (and Orchestrator)

Having attended a talk (as part of the MariaDB Developer Meeting in Amsterdam) about recent developments of MaxScale in executing automated failovers, here are some (late) observations of mine.

I will begin by noting that the project is stated to be pre-production, and so of course none of the below are complaints, but rather food for thought, points for action and otherwise recommendations.

Some functionality of the MaxScale failover is also implemented by orchestrator, which I author. Orchestrator was built in production environments by and for operational people. In this respect it has gained many insights and had to cope with many real-world cases, special cases & Murphy’s law cases. This post compares logic, feature set and capabilities of the two where relevant. To some extent the below will read as “hey, I’ve already implemented this; shame to re-implement the same”, and indeed I think that way; but it wouldn’t be the first time a code of mine would just be re-implemented by someone else and I’ve done the same, myself.

I’m describing the solution the way I understood it from the talk. If I’m wrong on any account I’m happy to be corrected via comments below. Edit: please see comment by Dipti Joshi

General overview

The idea is that MaxScale operates as a proxy to your topology. You do not connect to your master directly, but rather through MaxScale. Thus, MaxScale acts as a proxy to your master.

The next phase is that MaxScale would also auto-detect master failure, fix the topology for you, promote a new master, and will have your application unaware of all the complexity and without the app having to change setup/DNS/whatever. Of course some write downtime is implied.

Now for some breakdown.

Detection

The detection of a dead master, the check by which a failover is initiated, is based on MaxScale not being able to query the master. This calls for some points for consideration:

  • Typically, I would see “I can’t connect to the master therefore failover” as too hysterical, and the basis for a lot of false positives.
  • However, since in the discussed configuration MaxScale is the only access point to the master, the fact MaxScale cannot connect to the master means the master is inaccessible de-facto.
  • In light of the above, the decision makes sense – but I still hold that it would make false positives.
  • I’m unsure (I think not; can anyone comment?) if MaxScale would make multiple attempts over time and only reach the conclusion after X successive failures. This would reduce the false positives.
  • I’m having a growing dislike to a “check for 4 successive times then alert/failover” Nagios-style behavior. Orchestrator takes a different approach where it recognizes a master’s death by not being able to connect to the master as well as being able to connect to 1st tier slaves, check their status and observe that they’re unable to connect to the master as well. See What makes a MySQL server failure/recovery case?. This approach still calls for further refinement (what if the master is temporarily deadlocked? Is this a failover or not?).

Continue reading » “Thoughts on MaxScale automated failover (and Orchestrator)”

SQL mini hack of the day, inverted IN clause

We are used to issue queries with an IN clause of the form:

... where state in ('started', 'completed', 'failed') ...

However I’ve had a few cases where I used an inverted format. Here’s one use case followed by an inverted IN clause.

Dynamic query building

Say we have this function:

GetLaggingSlaves(clusterName string)

Which, based on whether given clusterName is empty or not, would return list of all lagging slaves, or only those in the given cluster, respectively: Continue reading » “SQL mini hack of the day, inverted IN clause”

Leader election using MySQL

Being a stateful, centralized datastore, MySQL can serve in negotiating leadership: a mechanism to elect a single service out of multiple services; moreover, a mechanism to promote a new leader should the existing leader cease to function.

What of Zookeeper?

Zookeeper makes for an excellent leader election mechanism. This is one of the most recognized uses for Zookeeper. It has HA via multiple nodes & quorum,  ephemeral nodes, all you need. To achieve similar benefits with MySQL you’d need to use Galera or NDB Cluster; so why not use Zk? The use case at hand is

orchestrator, a multi-node, mostly stateless service that happens to use MySQL as backend datastore. Ir relies on MySQL to exist in backend. It already expects it to be there. If the MySQL server is down, so is the service, effectively. In such case it doesn’t hurt adding another dependency on MySQL; this does not reduce HA. You need to take care of MySQL HA anyhow so there’s no additional cost. In fact, going to Zookeeper makes the additional cost as you introduce a new component to the system that can be avoided.

Terms of the solution

Our proposed solution offers:

  • Single leader election out of multiple nodes
  • Leader actively reaffirms its leadership periodically
  • Timeout based re-election: decision to re-elect new leader based on the fact current leader has not reaffirmed its leadership over X seconds
  • A way to forcibly assume leadership for a specific node
  • A way to forcibly call for re-elections by demoting existing leader
  • A node/service can easily tell whether it’s the leader or not
  • Anyone can tell who the leader is

SQL solution

The solution is composed of a single table and a set of queries which implement the above offers. We assume a service can uniquely identify itself; this is easy to achieve:

  • If services are running from different hosts (as should be the case, this is service HA), use hostname for ID
    • But what if the service restarts? Are you good with calling this “the same service” or is this now a new service running on the same host?
  • In such case use combination of hostname & OS process ID
    • Or generate a random token upon startup
    • Or use startup timestamp Whichever solution you pick, make sure it is human readable, such that it is easy to tell

which service is the leader. This helps operations. We note this as service_id

Table

The following table will have a single row; the

service_id in that row is the active leader.

CREATE TABLE service_election ( 
  anchor tinyint(3) unsigned NOT NULL, 
  service_id varchar(128) NOT NULL, 
  last_seen_active timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, 
  PRIMARY KEY (anchor) 
) ENGINE=InnoDB 

Attempt election

Continue reading » “Leader election using MySQL”

New statements I’d like to see in MySQL 5.8

Following up on New features I’d like to see in MySQL 5.8, here are statements I would like to see in MySQL 5.8:

  • ENABLE EVENTS;
    When promoting a slave to master, I want to be able to enable all those events that are in SLAVESIDE_DISABLED state. Today I script an iteration over the events an constructing the ALTER EVENT…ENABLE statement one by one. Just activate those!
  • SKIP GTID TRANSACTION;
    I have a transaction on slave that I want to skip, and there’s GTID. The sequence of

    STOP SLAVE;
    SET GTID_NEXT="...";
    BEGIN;
    COMMIT;
    SET GTID_NEXT="AUTOMATIC";
    START SLAVE;

    is just something I don’t want to do. To compute the GTID_NEXT; to open a transaction; to use session variables; this may seem straightforward to import the above from a shell script, but calculating the next GTID is not entirely trivial; issuing the above from your programming language makes for a weird “all these have to be in the same session AND you’re going to do a transaction meanwhile”. With golang it’s actually a problem.
    Make it simple for me. I’m willing to do the STOP/START SLAVE.

  • BINLOG ENTRY ‘<arbitrary text>’;
    Very selfishly, I want to be able to inject a comment into the binary log, of arbitrary text. I want this comment to appear in SBR format, as if it were a DDL.
    My selfish reason: injection of Pseudo-GTID. But I can see various other use cases, such as application level injection of app-logic checkpointing; chef/glu injection of “code deployed at this time”; application injection of “daily audit done to this point”. This is too cool and too easy to skip.
  • SHOW RELAY LOGS;
    Similar to SHOW BINARY LOGS;
  • PURGE RELAY LOGS TO ‘…’;
    Similar to PURGE BINARY LOGS TO ‘…’;
    It’s time relay logs stopped being 2nd class citizens.
  • SHOW NONBLOCKING [GLOBAL|SESSION] STATUS;
    Issue a SHOW GLOBAL|SESSION STATUS query that only shows those variables for which it does not need to block. i.e. this is a safe, fast “show me everything you’ve got that I won’t need to pay for”.

Yes, yes, statements are also features, I know.

 

New features I’d like to see in MySQL 5.8

Following up on Morgan Tocker’s What would you like to see in MySQL 5.8?, having attended and participated at the brainstorming at Percona Live Amsterdam, and publishing this post while failing to comply with any of Morgan’s suggested media, these are the features I would like to see in MySQL 5.8:

  • Dynamicly enable/disable log-bin and log-slave-updates
    Today, when changing chef/puppet role of a server from a simple slave to an intermediate master and vice versa, a MySQL restart is required. This is a very big pain which makes replication automation complex, not to mention warmup times.
  • nice“.
    I want to be able to execute a query that is nice, i.e has lower priority; will not consume all resources; will stall/throttle so as to allow other queries to complete. Luis asked and I said this could be on a per statement basis, e.g. add a SQL_NICE query hint. But I’m unsure that would be the correct behavior. It also makes sense to do so on a per connection basis (perhaps provide connection attributed to hint niceness?).
  • Online-ier ALTER TABLE. I would in particular want it to apply the nice feature, above. Otherwise throttle by user defined metrics.
  • Online-ier ALTER TABLE in replication stream.  Can the slaves run the ALTER statement in parallel?
  • Re-Group Commit: in MTS, and when intermediate masters involved, copy+paste the group commit as applied on master as working downstream. I suspect this is easily achievable. The result: same parallelism for replicating slaves in all levels, whether they replicate directly from master or from 2nd, 3rd tier intermediate masters. Today parallelism decreases as we go downstream.
  • Global user-defined-variables. I want to be able to define arbitrary (global) variables that I can later query via SELECT @@global.arbitrary. This would be similar to HTML 5‘s “data-*” attributes. I often wish I could tell & ask MySQL my puppet role; or the server status (is it live? Is it offline? Does it belong to a specific pool? etc.). Similar to “loose-*” syntax, this could be a “data-*” or “user-*” name prefix system.

I will follow up on new statements I would like to see in MySQL 5.8.

The brainstorming session at PerconaLive, I should note, was pure joy, and apart from getting two nice furry dolphins I enjoyed the engagement, the diversity of ideas, and the fact Oracle engineers (Mark in particular) were very busy taking notes or otherwise openly discussing the viability of some requested features.