orchestrator 3.0.2 GA released: raft consensus, SQLite

orchestrator 3.0.2 GA is released and available for download (see also packagecloud repository).

3.0.2 is the first stable release in the 3.0* series, introducing (recap from 3.0 pre-release announcement):

orchestrator/raft

Raft is a consensus protocol, supporting leader election and consensus across a distributed system.  In an orchestrator/raft setup orchestrator nodes talk to each other via raft protocol, form consensus and elect a leader. Each orchestrator node has its own dedicated backend database. The backend databases do not speak to each other; only the orchestrator nodes speak to each other.

No MySQL replication setup needed; the backend DBs act as standalone servers. In fact, the backend server doesn’t have to be MySQL, and SQLite is supported. orchestrator now ships with SQLite embedded, no external dependency needed.

For details, please refer to the documentation.

SQLite

Suggested and requested by many was the removal of orchestrator's own dependency on a MySQL backend. orchestrator now supports a SQLite backend.

SQLite is a transactional, relational, embedded database, and as of 3.0 it is embedded within orchestrator, no external dependency required.

orchestrator-client

orchestrator-client is a client shell script which mimics the command line interface, while running curl | jq requests against the HTTP API. It simplifies your deployments: interacting with the orchestrator service via orchestrator-client is easier and only requires you to place a shell script (as opposed to installing the orchestrator binary + configuration file).

orchestrator-client is the way to interact with your orchestrator/raft cluster. orchestrator-client now has its own RPM/deb release package.

You may still use the web interface and web API; and a special --ignore-raft-setup flag keeps power in your hands (use at your own risk).

State of orchestrator/raft

orchestrator/raft is a big change: Continue reading » “orchestrator 3.0.2 GA released: raft consensus, SQLite”

Speaking at August Penguin, MySQL Track, GitHub sponsored

This Thursday I’ll be presenting at August Penguin, conveniently taking place September 7th, 8th, Ramat Gan, Israel.

I will be speaking as part of the MySQL track, 2nd half of Thursday. The (Hebrew) schedule is here.

My talk is titled Reliable failovers, safe schema migrations: open source solutions to MySQL problems. I will describe some of the open source MySQL infrastructure work we run at GitHub, and how it addresses reliability, availability and usability. I’ll describe some of our internal workflows and our use of chat and chatops.

I’m proud to announce that GitHub is sponsoring the event. We won’t have a booth, but please do grab me in the hallways or over lunch to chat!

And, yes, octocat stickers will be made available 🙂

 

orchestrator/raft: Pre-Release 3.0

orchestrator 3.0 Pre-Release is now available. Most notable are Raft consensus, SQLite backend support, orchestrator-client no-binary-required client script.

TL;DR

You may now set up high availability for orchestrator via raft consensus, without the need to set up high availability for orchestrator's backend MySQL servers (such as Galera/InnoDB Cluster). In fact, you can run an orchestrator/raft setup using an embedded SQLite backend DB. Read on.

orchestrator still supports the existing shared backend DB paradigm; nothing dramatic changes if you upgrade to 3.0 and do not configure raft.

orchestrator/raft

Raft is a consensus protocol, supporting leader election and consensus across a distributed system.  In an orchestrator/raft setup orchestrator nodes talk to each other via raft protocol, form consensus and elect a leader. Each orchestrator node has its own dedicated backend database. The backend databases do not speak to each other; only the orchestrator nodes speak to each other.

No MySQL replication setup needed; the backend DBs act as standalone servers. In fact, the backend server doesn’t have to be MySQL, and SQLite is supported. orchestrator now ships with SQLite embedded, no external dependency needed. Continue reading » “orchestrator/raft: Pre-Release 3.0”

gh-ost 1.0.17: Hooks, Sub-second lag control, Amazon RDS and more

gh-ost version 1.0.17 is now released, with various additions and fixes. Here are some notes of interest:

Hooks

gh-ost now supports hooks. These are your own executables that gh-ost will invoke at particular points of interest (validation pass, about to cut-over, success, failure, status, etc.)

gh-ost will set various environment variables for your executables to pick up, passing along such information as migrated/ghost table name, elapsed time, processed rows, migrated host etc.

Sub-second lag control

At GitHub we’re very strict about replication lag. We keep it well under 1 second at most times. gh-ost can now identify sub-second lag on replicas (well, you need to supply it with the right query). Our current production migrations are set by default with --max-lag-millis=500 or less, and our most intensive migrations keep replication lag well below 1 second, or even below 500ms.
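
By way of illustration only, a sub-second lag query typically samples a heartbeat table that the master updates at high frequency; the meta.heartbeat table and ts column below are hypothetical stand-ins for whatever heartbeat mechanism you run, and ts is assumed to have sub-second precision (e.g. TIMESTAMP(6)):

-- hypothetical sub-second lag check against a heartbeat table updated on the master
SELECT UNIX_TIMESTAMP(NOW(6)) - MAX(UNIX_TIMESTAMP(ts)) AS lag_seconds
FROM meta.heartbeat;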

No SUPER

The SUPER privilege is required to set global binlog_format='ROW' and for STOP SLAVE; START SLAVE;

If you know your replica has RBR, you can pass --assume-rbr, which skips those steps.
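
A minimal sanity check before passing --assume-rbr, just to confirm the replica already runs with row-based binary logging:

-- confirm the replica already uses row-based binary logging
SELECT @@global.binlog_format;
-- expected result: ROW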

RDS

Hooks + No SUPER = RDS, as seems to be the case. For --test-on-replica you will need to supply your own gh-ost-on-stop-replication hook, to stop your RDS replica at the cut-over phase. See this tracking issue. Continue reading » “gh-ost 1.0.17: Hooks, Sub-second lag control, Amazon RDS and more”

Introducing gh-ost: triggerless online schema migrations

I’m thoroughly happy to introduce gh-ost: triggerless, controllable, auditable, testable, trusted online schema change tool released today by GitHub.

gh-ost now powers our production schema migrations. We hit some serious limitations using pt-online-schema-change on our large volume, high traffic tables, to the effect of driving our database to a near grinding halt or even to the extent of causing outages. With gh-ost, we are now able to migrate our busiest tables at any time, peak hours and heavy workloads included, without causing impact to our service.

gh-ost supports testing in production. It goes a long way to build trust, both in integrity and in control. Are your databases just too busy for existing online-schema-change tools? Have you suffered outages due to migrations? Are you tired of babysitting migrations that run up to 3:00am? Tired of being the only one tailing logs? Please, take a look at gh-ost. I believe it changes the online migration paradigm.

For a more thorough overview, please read the announcement on the GitHub Engineering Blog, and proceed to the documentation.

gh-ost is open sourced under the MIT license.

Solving the non-atomic table swap, Take III: making it atomic

At the risk of giving the unintended impression of live blogging, we now follow up on Solving the non-atomic table swap, Take II and Solving the Facebook-OSC non-atomic table swap problem, with a safe, blocking, atomic solution.

Why yet another iteration?

The solution presented in Solving the non-atomic table swap, Take II was good, in that it was safe. No data corruption. Optimistic: if no connection is killed throughout the process, it is completely blocking.

Two outstanding issues remained:

  • If something did go wrong, the solution reverted to a table outage
  • On replicas, the table swap is non-atomic and non-blocking. There’s a table-outage scenario on the replica.

As it turns out, there’s a simpler solution which overcomes both of the above. As with math and physics, the simpler solution is often the preferred one. But it took those previous iterations to gather a few ideas together. So, anyway:

Safe, locking, atomic, asynchronous table swap

Do read the aforementioned previous posts; the quick-quick recap is: we want to be able to LOCK a table tbl, then do some stuff, then swap it out and put some ghost table in its place. MySQL does not allow us to rename tbl to tbl_old, ghost to tbl if we have locks on tbl in that session.

The solution we offer is now based on two connections only (as opposed to three, in the optimistic approach). “Our” connections will be C10, C20. The “normal” app connections are C1..C9, C11..C19, C21..C29. A condensed transcript of the two sessions appears after the list below.

  • Connections C1..C9 operate on tbl with normal DML: INSERT, UPDATE, DELETE
  • Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
  • Connection C10: LOCK TABLES tbl WRITE, tbl_old WRITE
  • Connections C11..C19, newly incoming, issue queries on tbl but are blocked due to the LOCK
  • Connection C20: RENAME TABLE tbl TO tbl_old, ghost TO tbl
    This is blocked due to the LOCK, but gets prioritized on top of connections C11..C19 and on top of C1..C9 or any other connection that attempts DML on tbl
  • Connections C21..C29, newly incoming, issue queries on tbl but are blocked due to the LOCK and due to the RENAME, waiting in queue
  • Connection C10: checks that C20’s RENAME is applied (looks for the blocked RENAME in processlist)
  • Connection C10: DROP TABLE tbl_old
    Nothing happens yet; tbl is still locked. All other connections still blocked.
  • Connection C10: UNLOCK TABLES
    BAM!
    The RENAME is first to execute, ghost table is swapped in place of tbl, then C1..C9, C11..C19, C21..C29 all get to operate on the new and shiny tbl
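
For reference, here is the condensed two-session transcript promised above; it simply strings together the statements from the list, with the processlist check noted as a comment:

-- Connection C10:
CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here';
LOCK TABLES tbl WRITE, tbl_old WRITE;

-- Connection C20 (blocks on the lock, queued ahead of all other waiters):
RENAME TABLE tbl TO tbl_old, ghost TO tbl;

-- Connection C10, after verifying C20's RENAME shows as blocked in the processlist:
DROP TABLE tbl_old;
UNLOCK TABLES;
-- the pending RENAME executes first; ghost is now in place of tbl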

Some notes Continue reading » “Solving the non-atomic table swap, Take III: making it atomic”

Solving the non-atomic table swap, Take II

Following up and improving on Solving the Facebook-OSC non-atomic table swap problem, we present a better, safe solution.

Quick, quickest recap:

We are working on a triggerless online schema migration solution. It is based on an asynchronous approach, similar to the FB OSC, and as opposed to the synchronous solution used by pt-online-schema-change.

We asynchronously synchronize (is that even a valid statement?) between some table tbl and a ghost table ghost, and at some time we want to cut-over: swap the two; kick out tbl and put ghost in its place and under its name.

However, we cannot use the single statement rename tbl to tbl_old, ghost to tbl, because we use the asynchronous approach, where at the time we lock tbl for writes, we still have some events we need to process and apply onto ghost before swapping the two.

And MySQL does not allow a lock tables tbl write; … ; rename tbl to tbl_old, ghost to tbl.

In Solving the Facebook-OSC non-atomic table swap problem we suggested a way that works, unless when it doesn’t work. Read the caveat at the end of the post. Premature death of a connection that participates in the algorithm causes a chain reaction that leads to the premature execution of the rename statement, potentially before we’ve applied those remaining events. This leads to data inconsistency between the old table and the new table, and is unacceptable.

To that effect, we were more inclined to go with the Facebook solution, which makes a two-step: lock tables tbl write; alter table tbl rename to tbl_old; … ; alter table ghost rename to tbl;

This two-step solution is guaranteed not to have data inconsistency. Alas, it also implies an outage. There’s a brief moment in between the two renames, during which we apply those last changes, where the table tbl is simply not there.
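
Restating the two-step as a sketch, with the outage window annotated (the elided part is where those remaining changelog events are applied onto ghost):

lock tables tbl write;
alter table tbl rename to tbl_old;
-- tbl does not exist from here...
-- ... apply remaining changelog events onto ghost ...
alter table ghost rename to tbl;
-- ...until here; queries on tbl issued in this window get a "table not found" error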

Not all applications will fail gracefully on such a scenario. Continue reading » “Solving the non-atomic table swap, Take II”

Solving the Facebook-OSC non-atomic table swap problem

We present a way to use an atomic, blocking table swap in the Facebook Online-Schema-Change solution, as well as in a rumored, other Online-Schema-rumored-Change solution. Update: also a caveat.

Quick recap (really quick)

pt-online-schema-change and facebook-osc are two popular online-schema-change solutions for MySQL. They both use triggers, but in different ways. While the Percona tool uses synchronous table updates, such that any INSERT|UPDATE|DELETE on the modified table causes an INSERT|UPDATE|DELETE on a ghost table, in the Facebook tool they all cause an INSERT into a changelog table, which is then iterated and read, with entries applied onto the ghost table.

The TL;DR is that DMLs on the table propagate synchronously, within the same transaction, in the Percona tool, and asynchronously, with lag, in the Facebook tool.
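
To make the distinction concrete, here is a simplified, hypothetical sketch of the two trigger styles for the INSERT case only, one per tool, shown side by side; real tool-generated triggers have different names, different table/column names, and also cover UPDATE and DELETE:

-- Percona-style, synchronous: the ghost table is modified within the same transaction
CREATE TRIGGER tbl_after_ins AFTER INSERT ON tbl FOR EACH ROW
  REPLACE INTO ghost (id, col) VALUES (NEW.id, NEW.col);

-- Facebook-style, asynchronous: only a changelog row is written, to be applied later
CREATE TRIGGER tbl_after_ins_chg AFTER INSERT ON tbl FOR EACH ROW
  INSERT INTO changelog (dml_type, row_id) VALUES ('INSERT', NEW.id);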

What’s the problem with the table swap?

In the Percona tool, once the logic is satisfied that the copy is complete, we issue this query:

RENAME TABLE tbl TO tbl_old, tbl_new TO tbl;

This is an atomic, two table RENAME operation.

However with the asynchronous nature of the Facebook tool, such a RENAME would be a mistake. We must first block writes to the modified table, then make sure we have iterated the changelog table to the point of lock, apply those changes onto the ghost table, and only then do the swap.

The problem is: you cannot RENAME TABLES while one of them is LOCKed.

This is silly, and inconsistent, because:

> LOCK TABLES tbl WRITE;
Query OK, 0 rows affected (0.00 sec)

> RENAME TABLE tbl TO tbl_old, tbl_new TO tbl;
ERROR 1192 (HY000): Can't execute the given command because you have active locked tables or an active transaction

> ALTER TABLE tbl RENAME TO tbl_old;
Query OK, 0 rows affected (0.00 sec)

Why would the RENAME fail where the ALTER works?

Small thing, but critical to the operation of the online-schema-change. From the Facebook OSC documentation:

Since alter table causes an implicit commit in innodb, innodb locks get released after the first alter table. So any transaction that sneaks in after the first alter table and before the second alter table gets a ‘table not found’ error. The second alter table is expected to be very fast though because copytable is not visible to other transactions and so there is no need to wait.

What the FB solution means

It means for a very brief duration, the table is not there. Your app will get errors.

Of course, we should be able to handle errors anytime, anywhere. But the honest truth is: we (as in the world) do not. Many apps will fail ungracefully should they get a table not found error.

An atomic swap, by comparison, would make for briefly blocking operations, keeping the app ignorant of the swap.

Rumor

Rumor has it that we at GitHub are developing a new, triggerless, Online Schema Change tool. It is rumored to be based off binary logs and is rumored to have lots of interesting rumored implications.

Such rumored implementation would have to be asynchronous by nature, or so rumors say. And as such, it would fall for the same non-atomic table swap problem.

Solution

Once we heard it was rumored we were working on a triggerless online schema change tool, we realized we would have to solve the non-atomic swap problem. What we did was to gossip about it between ourselves, which led to three different rumors of a solution, eventually manifested as three different working solutions. All three solutions make for blocking queries on the app’s side. I will present one of these solutions here, based on voluntary locks. Continue reading » “Solving the Facebook-OSC non-atomic table swap problem”

Orchestrator progress

This comes mostly to reassure, having moved into GitHub: orchestrator development continues.

I will have the privilege of working on this open source solution at GitHub. There are a few directions in which we can take orchestrator, and we will be looking into the possibilities. We will continue to strengthen the crash recovery process, and in fact I’ve got a couple of ideas on drastically shortening Pseudo-GTID recovery time, as well as on other debts. We will look into yet other directions, which we will share. My new and distinguished team will co-work on/with orchestrator and will no doubt provide useful and actionable input.

Orchestrator continues to be open for pull requests, with some temporary latency in response time (it’s the Holidays, mostly).

Some Go(lang) limitations (namely the import path, I’ll blog more about it) will most probably imply some changes to the code, which will be well communicated to existing collaborators.

Most of all, we will keep orchestrator a generic solution, while keeping focus on what we think is most important – and there’s some interesting vision here. Time will tell as we make progress.

 

Joining GitHub

Today was my last day at Booking.com, and shortly I will be joining the team at GitHub.

I’d like to thank the many kind, friendly & smart people I’ve worked with at Booking.com!

The challenges at Booking.com are big. There is such a diversity within the technology stack; even within the database range. A solution that works on all the various Booking.com production environments is something to value. Indeed, the Booking.com production environment is an amazing playground for developers, offering high volume, large numbers, and differing workloads to tackle. Your code just gets hammered down and you get very quick feedback on whether you did it right or wrong.

I was happy to have worked on serious reliability and operational topics, and to have made a meaningful contribution.

Joining GitHub, I’m to be a systems engineer in a great team (friends included), building great products, in and around the database zone, delivering open source, pretty much expecting to do awesome stuff! That, and the swag.