Reducing my OSS involvement, and how it affects orchestrator & gh-ost

I’m going to bring down my work volume around OSS to a minimum, specifically when it comes to orchestrator and gh-ost. This post explains the whats and hows so that users are as informed as possible. TL;DR: for a period of time I will not respond to issues, will not review pull requests, will not produce releases, and will not answer on mailing lists. That period of time is undefined. It could be as short as a few weeks, it could be months or more; it is unknown.

The “What”

Both orchestrator and gh-ost are popular tools in the MySQL ecosystem. They enjoy widespread adoption and are known to be used at prominent companies. Time and again I learn of more users of these projects. I used to keep a show-off list, but have since lost track.

With wide adoption comes community engagement. This comes in the form of questions (“How do I…”, “Why does this not work…”, “Is it possible to…”), issues (crashing or data integrity bugs, locking issues, performance issues, etc.), suggestions (support this or that) and finally pull requests.

At this time, there are multiple engagements per day. Between these two projects, I estimate that addressing those user interactions amounts to more than a full-time job. That’s a full-time job’s volume on top of an already existing full-time job.

Much of this work went on my employer’s time, but I have other responsibilities at work, too, and there is no room for full-time-plus work on these projects. Responding to all community requests is unsustainable and futile. Some issues are left unanswered. Some pull requests are left open.

Even more demanding than time is context. To address a user’s bug report I’d need to re-familiarize myself with 5-year-old code. That takes a toll in time, but also in memory and context switching. As community interaction goes, a simple discussion on an issue can span multiple days. During those days I’d jump in and out of context. With multiple daily engagements this would mean re-familiarizing myself with different areas of the code; being able to justify a certain behavior, or having good arguments as to why we should or should not change it; being able to simulate a scenario in my brain (I don’t have access to users’ environments); and comprehending potential scenarios and understanding what could break as a result of what change. I don’t have, and can’t practically have, the tests to cover the myriad of scenarios, deployments, software, network and overall infrastructure in all users’ environments.

Even if I set designated time for community work, this still takes a toll on my daily tasks. The need to keep a mental projection in my brain of all that’s open and all that’s to come makes it harder to free my mind and work on a new problem, to really immerse myself in thought, to create something new.

When? For how long?


The problem with MySQL foreign key constraints in Online Schema Changes

This post explains the inherent problem of running online schema changes in MySQL, on tables participating in a foreign key relationship. We’ll lay some ground rules and facts, sketch a simplified schema, and dive into an online schema change operation.

Our discussion applies to pt-online-schema-change, gh-ost, and Vitess-based migrations, or any other online schema change tool that works with a shadow/ghost table, like the Facebook tools.

Why Online Schema Change?

Online schema change tools come as workarounds to an old problem: schema migrations in MySQL were blocking, uninterruptible, aggressive in resources, and replication unfriendly. Running a straight ALTER TABLE in production means locking your table, generating high load on the primary, and causing massive replication lag on replicas once the migration moves down the replication stream.

Isn’t there some Online DDL?

Yes. InnoDB supports Online DDL, where for many ALTER types, your table remains unblocked throughout the migration. That’s an important improvement, but unfortunately not enough. Some migration types do not permit concurrent DML (notably changing a column’s data type, e.g. from INT to BIGINT). The migration is still aggressive and generates high load on your server. Replicas still run the migration sequentially. If your migration takes 5 hours to run concurrently on the primary, expect a 5-hour replication lag on your replica, i.e. complete loss of your fresh read capacity.

Isn’t there some Instant DDL?

Yes, but unfortunately it is extremely limited, mostly just to adding a new column. Instant DDL showed great promise when introduced (contributed to MySQL by the Tencent Games DBA Team) three years ago, and the hope was that MySQL would support many more types of ALTER TABLE as INSTANT DDL. At this time this has not happened yet, and we make do with what we have.
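
To make the Online DDL and Instant DDL options concrete, here is a minimal sketch of how one might spell out the intended algorithm in the ALTER TABLE statement itself. The ALGORITHM and LOCK clauses are MySQL syntax; the helper `build_alter`, the table `orders`, and its columns are hypothetical names used only for illustration. Stating the algorithm explicitly is useful because MySQL rejects the statement with an error when the requested algorithm is not supported for that change, rather than quietly falling back to a heavier one.

```python
def build_alter(table, change, algorithm=None, lock=None):
    """Build an ALTER TABLE statement with optional ALGORITHM/LOCK clauses.

    Hypothetical helper: it only assembles the SQL text, it does not run it.
    """
    clauses = [change]
    if algorithm:
        clauses.append(f"ALGORITHM={algorithm}")
    if lock:
        clauses.append(f"LOCK={lock}")
    return f"ALTER TABLE {table} " + ", ".join(clauses)


# Instant DDL: adding a column (table and column names are made up).
instant = build_alter("orders", "ADD COLUMN note VARCHAR(64)", algorithm="INSTANT")

# Online DDL: adding an index in place while allowing concurrent reads and writes.
online = build_alter(
    "orders", "ADD INDEX idx_created (created_at)", algorithm="INPLACE", lock="NONE"
)

print(instant)
print(online)
```

Note that even when the primary accepts an in-place ALTER, the points above about server load and replicas applying the change sequentially still hold.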

Not everyone is Google or Facebook scale, right?

True. But you don’t need to be at Google, Facebook, or GitHub scale to feel the pain of schema changes. Any non-trivially sized table takes time to ALTER, which results in locking/downtime. If your tables are limited to hundreds or mere thousands of small rows, you can get away with it. When your table grows, and mere dozens of MB of data are enough, ALTER becomes non-trivial at best, and in my experience an outright cause of outage in a common scenario.

Let’s discuss foreign key constraints

In the relational model, tables have relationships. A column in one table points to a column in another table, so that a row in one table has a relationship with one or more rows in another table. That’s the “foreign key”. A foreign key constraint is the enforcement of that relationship. A foreign key constraint is a database construct which watches over rows in different tables and ensures the relationship does not break. For example, it may prevent me from deleting a row that is in a relationship, to prevent the related row(s) from becoming orphaned.
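
The orphan-prevention behavior can be seen in a small, self-contained sketch. As an assumption for the sake of a runnable example, it uses Python’s built-in sqlite3 module rather than MySQL, and the `parent` and `child` tables are hypothetical. Syntax and defaults differ between SQLite and MySQL/InnoDB, but the idea is the same: a parent row cannot be deleted while a child row still references it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: each child row references one parent row.
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE child ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER NOT NULL REFERENCES parent(id))"
)
conn.execute("INSERT INTO parent (id) VALUES (1)")
conn.execute("INSERT INTO child (id, parent_id) VALUES (10, 1)")


def try_delete_parent(conn, parent_id):
    """Return True if the parent row was deleted, False if the constraint blocked it."""
    try:
        conn.execute("DELETE FROM parent WHERE id = ?", (parent_id,))
        return True
    except sqlite3.IntegrityError:
        return False


# The delete is rejected because child row 10 would become orphaned.
blocked = not try_delete_parent(conn, 1)
print("delete blocked:", blocked)
```

Once the referencing child row is removed, the same delete goes through, which is the constraint doing exactly the watching-over described above.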