Outbrain's orchestrator Version 1.0.4 is released.
orchestrator now does a much better visualization of Master-Master replication:
This is a brief list of security tips for MySQL. It is by no means complete.
mysql> CALL common_schema.eval("SELECT sql_block_account FROM sql_accounts WHERE USER = 'root'");
To move slaves around the topology (repoint a slave to a different master) through orchestrator's web interface, we use Drag and Drop,
If you're running more than a few slaves in a replication topology, you might choose to use deeply nested replication: slaves replicating from other slaves at 2, 3 or even 4 levels. There are pros and cons to such topologies, discussed below.
A simple, small deep nested topology is depicted below (it is also a real production topology of ours):
Two slaves, srv-4 and srv-8 act as local masters to yet other slaves. Why would we want to have this complexity?
Here's another topology; DC info is not depicted in this image, but you can guess what designated masters we have:
We're migrating some of our "vanilla" MySQL 5.5 servers to Percona Server 5.5. One of the major incentives is the crash-safe replication feature, allowing slaves to die (power failure) and resume replication without losing position in relay logs.
Whether or not we will migrate all our servers depends on further benchmarking; so far we've noticed unexpected results, but these are still premature to publish.
However the fact that we are using both MySQL & Percona Server has led us into a peculiar situation which I'd like to share. We reseed our servers via LVM snapshots. If we need a new machine, or have a corrupted slave, we capture an image of a running slave and duplicate it, a process which takes the better part of a day. This duplicates not only the data, of course, but also the relay logs, the relay-log.info file, master.info file, implying the position within the topology.
With crash-safe replication this also means the transactional relay log position. Recap: crash-safe replication writes, per transaction, the relay log status into the ibdata1 file. So the relay log info in ibdata1 is in perfect alignment with your committed transactions. Upon server startup, Percona Server reads the info from ibdata1 and overwrites the relay-log.info file (it completely disregards whatever was in that file prior to startup).
Can you guess what could go wrong here? Here's the scenario we had; the same problem can unfold in different scenarios.
Take a look at the following topology:
(this image is an actual online visualization of a replication topology; for purposes of this blog it's a sandbox topology on my laptop. Please stand by for some very cool open source release announcement shortly)
Working on some replication topology automation, here are some rules that will prevent you from replicating from a MySQL server M to a slave S:
[EDIT: the above is configuration-wise]
Did I miss anything? Please comment below.
I'm experimenting with upgrading to MySQL 5.6 and am experiencing an unexplained increase in disk I/O utilization. After discussing this with several people I'm publishing in the hope that someone can shed some light on this.
We have a few dozen servers in a normal replication topology. On this particular replication topology we've already evaluated that STATEMENT based replication is faster than ROW based replication, and so we use SBR. We have two different workloads on our slaves, applied by two different HAProxy groups, on three different data centres. Hardware-wise, servers of the two groups use either Virident SSD cards or normal SAS spindle disks.
Our servers are I/O bound. A common query used by both workloads looks up data that does not necessarily have a hotspot, and is very large in volume. DML is low, and we only have a few hundred statements per second executed on master (and propagated through replication).
We have upgraded 6 servers from all datacenters to 5.6, both on SSD and spindle disks, and are experiencing the following phenomena:
pt-query-digest, Anemometer or "Anemomaster" do a great job of analysing your queries and giving you visibility into what's going on with your MySQL servers. However, the place where the query digests are written is just some MySQL tables on some server. Do you have monitoring/alerts on that table? How will you verify a specific query does not exceed some runtime/execution count threshold, and get notified when it does?
At Outbrain we use Graphite to collect almost all of our data. We like it for its simplicity and for the fact it has a "push" strategy as opposed to "pull" strategy: every service/server/collectd writes (pushes) its own data to Graphite, as opposed to having some centralized monitoring service trying to pull data from thousands of servers & services. We also have a great Graphite dashboard (developed at our company by Erez Mazor) called graphitus, which is a very sophisticated and easily configurable visualization solution (see documentation).
Our love/hate relationship with Nagios boils down to having a single Nagios plugin: one that reads data from Graphite. We use Nagios to generate our alerts, and dream of the day we will substitute it with something else (there's not too much love in this love/hate relationship).
Graphite is a numeric timeseries data monitoring solution. How do you throw MySQL query analysis into Graphite, then?
The answer lies within the flexible structure of a Graphite metric entry, which is a freely composed path, such as collectd.hosts.us-east.myhost01.mysql.gauge-Threads_running.value. Graphite does not require you to pre-define paths, and you can use anything that makes sense to you. Thus, you can use a slow query's text, for example, as part of the Graphite entry path. This is not entirely simple, as a Graphite path limits the allowed characters. So this is what we do:
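To illustrate the idea of using query text as part of a metric path, here is a minimal Python sketch. It is written under assumptions and is not the actual production code: the `mysql.queries` prefix, the function names, and the choice to collapse every character outside letters, digits, underscore and hyphen into an underscore are all hypothetical. The output line follows Graphite's plaintext protocol, which is `<path> <value> <timestamp>` terminated by a newline.

```python
import re
import time


def sanitize_path_component(text, max_len=80):
    """Turn free text (e.g. a query) into a single Graphite path node.

    Assumption: anything outside [A-Za-z0-9_-] is collapsed into one
    underscore. Dots are replaced too, so the text cannot introduce
    extra levels into the metric path.
    """
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", text.strip())
    return cleaned.strip("_")[:max_len]


def graphite_line(prefix, query_text, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    path = f"{prefix}.{sanitize_path_component(query_text)}"
    return f"{path} {value} {timestamp}\n"


if __name__ == "__main__":
    # Hypothetical prefix and query, for illustration only.
    print(graphite_line("mysql.queries", "SELECT * FROM t WHERE id = ?", 42), end="")
```

Note that this kind of sanitizing is lossy: two different queries may map to the same path node, so a real setup would likely want some disambiguation (a checksum suffix, for instance).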
Here's our take on master DML query monitoring at Outbrain (presented April 2014). It took a half-day to code, implement, automate and deploy, and within the first hour of work we managed to catch multiple ill-doing services and scripts. You might want to try this out for yourself.
What queries do you monitor on your MySQL servers? Many don't monitor queries at all, and only look up slow queries on occasion, using pt-query-digest. Some monitor slow queries, where Anemometer (relying on pt-query-digest) is a very good tool. To the extreme, some monitor TCP traffic on all MySQL servers -- good for you! In between, there's a particular type of query that is of special interest: DML (INSERT/UPDATE/DELETE) queries issued against the master.
They are of particular interest because they are only issued once against the master, yet propagate through replication topology to execute on all slaves. These queries have a direct impact on your slave lag and on your overall replication capacity. I suggest you should be familiar with your DMLs just as you are with your slow queries.
In particular, we had multiple occasions in the past where all or most slaves started lagging. Frantically we would go to our metrics; yes! We would see a spike in com_insert. Someone (some service) was obviously generating more INSERTs than usual, at a high rate that the slaves could not keep up with. But, which INSERT was that? Blindly, we would look at the binary logs. Well, erm, what are we looking for, exactly?
Two such occasions convinced us that there should be a solution, but it took some time till it hit us. We were already using Anemometer for monitoring our slow logs. We can do the same for monitoring our binary logs. Thus was born "Anemomaster".
Quick recap on how Anemometer works: you issue pt-query-digest on your slow logs on all MySQL hosts (we actually first ship the slow logs to a central place where we analyse them; same thing). This is done periodically, and slow logs are then rotated. You throw the output of pt-query-digest to a central database (this is built in with pt-query-digest; it doesn't necessarily produce human readable reports). Anemometer would read this central database and visualize the slow queries.
But then, pt-query-digest doesn't only parse slow logs. It can parse binary logs. Instead of asking for total query time, we ask for query count, and on we go to establish the same mechanism, using the same pt-query-digest and the same Anemometer to store and visualize the DMLs issued on our masters.
When analysing DMLs we're interested in parsing binary logs -- and it makes no sense to do the same on all slaves. All slaves just have the same copy of the binlog entries that the master produces. It only takes one server to get an accurate picture of the DMLs on your replication topology.
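The mechanism described above can be sketched roughly as follows. This is a minimal Python sketch under assumptions, not our actual scripts: the host, schema and table names in the DSNs are hypothetical, and the `--history` option is named `--review-history` in older Percona Toolkit versions. pt-query-digest parses the text output of mysqlbinlog (via `--type binlog`), and `--order-by Query_time:cnt` ranks queries by count rather than by total time.

```python
import subprocess


def binlog_digest_commands(binlog_path, review_dsn, history_dsn):
    """Build the two commands of a `mysqlbinlog | pt-query-digest` pipeline."""
    # mysqlbinlog decodes the binary log into text that pt-query-digest can parse.
    decode = ["mysqlbinlog", binlog_path]
    digest = [
        "pt-query-digest",
        "--type", "binlog",              # input is mysqlbinlog output, not a slow log
        "--order-by", "Query_time:cnt",  # rank by execution count
        "--review", review_dsn,          # central table of query fingerprints
        "--history", history_dsn,        # per-run statistics table
        "--no-report",                   # store results only, no human readable report
        "--limit", "100%",               # keep all queries, not just the top ones
    ]
    return decode, digest


def run_digest(decode, digest):
    """Run the pipeline; pt-query-digest reads the decoded binlog from STDIN."""
    p1 = subprocess.Popen(decode, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(digest, stdin=p1.stdout)
    p1.stdout.close()  # let mysqlbinlog receive SIGPIPE if the digest exits early
    return p2.wait()


if __name__ == "__main__":
    # Hypothetical paths and DSNs; this only prints the commands.
    decode, digest = binlog_digest_commands(
        "/var/lib/mysql/mysql-bin.000123",
        "h=anemometer-db,D=anemomaster,t=global_query_review",
        "h=anemometer-db,D=anemomaster,t=global_query_review_history",
    )
    print(" ".join(decode), "|", " ".join(digest))
```

Running this against a single server per topology is enough, for the reason given above: every server carries the same binlog events.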
While on the plane back home I wrote down all my impressions from Percona Live 2014. Have lots of TODOs and great ideas to implement. Among all my impressions, there were a few anecdotes worth noting.