DevOps – code.openark.org http://shlomi-noach.github.io/blog/ Blog by Shlomi Noach Fri, 18 Apr 2014 18:02:49 +0000 en-US hourly 1 https://wordpress.org/?v=5.3.3 32412571 “Anemomaster”: DML visibility. Your must-do for tomorrow https://shlomi-noach.github.io/blog/mysql/anemomaster-dml-visibility-your-must-do-for-tomorrow https://shlomi-noach.github.io/blog/mysql/anemomaster-dml-visibility-your-must-do-for-tomorrow#comments Fri, 18 Apr 2014 11:15:09 +0000 https://shlomi-noach.github.io/blog/?p=6793 Here’s our take of master DML query monitoring at Outbrain (presented April 2014). It took a half-day to code, implement, automate and deploy, and within the first hour of work we managed to catch multiple ill-doing services and scripts. You might want to try this out for yourself.

What’s this about?

What queries do you monitor on your MySQL servers? Many don’t monitor queries at all, and only look up slow queries on occasion, using pt-query-digest. Some monitor slow queries, where Anemometer (relying on pt-query-digest) is a very good tool. To the extreme, some monitor TCP traffic on all MySQL servers — good for you! In between, there’s a particular type of queries that are of special interest: DML (INSERT/UPDATE/DELETE) queries issued against the master.

They are of particular interest because they are only issued once against the master, yet propagate through replication topology to execute on all slaves. These queries have a direct impact on your slave lag and on your overall replication capacity. I suggest you should be familiar with your DMLs just as you are with your slow queries.

In particular, we had multiple occasions in the past where all or most slaves started lagging. Frantically we would go to our metrics; yes! We would see a spike in com_insert. Someone (some service) was obviously generating more INSERTs than usual, at a high rate that the slaves could not keep up with. But, which INSERT was that? Blindly, we would look at the binary logs. Well, erm, what are we looking for, exactly?

Two such occasions convinced us that there should be a solution, but it took some time till it hit us. We were already using Anemometer for monitoring our slow logs. We can do the same for monitoring our binary logs. Thus was born “Anemomaster”.

Quick recap on how Anemometer works: you issue pt-query-digest on your slow logs on all MySQL hosts (we actually first ship the slow logs to a central place where we analyse them; same thing). This is done periodically, and slow logs are then rotated. You throw the output of pt-query-digest to a central database (this is built in with pt-query-digest; it doesn’t necessarily produce human readable reports). Anemometer would read this central database and visualize the slow queries.

Analysing DMLs

But then, pt-query-digest doesn’t only parse slow logs. It can parse binary logs. Instead of asking for total query time, we ask for query count, and on we go to establish the same mechanism, using same pt-query-digest and same Anemometer to store and visualize the DMLs issued on our masters.

When analysing DMLs we’re interested in parsing binary logs — and it makes no sense to do the same on all slaves. All slaves just have same copy of binlog entries as the master produces. It only takes one server to get an accurate picture of the DMLs on your replication topology.

One server could be the master, and this can indeed be done: just FLUSH MASTER LOGS, parse the binary logs with pt-query-digest, and you’re done. But like others, we tend to look at our masters as tender babies. We care for them, and do not wish to overload them unnecessarily. We chose to get the binlog entries from our slaves, instead. We also chose to get the entries from the relay logs, since these are unaffected by slave performance and as long as network is good, we can expect the relay logs to be very up to date. At any given time we have two slaves that take this role (this is automated and verified). On a 10 minute period we would flush the relay logs on these servers, and analyse whatever relay logs we have not analysed as yet.

The script below is a slightly modified version of our own, and should work for the standard installation. Modify to fit your own data (in particular, it assumes relay logs are named mysqld-relay-bin; datadir is specified in /etc/my.cnf, and please don’t ask me how to do this on Windows):

#!/bin/bash

# 
# Digest latest relay logs file, write results to "anemomaster"
#
# This script can run from any machine; it only needs to execute on a single machine per mysql cluster, but for
# analysis availability it should execute on at least two hosts per cluster.
#

DATADIR=`grep datadir /etc/my.cnf|awk -F= '{print $2}'`
TMPDIR=/tmp
DIGESTED_RELAY_LOGS_FILE=${DATADIR}/digested_relay_logs.txt

touch $DIGESTED_RELAY_LOGS_FILE
chown mysql:mysql $DIGESTED_RELAY_LOGS_FILE

hostname=$(hostname)

echo "deleting old relay logs from ${TMPDIR}"
rm ${TMPDIR}/mysqld-relay-bin.[0-9]*

echo "Getting current relay log files"
existing_relay_log_files=$(ls -tr ${DATADIR}/mysqld-relay-bin.[0-9]* | head -n -1)
for existing_relay_log_file in $existing_relay_log_files
do
    cp -u $existing_relay_log_file $TMPDIR
done
echo "flushing relay logs"
/usr/bin/mysql -umyself -psecret -e 'flush relay logs\G;' 2>/dev/null
# sleep because the log file takes some time to disappear
sleep 1

echo "Getting current relay log files"
existing_relay_log_files=$(ls -tr ${DATADIR}/mysqld-relay-bin.[0-9]* | head -n -1)
for existing_relay_log_file in $existing_relay_log_files
do
    cp -u $existing_relay_log_file $TMPDIR
done

cd $TMPDIR
for relay_log_file in mysqld-relay-bin.[0-9]*
do
    # Filter this relay log file, since it's already been digested
    grep $relay_log_file $DIGESTED_RELAY_LOGS_FILE && rm $relay_log_file
done

for relay_log_file in mysqld-relay-bin.[0-9]*
do
    echo "digesting $relay_log_file"
    mysqlbinlog $relay_log_file | /usr/bin/pt-query-digest \
      --type binlog --order-by Query_time:cnt --group-by fingerprint --limit 100 \
      --review  P=3306,u=anemomaster,p=secret,h=anemomaster_host,D=anemomaster,t=global_query_review \
      --history P=3306,u=anemomaster,p=secret,h=anemomaster_host,D=anemomaster,t=global_query_review_history \
      --filter=" \$event->{Bytes} = length(\$event->{arg}) and \$event->{hostname}=\"$(hostname)\" " \
      --no-report
    echo "$relay_log_file" >> $DIGESTED_RELAY_LOGS_FILE
    rm $relay_log_file
done
# make sure the file does not bloat. 20 entries is more than enough.
tail -n 20 $DIGESTED_RELAY_LOGS_FILE > ${TMPDIR}/DIGESTED_RELAY_LOGS_FILE
cat ${TMPDIR}/DIGESTED_RELAY_LOGS_FILE > $DIGESTED_RELAY_LOGS_FILE
echo "done"

As for Anemometer, we patched it to support multiple environments (“clusters”) — but irrelevant to the DML change. If you just want to make it visualize DMLs, here’s the major configuration changes to config.inc.php (marked with bold):

$conf['history_defaults'] = array(
	'output'		=> 'table',
	'fact-group'	=> 'date',
	'fact-order'	=> 'date DESC',
	'fact-limit' => '90',
	'dimension-ts_min_start' => date("Y-m-d H:i:s", strtotime( '-90 day')),
	'dimension-ts_min_end'	=> date("Y-m-d H:i:s"),
	'table_fields' => array('date', 'query_time_avg','ts_cnt','Query_time_sum')
);

$conf['report_defaults'] = array(
	'fact-group'	=> 'checksum',
	'fact-order'	=> 'ts_cnt DESC',
	'fact-limit' => '20',
	'dimension-ts_min_start' => date("Y-m-d H:i:s", strtotime( '-1 day')),
	'dimension-ts_min_end'	=> date("Y-m-d H:i:s"),
	'table_fields' => array('checksum','snippet', 'query_time_avg','ts_cnt','Query_time_sum'),
	'dimension-pivot-hostname_max' => null,
	'dimension-pivot-clustername_max' => null
);

$conf['graph_defaults'] = array(
	'fact-group'	=> 'minute_ts',
	'fact-order'	=> 'minute_ts',
	'fact-limit' => '',
	'dimension-ts_min_start' => date("Y-m-d H:i:s", strtotime( '-7 day')),
	'dimension-ts_min_end'	=> date("Y-m-d H:i:s"),
	'table_fields' => array('minute_ts'),
	// hack ... fix is to make query builder select the group and order fields,
	// then table fields only has to contain the plot_field
	'plot_field' => 'ts_cnt',
);

With a 10 minute rotation & digestion, we are able to analyze near real-time what’s been done on our masters. If we see a spike in com_insert/com_update/com_delete, or just see slave lags, we turn to Anemomaster and within a couple minutes know exactly what service is guilty of abusing our database. We are also working to protect our database against abuse, but that’s for another discussion.

]]>
https://shlomi-noach.github.io/blog/mysql/anemomaster-dml-visibility-your-must-do-for-tomorrow/feed 5 6793
Speaking at Percona Live: common_schema, MySQL DevOps https://shlomi-noach.github.io/blog/mysql/speaking-at-perconalive-common_schema-mysql-devops https://shlomi-noach.github.io/blog/mysql/speaking-at-perconalive-common_schema-mysql-devops#respond Mon, 10 Mar 2014 08:27:42 +0000 https://shlomi-noach.github.io/blog/?p=6769 In less than a month I’ll be giving these two talks at Percona Live:

If you are still unfamiliar with common_schema, this will make for a good introduction. I’ll give you multiple reasons why you would want to use it, and how it would come to immediate use at your company. I do mean immediate, as in previous common_schema presentations I happened to get feedback emails from attendees within the same or next day letting me know how common_schema solved an insistent problem of theirs or how it exposed an unknown status.

I’ll review some useful views & routines, and discuss the ease and power of QueryScript. common_schema is a Swiss-knife of solutions, and all from within your MySQL server.

I am using common_schema in production on a regular basis, and it happened to be hero of the day in multiple occasions. I’ll present a couple such cases.

This is a technical talk touching at some cultural issues.

At Outbrain, where I work, we have two blessings: a large group of engineers and a large dataset. We at the infrastructure team, together with the ops team, are responsible for the availability of the data. What we really like is technology which lets the owners of a problem be able to recognize it and take care of it. We want ops guys to do ops, and engineers to do engineering. And we want them to be able to talk to each other and understand each other.

What tools can you use to increase visibility? To allow sharing of data between the teams? I’ll share some tools and techniques that allow us to automate deployments, detect a malfunctioning/abusing service, deploy schema changes across dozens of hosts, control data retention, monitor connections, and more.

We like open source. The tools discussed are mostly open source, or open sourced by Outbrain.

I’ll explain why these tools matter, and how they serve the purpose of removing friction between teams, allowing for quick analysis of problems and overall visibility on all things that happen.

Do come by!

]]>
https://shlomi-noach.github.io/blog/mysql/speaking-at-perconalive-common_schema-mysql-devops/feed 0 6769