Monitoring DML/slow queries with graphite

pt-query-digest, Anemometer or “Anemomaster” do a great job of analysing your queries and giving you visibility into what’s going on with your MySQL servers. However, the place where the query digests are written is just some MySQL tables on some server. Do you have monitoring/alerts on that table? How will you verify a specific query does not exceed some runtime/execution count threshold, and get notified when it does?

At Outbrain we use Graphite to collect almost all of our data. We like it for its simplicity and for the fact it has a “push” strategy as opposed to “pull” strategy: every service/server/collectd writes (pushes) its own data to Graphite, as opposed to having some centralized monitoring service trying to pull data from thousands of servers & services. We also have a great Graphite dashboard (developed at our company by Erez Mazor) called graphitus, which is a very sophisticated and easily configurable visualization solution (see documentation).

Our love/hate relationship with Nagios boil down to having a single Nagios plugin: one that reads data from Graphite. We use Nagios to generate our alerts, and dream of the day we will substitute it with something else (there’s not too much love in this love/hate relationship).

Graphite is a numeric timeseries data monitoring solution. How do you throw MySQL query analysis into Graphite, then?

The answer lies within the flexible structure of a Graphite metric entry, which is a freely composed path, such as collectd.hosts.us-east.myhost01.mysql.gauge-Threads_running.value. Graphite does not require you to pre-define paths, and you can use anything that makes sense to you. Thus, you can use a slow query’s text, for example, as part of the Graphite entry path. This is not entirely simple as the graphite path limits the allowed characters. So this is what we do:

Any query that is written to Graphite is transformed into a “canonical form”. We strip it of excessive information and write enough of it that still makes sense to us. Actually, we found out that we usually do well with just the bare bones of “what type of query this is and what tables are involved”. For better drill down we then go to Anemometer/Anemomaster. Hence, the canonical form of the following query:

UPDATE my_documents SET document_owner='Wallace'  WHERE document_domain='Gromit'

is simply

update_my_documents

Thankfully the pt-query-digest report tables are already timestamp based, and are already aggregated by query “fingerprints”. This makes writing this data to graphite just a matter of text normalizing. The following script is a slightly modified version of what we use. Do note that we have the notion of “clustername” which is the name of the replication topology we’re looking at. We have many topologies, like OLTP, OLAP, Metadata, etc. etc. We support this notion by adding a clustername_max column to the report tables and instructing pt-query-digest fill in this value.

We run the following shell script by cron every 10 minutes (based on the 10 minute interval of analysing our masters’ DML):

#!/bin/bash

#
# This script should run on the anemomaster machine every 10 minutes, shortly after
# binary logs / relay logs are analyzed via pt-query-digest.
#

unixtime=$(date +%s)

# Get stats for the last round 10 minutes
# The query only takes one representative from each cluster
query=" select clustername_max, sum(ts_cnt), replace(fingerprint, '\n', ' ') from global_query_review_history join global_query_review using (checksum), (select date(now()) + interval hour(now()) hour + interval (minute(now()) div 10 *10) minute as search_to_timestamp) as search_to_timestamp_sel where ts_min >= search_to_timestamp - interval 10 minute and ts_min < search_to_timestamp and hostname_max in ( select min(hostname_max) from global_query_review_history where ts_min >= search_to_timestamp - interval 10 minute and ts_min < search_to_timestamp group by clustername_max) group by clustername_max, fingerprint order by sum(ts_cnt) desc "

mysql -umyself -psecret anemomaster --silent --silent --raw -e "$query" | while IFS=$'\t' read -r -a result_values
    do
        fingerprint_cluster=${result_values[0]} ;
        fingerprint_count=${result_values[1]} ;
        fingerprint_query=${result_values[2]} ;
        fingerprint_query=$(echo $fingerprint_query | sed -r -e "s/^(-- .*)]//g")
        fingerprint_query=$(echo $fingerprint_query | tr '\n' ' ' | tr '\r' ' ' | tr '\t' ' ')
        fingerprint_query=${fingerprint_query%%(*}
        fingerprint_query=${fingerprint_query%%,*}
        fingerprint_query=${fingerprint_query%% set *}
        fingerprint_query=${fingerprint_query%% SET *}
        fingerprint_query=${fingerprint_query%% where *}
        fingerprint_query=${fingerprint_query%% WHERE *}
        fingerprint_query=${fingerprint_query%% join *}
        fingerprint_query=${fingerprint_query%% JOIN *}
        fingerprint_query=${fingerprint_query%% using *}
        fingerprint_query=${fingerprint_query%% USING *}
        fingerprint_query=${fingerprint_query%% select *}
        fingerprint_query=${fingerprint_query%% SELECT *}
        fingerprint_query=$(echo $fingerprint_query | tr -d "\`")
        fingerprint_query=$(echo $fingerprint_query | tr -d "*")
        fingerprint_query=$(echo $fingerprint_query | tr -d "?")
        fingerprint_query=$(echo $fingerprint_query | tr " " "_")
        fingerprint_query=$(echo $fingerprint_query | tr "." "__")
        echo "data.mysql.dml.${fingerprint_cluster}.${fingerprint_query}.count ${fingerprint_count} $unixtime" | nc -w 1 my.graphite.server 2003
    done

If you don’t need the “clustername stuff”, modify the query to read:

select 'mysql' as clustername_max, sum(ts_cnt), replace(fingerprint, '\n', ' ') from global_query_review_history join global_query_review using (checksum), (select date(now()) + interval hour(now()) hour + interval (minute(now()) div 10 *10) minute as search_to_timestamp) as search_to_timestamp_sel where ts_min >= search_to_timestamp - interval 10 minute and ts_min < search_to_timestamp and hostname_max in ( select min(hostname_max) from global_query_review_history where ts_min >= search_to_timestamp - interval 10 minute and ts_min < search_to_timestamp) group by fingerprint order by sum(ts_cnt) desc

The graphite metric path will look like data.mysql.dml.oltp.update_my_documents.count, which makes for a perpefctly valid metric to monitor, graphically visualize and get alerts on.

Monitoring DML/slow queries with graphite

One thought on “Monitoring DML/slow queries with graphite”

Leave a Reply