openark forge

Open source utilities
 

Alerts

mycheckpoint provides with conditional alerts checking and notification. It allows for:

  • User defined conditions to follow (Is the slave replicating too far behind? Is the datadir mountpoint above 90% utilization? Are there too little queries?)
  • Automatic email notifications on condition failure
  • Single/recurring email notifications on condition failure

mycheckpoint is SQL oriented, and in that spirit, the conditions are nothing but SQL conditions.

mycheckpoint currently supports conditions based on sampled data, as is found in the sv_report_sample view. This view is also used for generating human report and charts. It provides with aggregated and calculated data.

Alert conditions are located in the alert_condition table. To add a new condition, INSERT a row into alert_condition. This is the brief table definition:

mysql> DESC alert_condition;
+-------------------------------+---------------------------------------------------+------+-----+---------+----------------+
| Field                         | Type                                              | Null | Key | Default | Extra          |
+-------------------------------+---------------------------------------------------+------+-----+---------+----------------+
| alert_condition_id            | int(10) unsigned                                  | NO   | PRI | NULL    | auto_increment |
| enabled                       | tinyint(1)                                        | NO   |     | 1       |                |
| condition_eval                | varchar(4095)                                     | NO   |     | NULL    |                |
| monitored_host_condition_eval | varchar(4095)                                     | NO   |     | NULL    |                |
| description                   | varchar(255)                                      | YES  |     | NULL    |                |
| error_level                   | enum('debug','info','warning','error','critical') | NO   |     | error   |                |
| alert_delay_minutes           | smallint(5) unsigned                              | NO   |     | 0       |                |
| repetitive_alert              | tinyint(1)                                        | NO   |     | 0       |                |
+-------------------------------+---------------------------------------------------+------+-----+---------+----------------+

Description:
  • enabled: a boolean value. 0 disables the condition: it will not be checked. 1 (default) enables it.
  • condition_eval: an SQL condition to evaluate (e.g. ‘seconds_behind_master IS NULL’. See following examples). For the condition to be raised, it must evaluate to a nonzero, not NULL integer value. A result of 0 or NULL means the condition is not met and that there is no problem.
  • monitored_host_condition_eval: an SQL condition to evaluate on the monitored server; typically this would be a call to a stored function on the monitored server. Same rules apply as for condition_eval.
  • description: a human readable explanation of the condition. This will be presented in various views and in email notifications.
  • error_level: condition severity. Defaults to ‘Error’.
  • alert_delay_minutes: how much time must pass with continuous met condition, before it is considered as a problem and becomes candidate for notification.
  • repetitive_alert: a boolean value. 1 means an email notification will include this condition whenever the condition is met. 0 (default) means an email notification is only set once in any continuous meeting of this condition.

Examples

We wish to raise an alert when the slave does not replicate. The alert should be raised immediately.

INSERT INTO alert_condition (condition_eval, description, alert_delay_minutes)
  VALUES ('seconds_behind_master IS NULL', 'Slave not replicating', 0);

Same as above, using defaults:

INSERT INTO alert_condition (condition_eval, description)
  VALUES ('seconds_behind_master IS NULL', 'Slave not replicating');

We wish to raise an alert when the slave lags more than 60 seconds behind the master. Only raise this alert if this occurs for continuous sampling over 10 minutes.

INSERT INTO alert_condition (condition_eval, description, alert_delay_minutes)
  VALUES ('seconds_behind_master > 60', 'Slave lags too far behind', 10);

As before. But, since we are known to have daily maintenance during night hours, do not raise an alert at that time:

INSERT INTO alert_condition (condition_eval, description, alert_delay_minutes)
  VALUES ('(seconds_behind_master > 60) AND (HOUR(ts) NOT BETWEEN 2 AND 4)', 'Slave lags too far behind', 10);

We wish to raise an alert when the datadir mountpoint exceeds 90% disk space utilization (Linux only), and remains so for 30 minutes. We get one email notification when this first happens.

INSERT INTO alert_condition (condition_eval, description, alert_delay_minutes)
  VALUES ('os_datadir_mountpoint_usage_percent > 90', 'datadir mount point is over 90%', 30);

We wish to raise a recurring alert when the datadir mountpoint exceeds 95% disk space utilization. Keep emailing until this is resolved!

INSERT INTO alert_condition (condition_eval, description, repetitive_alert)
  VALUES ('os_datadir_mountpoint_usage_percent > 95', 'datadir mount point is over 95%', 1);

9am is our peak time. We know to expect at least 50 SELECT per second in average, or else this indicates a failure in our application code:

INSERT INTO alert_condition (condition_eval, description)
  VALUES ('(com_select_psec < 50) AND (HOUR(ts) = 9)', 'Too few SELECTs issued during peak time');

OS load average (Linux only) is too high for too long:

INSERT INTO alert_condition (condition_eval, description, alert_delay_minutes)
  VALUES ('os_loadavg > 10', 'Load average is too high', 30);

Email notifications

An example email notification is of the form:

Database alert: mycheckpoint

This is an alert mail sent by mycheckpoint, monitoring your mycheckpoint MySQL database.
The following problems have been found:

ERROR: Slave not replicating (pending id: 1)
 This error alert is pending for 25 minutes, since 2010-03-22 08:05:05
---
Resolved: Too few SELECTs (pending id: 4)

You are receiving this email from a mycheckpoint -- MySQL monitoring utility -- installation.
Please consult your system or database administrator if you do not know why you got this mail.
-------
mycheckpoint home page: http://code.openark.org/forge/mycheckpoint

In the above email one error is still pending, while another is notified as being resolved.

Reviewing alerts

mycheckpoint will evaluate all enabled alert conditions at each sampling of data (which means on each execution of mycheckpoint).

Consult alert_condition_query_view to understand which checks are performed and how they evaluate:

mysql> SELECT query FROM alert_condition_query_view \G
*************************** 1. row ***************************
query: SELECT seconds_behind_master > 60 AS condition_1 ,seconds_behind_master IS NULL AS condition_2 ,os_root_mountpoint_usage_percent > 90 AS condition_3 ,os_datadir_mountpoint_usage_percent > 90 AS condition_4 ,os_tmpdir_mountpoint_usage_percent > 90 AS condition_5 FROM mycheckpoint.sv_report_sample
ORDER BY id DESC
LIMIT 1;

Conditions which are met wre written to the alert table. This table serves as a history table for all alerts. It is automaticaly purged according to existing recorded data.

The alert table is nothing but a many-to-many connecting table. To get a history brief, consult the alert_view view:

mysql> select * from alert_view;
+--------------------+---------------------+-------------------------------+-----------------------+-------------+---------------------+
| alert_condition_id | sv_report_sample_id | condition_eval                | description           | error_level | ts                  |
+--------------------+---------------------+-------------------------------+-----------------------+-------------+---------------------+
|                  2 |                 471 | seconds_behind_master IS NULL | Slave not replicating | error       | 2010-03-15 09:39:45 |
|                  2 |                 472 | seconds_behind_master IS NULL | Slave not replicating | error       | 2010-03-15 09:39:49 |
|                  2 |                 480 | seconds_behind_master IS NULL | Slave not replicating | error       | 2010-03-15 10:41:33 |
|                  2 |                 486 | seconds_behind_master IS NULL | Slave not replicating | error       | 2010-03-16 07:49:05 |
|                  2 |                 487 | seconds_behind_master IS NULL | Slave not replicating | error       | 2010-03-17 07:10:35 |
|                  2 |                 488 | seconds_behind_master IS NULL | Slave not replicating | error       | 2010-03-17 07:10:36 |
+--------------------+---------------------+-------------------------------+-----------------------+-------------+---------------------+

The alert_pending table stores data for pending alerts: those alerts which are raised at current.

Consult alert_pending_view to get a report on all pending alerts:

mysql> SELECT * FROM alert_pending_view;
+------------------+--------------------+-------------------------------+-----------------------+-------------+---------------------+---------------------+---------------------+-----------------+----------+-------------+---------------------+------------------+
| alert_pending_id | alert_condition_id | condition_eval                | description           | error_level | alert_delay_minutes | ts_start            | ts_end              | elapsed_minutes | in_error | is_notified | ts_notified         | repetitive_alert |
+------------------+--------------------+-------------------------------+-----------------------+-------------+---------------------+---------------------+---------------------+-----------------+----------+-------------+---------------------+------------------+
|                1 |                  2 | seconds_behind_master IS NULL | Slave not replicating | error       |                   0 | 2010-03-15 09:39:45 | 2010-03-17 07:10:36 |            2730 |        1 |           1 | 2010-03-16 07:49:08 |                1 |
+------------------+--------------------+-------------------------------+-----------------------+-------------+---------------------+---------------------+---------------------+-----------------+----------+-------------+---------------------+------------------+

Consult alert_email_message_items_view to learn what kind of email notification you should expect:

mysql> SELECT * FROM alert_email_message_items_view \G
*************************** 1. row ***************************
alert_pending_id: 1
message_item: ERROR: Slave not replicating
 This error alert is pending for 2730 minutes, since 2010-03-15 09:39:45

Use alert_pending_html_view to get HTML report on pending alerts:

mysql> SELECT html FROM alert_pending_html_view \G

See sample report.

Disabling alerts

Execute:

mycheckpoint --skip-emails

To avoid sending email notifications. Alert conditions will be evaluated and alerts will be recorded.

Execute:

mycheckpoint --skip-alerts

To completely ignore the alert conditions. No conditions are evaluated nor are any alerts recorded.

 
Powered by Wordpress and MySQL. Theme by openark.org