Those oversized, undersized variables defaults

Some mysqld parameters are far from having reasonable defaults. Most notable are the engine-specific values, and in particular the InnoDB parameters.

Some of these variables have different defaults as of MySQL 5.4. innodb_buffer_pool_size, for example, is 128M on 5.4. innodb_log_file_size, however, has changed back and forth, as far as I understand, and is down to 5M again. These settings are still the same on 5.5.

I wish to present some not-so-obvious parameters which, in my opinion, have poor defaults, for reasons I will explain; a sample configuration sketch follows the list.

  • group_concat_max_len: This parameter limits the maximum text length of a GROUP_CONCAT concatenation result. It defaults to 1024. I think this is a very low value. I have been using GROUP_CONCAT more and more, recently, to solve otherwise difficult problems. And in most cases, 1024 was just too low, resulting in silent (Argh!) truncation of the result, thus returning incorrect results. It is interesting to learn that the maximum value for this parameter is limited by max_allowed_packet. I would suggest, then, that this parameter be removed altogether, with the max_allowed_packet limitation as the only limitation. Otherwise, I’d like it to have a very large default value, on the order of a few MB.
  • wait_timeout: Here’s a parameter whose default value is overly permissive. wait_timeout enjoys an 8-hour default. I usually go for 5-10 minutes. I don’t see the point in letting idle connections waste resources for 8 hours. Applications which hold such connections open should learn, in the form of a forced disconnection, that they’re doing something wrong. Connection pools work beautifully with low settings, and can do their own keepalives, if they so choose.
  • sql_mode: I’ve discussed this at length before. My opinion is unchanged.
  • open_files_limit: Given that connections, threads, table descriptors, table file descriptors (depending on how you use InnoDB) and on-disk temporary tables all consume file descriptors on unix-like systems, and considering that raising this limit is inexpensive, I think open_files_limit should default to a few thousand. Why risk a crash on “too many open files”?
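
To make the above concrete, here is a minimal my.cnf sketch. The values are illustrative assumptions of mine, not recommendations from the post, and sql_mode is left out since the preferred value is the subject of that earlier discussion.

[mysqld]
# illustrative overrides only; tune to your own workload
group_concat_max_len = 4194304   # a few MB, rather than the 1024 default
wait_timeout         = 600       # 10 minutes, rather than 8 hours
open_files_limit     = 8192      # a few thousand, to avoid "too many open files"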

Continue reading » “Those oversized, undersized variables defaults”

mycheckpoint (rev. 132): custom monitoring, custom charts, process list dump

Revision 132 of mycheckpoint has been released. New and updated in this revision:

  • Custom monitoring: monitoring & charting for user-defined queries
  • HTML reports for custom monitoring
  • Process list dump upon alert notifications

Custom monitoring & charts

Custom monitoring allows the user to supply a query, the results of which will be monitored.

That is, mycheckpoint monitors status variables, replication status and OS metrics, but it cannot by itself monitor one’s application. This is why a user may supply a query such as:

SELECT COUNT(*) FROM shopping_cart WHERE is_pending=1

Such a query will tell an online store how many customers are in the midst of shopping. There is no argument that this number is worth monitoring. Given the above query, mycheckpoint will execute it per sample, and store the query’s result along with all other sampled data, to be aggregated by complex views which answer (a sketch of such a view follows the list):

  • What was the value per given sample?
  • What is the value difference for each sample?
  • What is the change per second, i.e. the rate?
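
The following is only a hedged sketch of the kind of view involved, not mycheckpoint’s actual schema; custom_samples and its columns are hypothetical names. It answers the three questions above by pairing each sample with the one preceding it.

CREATE VIEW custom_sample_rates AS
  SELECT
    cur.sample_ts,
    cur.sample_value,
    cur.sample_value - prev.sample_value AS value_diff,
    (cur.sample_value - prev.sample_value)
      / TIMESTAMPDIFF(SECOND, prev.sample_ts, cur.sample_ts) AS value_per_second
  FROM custom_samples AS cur
  JOIN custom_samples AS prev ON prev.sample_ts = (
    SELECT MAX(earlier.sample_ts)
    FROM custom_samples AS earlier
    WHERE earlier.sample_ts < cur.sample_ts
  );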

mycheckpoint goes one step further, and explicitly records another metric:

  • How much time did it take to take that sample?

Continue reading » “mycheckpoint (rev. 132): custom monitoring, custom charts, process list dump”

Choosing MySQL boolean data types

How do you implement True/False columns?

There are many ways to do it, each with its own pros and cons.

ENUM

Create your column as ENUM('F', 'T'), ENUM('N', 'Y') or ENUM('0', '1').

This is the method used in the mysql system tables (e.g. the mysql.user privileges table). It’s very simple and intuitive. It truly restricts the values to just two options, which serves the purpose well. It’s compact (just one byte).

A couple of disadvantages to this method (demonstrated right after the list):

  1. Enums are represented by numerical values (which is good) and start with 1 instead of 0. This means 'F' is 1 and 'T' is 2, and both translate to True when used directly in a boolean expression (e.g. IF(val, 'True', 'False') always yields 'True').
  2. There’s no real convention. Is it 'Y'/'N'? 'T'/'F'? 'P'/'N'? '1'/'0'?
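
A short demonstration of disadvantage #1, on a hypothetical table:

CREATE TABLE enum_bool_test (is_active ENUM('F', 'T') NOT NULL);
INSERT INTO enum_bool_test VALUES ('F'), ('T');
SELECT
  is_active,
  is_active+0 AS numeric_value,
  IF(is_active, 'True', 'False') AS boolean_result
FROM enum_bool_test;
-- 'F' is index 1 and 'T' is index 2; both are non-zero, so boolean_result is 'True' on both rows.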

CHAR(1)

Simple again. Proposed values are, as before, 'F', 'T' etc. This time there’s no way to limit the range of values. You cannot (in MySQL, unless using triggers; see the sketch below) prevent an 'X'.
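
A hedged sketch of the trigger workaround just mentioned (the table name is hypothetical, and SIGNAL requires MySQL 5.5 or later; older versions need an uglier trick to raise an error):

CREATE TABLE char_bool_test (is_active CHAR(1) NOT NULL);

DELIMITER $$
CREATE TRIGGER char_bool_test_bi BEFORE INSERT ON char_bool_test
FOR EACH ROW
BEGIN
  IF NEW.is_active NOT IN ('T', 'F') THEN
    SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'is_active must be T or F';
  END IF;
END$$
DELIMITER ;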

Watch out for the charset! If it’s utf8 you pay 3 bytes instead of just 1. And, again, 'T', 'F', 'Y', 'N' all evaluate the same in a boolean context: string-to-number conversion turns each of them into 0, i.e. False. It is possible to use the zero-valued character, but it defeats the purpose of using CHAR. Continue reading » “Choosing MySQL boolean data types”

Views: better performance with condition pushdown

Justin’s “A workaround for the performance problems of TEMPTABLE views” post on mysqlperformanceblog.com reminded me of a solution I once saw on a customer’s site.

The customer was using a nested views structure, up to a depth of some 8-9 views. There were a lot of aggregations along the way, and even the simplest query resulted in a LOT of subqueries, temporary tables and vast amounts of scanned data, even if it only returned a couple of rows.

While we worked to solve this, a developer showed me his own trick. His trick is now impossible to implement, but there’s a hack around this.

Let’s use the world database to illustrate. Look at the following view definition: Continue reading » “Views: better performance with condition pushdown”
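
The post’s actual view definition is behind the link above. Purely to sketch the class of problem, and not as the post’s example, here is a hypothetical aggregating view on the world database:

CREATE ALGORITHM=TEMPTABLE VIEW country_city_counts AS
  SELECT Country.Code, Country.Name, COUNT(*) AS num_cities
  FROM Country JOIN City ON City.CountryCode = Country.Code
  GROUP BY Country.Code, Country.Name;

-- The view aggregates, so it must be materialized: the following query first builds
-- the temporary table for ALL countries, and only then applies the WHERE condition.
-- Pushing the condition down into the view is what saves that work.
SELECT * FROM country_city_counts WHERE Code = 'NLD';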

A MyISAM backup is blocking as read-only, including mysqldump backup

Actually this is (almost) all I wanted to say. This is intentionally posted with all related keywords in the title, in the hope that a related search on Google will turn up this post on the first page.

I’m just still encountering companies who use MyISAM as their storage engine and are unaware that their nightly backup actually blocks their application, basically rendering their product unavailable for long minutes to hours on a nightly basis.

So this is posted as a warning for those who were not aware of this fact.

There is no hot (non-blocking) backup for MyISAM. Closest would be a file system snapshot, but even this requires flushing of tables, which may take a while to complete. If you must have a hot backup, then either use replication (and take the risk of the slave not being in complete sync with the master) or use another storage engine, such as InnoDB.
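
The snapshot procedure mentioned above, sketched; the snapshot command itself depends on your setup and appears only as a comment:

FLUSH TABLES WITH READ LOCK;
-- This may itself wait for long-running queries to complete, and it blocks writes while held.
-- Take the file system snapshot here (e.g. an LVM snapshot), from another shell.
SHOW MASTER STATUS;  -- optional: record the binary log position while the lock is held
UNLOCK TABLES;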

Replication configuration checklist

This post lists the essential and optional settings for a replication environment.

It does not explain how to create replicating slaves. See How To Setup Replication for that. However, not all configuration options are well understood, and their roles in varying architectures can change.

Here are the settings for a basic Master/Slave(s) replication architecture; a combined sketch follows the list.

Essential

  • log-bin: enable binary logging on the master. Replication is based on the master logging all modifying queries (INSERT/CREATE/ALTER/GRANT etc.), and the slaves replaying them.
  • server-id: each machine must have a unique server-id. A slave will not replay queries originating from a server with the same server-id as its own.
  • GRANT: grant a replication user the REPLICATION SLAVE privilege. The account’s host specification must cover all replication slave hosts.
  • expire-logs-days: automatically clean up the master’s binary logs once they are older than the given number of days. By default, binary logs are never removed.
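
A combined sketch of the essentials above; the server-ids, retention period, host mask and credentials are arbitrary examples, not recommendations.

# master my.cnf
[mysqld]
log-bin          = mysql-bin
server-id        = 1
expire-logs-days = 14

# each slave gets its own my.cnf with a unique server-id, e.g. server-id = 2

-- on the master:
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'192.168.0.%' IDENTIFIED BY 'repl_password';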

When working with Master/Slaves replication, one should be prepared for master failure and slave promotion to master. It may be desirable to identify a particular slave as the primary candidate for promotion.

Continue reading » “Replication configuration checklist”

EXPLAIN: missing db info

I’m further developing a general log hook, which can stream queries from the general log.

A particular direction I’m taking is to filter queries by their type of actions. For example, the tool (oak-hook-general-log) can be instructed to only stream out those queries which involve creation of a temporary table, or those which cause a filesort, a full index scan, etc.

This is done by evaluating query execution plans on the fly. I suspect the MySQL query analyzer roughly does the same (as a small part of what it does).

There’s almost nothing here one couldn’t do with sed/awk. However, I bumped into a couple of problems:

  1. The general log (and the mysql.general_log table in particular) does not indicate the particular database one is using for the query. Since the slow log does indicate this data, I filed a bug on this. I currently solve this by crossing the connection id with the process list, where the current database is listed (roughly as sketched after this list). It’s shaky, but mostly works.
  2. Just realized: there’s no DB info in the EXPLAIN output! It’s weird, since I’ve been EXPLAINing queries for years now. But I’ve always had the advantage of knowing the schema used: either because I was manually executing the query on a known schema, or mk-query-digest was kind enough to let me know.
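
A hedged sketch of the workaround in (1), assuming log output to TABLE is enabled and that general_log.thread_id matches the connection id shown in the process list:

SELECT gl.event_time, gl.thread_id, pl.DB AS current_db, gl.argument AS query_text
FROM mysql.general_log AS gl
JOIN INFORMATION_SCHEMA.PROCESSLIST AS pl ON pl.ID = gl.thread_id
WHERE gl.command_type = 'Query';

It only works while the connection is still alive, which is exactly why it’s shaky.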

Continue reading » “EXPLAIN: missing db info”

Discovery of the day: GROUP BY … DESC

I happened on a query where, by mistake, an

SELECT ... ORDER BY x DESC LIMIT 1

was written as

SELECT ... GROUP BY x DESC LIMIT 1

And it took me by surprise to realize GROUP BY x DESC is a valid statement. I looked it up: yep! It’s documented.

In MySQL, GROUP BY results are implicitly sorted by the grouped columns. You can override this by adding ORDER BY NULL (see past post). I wasn’t aware you could actually control the direction of that sort.
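
A quick illustration, on a hypothetical table t with an integer column x (GROUP BY ... DESC was valid syntax in the MySQL versions of the time):

SELECT x, COUNT(*) FROM t GROUP BY x;                -- implicitly sorted by x, ascending
SELECT x, COUNT(*) FROM t GROUP BY x DESC;           -- same grouping, descending order
SELECT x, COUNT(*) FROM t GROUP BY x ORDER BY NULL;  -- no implicit sorting at all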

Reducing locks by narrowing primary key

In a period of two weeks, I had two cases with the exact same symptoms.

Database users were experiencing low responsiveness. DBAs were seeing locks occurring on seemingly normal tables. In particular, looking at Innotop, it seemed that INSERTs were causing the locks.

In both cases, tables were InnoDB. In both cases, there was a PRIMARY KEY on the combination of all 5 columns of the table. And in both cases, there was no clear explanation as to why the PRIMARY KEY was chosen as such.

Choosing a proper PRIMARY KEY

Especially with InnoDB, which uses a clustered index structure, the PRIMARY KEY is of particular importance. Besides the fact that a bloated PRIMARY KEY bloats the entire clustered index and all secondary keys (see: The depth of an index: primer), it is also a source of locks. It’s true that any UNIQUE KEY can serve as a PRIMARY KEY. But not all such keys are good candidates. Continue reading » “Reducing locks by narrowing primary key”
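
The post’s resolution is behind the link; as a hedged sketch of the kind of change the title suggests (table and column names are hypothetical), one can introduce a narrow surrogate PRIMARY KEY and demote the wide column combination to a UNIQUE KEY:

ALTER TABLE activity_log
  ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT FIRST,
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id),
  ADD UNIQUE KEY activity_log_uq (account_id, product_id, action_type, action_date, source_ip);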

Database schema under version control

How many organizations use version control for development? Probably almost every single one.

How many store the database schema under version control? Alas, not as many.

Coupling one’s application version with its table schema is essential. Organizations who actively support multiple versions of their product understand that well. Smaller organizations do not always have this realization.

How is it done?

Ideally

Ideally one would have:

  • The schema, generated by hand
  • Essential data (INSERT INTO statements for those lookup tables without which you cannot have an application)
  • Sample data: sufficient real-life data on which to act. This would include customer data, logs, etc.

If you can work this way, then creating a staging environment consists of re-creating the entire schema from our schema code, and filling in the essential & sample data.

The thing with this method is that one does not (usually?) apply it on one’s live system. Say we were to add a column. On our live servers we would issue an ALTER TABLE t ADD COLUMN.

But this means we’re using different methods on our staging server and on our production server.

Incremental

Another kind of solution would be to hold:

  • The static schema, as before
  • Essential data, as before
  • Sample data, as before
  • A migration script, which is the concatenation of all ALTERs, CREATEs etc. applied since the static schema was taken (a sketch follows the list).
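
A sketch of such a migration file; the changes themselves are hypothetical examples:

-- migration.sql: everything applied since the static schema was last reset
ALTER TABLE customer ADD COLUMN loyalty_points INT UNSIGNED NOT NULL DEFAULT 0;
CREATE INDEX customer_email_idx ON customer (email);
INSERT INTO lookup_country (country_code, country_name) VALUES ('NL', 'Netherlands');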

Once in a while one can do a “reset”, and update the static schema with the existing design.

As you go along

This solution simply means “we apply the changes on staging; test + version them; then apply on production”.

A side effect of this solution is that the database generates the schema, but the schema does not generate the database as in the previous cases. This makes for an uglier solution, where you first apply the changes to the database, and then, based on what the database reports, enter the result into version control.

How to do that? Easiest would be to use mysqldump --routines --no-data. Some further parsing should be done to strip out the AUTO_INCREMENT values, which tend to change, as well as the surrounding variable settings (strip out the character set settings etc.).
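
A hedged sketch of that dump-and-normalize step; the schema name is a placeholder and the sed expressions are illustrative, to be adjusted to your dump format:

mysqldump --routines --no-data my_schema \
  | sed -e 's/ AUTO_INCREMENT=[0-9]*//' \
        -e '/^\/\*![0-9]* SET /d' \
  > schema/my_schema.sql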

Summary

However you do it, make sure you have some kind of version control on your schema. It pays off just as version control for your code does. You get to compare, understand the incremental changes, understand the change in design, etc.