Useful database analysis queries with INFORMATION_SCHEMA

November 26, 2008

A set of useful queries on INFORMATION_SCHEMA follows. These queries can be used when approaching a new database, to learn about some of its properties, or they can be regularly used on an existing schema, so as to verify its integrity.

I will present queries for:

  • Checking on database engines and size
  • Locating duplicate and redundant indexes
  • Checking on character sets for columns and tables, looking for variances
  • Checking on processes and long queries (only with MySQL 5.1)

Dimensions

The following query returns the total size per engine per database. For example, it is common that in a given database, all tables are InnoDB. But once in a while, and even though default-engine is set to InnoDB, someone creates a MyISAM table. This may break transactional behavior, or may cause a mysqldump --single-transaction to be ineffective.

See aggregated size per schema per engine:

SELECT TABLE_SCHEMA, ENGINE, COUNT(*) AS count_tables,
  SUM(DATA_LENGTH+INDEX_LENGTH) AS size,
SUM(INDEX_LENGTH) AS index_size FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'INFORMATION_SCHEMA')
  AND ENGINE IS NOT NULL GROUP BY TABLE_SCHEMA, ENGINE

Result example:

+--------------+--------+--------------+----------+------------+
| TABLE_SCHEMA | ENGINE | count_tables | size     | index_size |
+--------------+--------+--------------+----------+------------+
| test         | InnoDB |            3 | 12140544 |          0 |
| world        | InnoDB |            1 |  4734976 |          0 |
| world        | MyISAM |            5 | 10665303 |    4457472 |
+--------------+--------+--------------+----------+------------+

I may not have intended to, but it seems I have both MyISAM and InnoDB tables in the world database.

The index_size may be important with MyISAM when estimating the desired key_buffer_size.

See per table size (almost exactly as presented in INFORMATION_SCHEMA):

SELECT TABLE_SCHEMA, TABLE_NAME, ENGINE, 
 SUM(DATA_LENGTH+INDEX_LENGTH) AS size,
SUM(INDEX_LENGTH) AS index_size FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'INFORMATION_SCHEMA')
 AND ENGINE IS NOT NULL GROUP BY TABLE_SCHEMA, TABLE_NAME

Indexes

We will now turn to check for duplicate or redundant indexes. We begin by presenting the following table:

mysql> show create table City \G
*************************** 1. row ***************************
       Table: City
Create Table: CREATE TABLE `City` (
  `ID` int(11) NOT NULL auto_increment,
  `Name` char(35) character set utf8 NOT NULL default '',
  `CountryCode` char(3) NOT NULL default '',
  `District` char(20) NOT NULL default '',
  `Population` int(11) NOT NULL default '0',
  PRIMARY KEY  (`ID`),
  UNIQUE KEY `ID` (`ID`),
  KEY `Population` (`Population`),
  KEY `Population_2` (`Population`,`CountryCode`)
) ENGINE=MyISAM AUTO_INCREMENT=4080 DEFAULT CHARSET=latin1

We can see that the Population_2 index covers the Population index, so the latter is redundant and should be removed. We also see that the ID index is redundant, since there is a PRIMARY KEY on ID, which is in itself a unique key. How can we test such cases by querying the INFORMATION_SCHEMA? Turns out we can do that using the STATISTICS table.

[Update: thanks to Roland Bouman's comments. The following queries only consider BTREE indexes, and do not verify FULLTEXT or HASH indexes]

See if some index is a prefix of another (in which case it is redundant):

SELECT * FROM (
  SELECT TABLE_SCHEMA, TABLE_NAME, INDEX_NAME,
    GROUP_CONCAT(COLUMN_NAME ORDER BY SEQ_IN_INDEX) AS columns
  FROM `information_schema`.`STATISTICS`
  WHERE TABLE_SCHEMA NOT IN ('mysql', 'INFORMATION_SCHEMA')
    AND NON_UNIQUE = 1 AND INDEX_TYPE='BTREE' 
  GROUP BY TABLE_SCHEMA, TABLE_NAME, INDEX_NAME
) AS i1 INNER JOIN (
  SELECT TABLE_SCHEMA, TABLE_NAME, INDEX_NAME,
    GROUP_CONCAT(COLUMN_NAME ORDER BY SEQ_IN_INDEX) AS columns
  FROM `information_schema`.`STATISTICS`
  WHERE INDEX_TYPE='BTREE' 
  GROUP BY TABLE_SCHEMA, TABLE_NAME, INDEX_NAME
) AS i2
USING (TABLE_SCHEMA, TABLE_NAME)
WHERE i1.columns != i2.columns AND LOCATE(CONCAT(i1.columns, ','), i2.columns) = 1

The above query lists pairs of indexes in which one of them is a true prefix of the other. I'm using GROUP_CONCAT(COLUMN_NAME ORDER BY SEQ_IN_INDEX) to aggregate columns per index, by order of appearance in that index.
The query only considers cases when the prefix (the "shorter") index is non-unique. Else wise there is no redundancy, as the uniqueness of the index imposes a constraint which is not achieved by the "longer" index.

Result example:

+--------------+------------+------------+------------+--------------+------------------------+
| TABLE_SCHEMA | TABLE_NAME | INDEX_NAME | columns    | INDEX_NAME   | columns                |
+--------------+------------+------------+------------+--------------+------------------------+
| world        | City       | Population | Population | Population_2 | Population,CountryCode |
+--------------+------------+------------+------------+--------------+------------------------+

See if any two indexes are identical:

SELECT * FROM (
  SELECT TABLE_SCHEMA, TABLE_NAME, INDEX_NAME,
    GROUP_CONCAT(COLUMN_NAME ORDER BY SEQ_IN_INDEX) AS columns, NON_UNIQUE
  FROM `information_schema`.`STATISTICS`
  WHERE TABLE_SCHEMA NOT IN ('mysql', 'INFORMATION_SCHEMA')
  AND INDEX_TYPE='BTREE'
  GROUP BY TABLE_SCHEMA, TABLE_NAME, INDEX_NAME
) AS i1 INNER JOIN (
  SELECT TABLE_SCHEMA, TABLE_NAME, INDEX_NAME,
    GROUP_CONCAT(COLUMN_NAME ORDER BY SEQ_IN_INDEX) AS columns, NON_UNIQUE
  FROM `information_schema`.`STATISTICS`
  WHERE INDEX_TYPE='BTREE'
  GROUP BY TABLE_SCHEMA, TABLE_NAME, INDEX_NAME
) AS i2
USING (TABLE_SCHEMA, TABLE_NAME)
WHERE i1.columns = i2.columns AND i1.NON_UNIQUE = i2.NON_UNIQUE
  AND i1.INDEX_NAME < i2.INDEX_NAME

The above checks for unique or non-unique indexes alike. It checks for indexes with identical columns list (and in the same order, of course). Any two indexes having the same list of columns imply a redundancy. If both are unique or both are non-unique, either can be removed. If one is unique and the other is not, the non-unique index should be removed.

Result example:

+--------------+------------+------------+---------+------------+------------+---------+------------+
| TABLE_SCHEMA | TABLE_NAME | INDEX_NAME | columns | NON_UNIQUE | INDEX_NAME | columns | NON_UNIQUE |
+--------------+------------+------------+---------+------------+------------+---------+------------+
| world        | City       | PRIMARY    | ID      |          0 | ID         | ID      |          0 |
+--------------+------------+------------+---------+------------+------------+---------+------------+

You may also wish to take a look at the excellent mk-duplicate-key-checker, a maatkit utility by Baron Schwartz.

Character sets

Show the character sets for all tables:

SELECT TABLE_SCHEMA, TABLE_NAME, CHARACTER_SET_NAME, TABLE_COLLATION
FROM INFORMATION_SCHEMA.TABLES
INNER JOIN INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
  ON (TABLES.TABLE_COLLATION = COLLATION_CHARACTER_SET_APPLICABILITY.COLLATION_NAME)
WHERE TABLE_SCHEMA NOT IN ('mysql', 'INFORMATION_SCHEMA')

Surprisingly, the TABLES table does not include the character set for the table, only the collation, so we must join with COLLATION_CHARACTER_SET_APPLICABILITY to get the character set for that collation. Yes, it's more normalized this way, but INFORMATION_SCHEMA is not too normalized anyway.

See all the textual columns, along with their character sets:

SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA NOT IN ('mysql', 'INFORMATION_SCHEMA')
  AND CHARACTER_SET_NAME IS NOT NULL
ORDER BY TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME

See those columns for which the character set or collation is different from the table's character set and collation:

SELECT columns.TABLE_SCHEMA, columns.TABLE_NAME, COLUMN_NAME,
  CHARACTER_SET_NAME AS column_CHARSET,
  COLLATION_NAME AS column_COLLATION,
  table_CHARSET, TABLE_COLLATION
FROM (
  SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
  FROM INFORMATION_SCHEMA.COLUMNS
  WHERE TABLE_SCHEMA NOT IN ('mysql', 'INFORMATION_SCHEMA')
    AND CHARACTER_SET_NAME IS NOT NULL
) AS columns INNER JOIN (
  SELECT TABLE_SCHEMA, TABLE_NAME, CHARACTER_SET_NAME AS table_CHARSET, TABLE_COLLATION
  FROM INFORMATION_SCHEMA.TABLES
  INNER JOIN INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
    ON (TABLES.TABLE_COLLATION = COLLATION_CHARACTER_SET_APPLICABILITY.COLLATION_NAME)
) AS tables
ON (columns.TABLE_SCHEMA = tables.TABLE_SCHEMA AND columns.TABLE_NAME = tables.TABLE_NAME)
WHERE (columns.CHARACTER_SET_NAME != table_CHARSET OR columns.COLLATION_NAME != TABLE_COLLATION)
ORDER BY TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME

Result example:

+--------------+------------+-------------+----------------+------------------+---------------+-------------------+
| TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME | column_CHARSET | column_COLLATION | table_CHARSET | TABLE_COLLATION   |
+--------------+------------+-------------+----------------+------------------+---------------+-------------------+
| world        | City       | Name        | utf8           | utf8_general_ci  | latin1        | latin1_swedish_ci |
+--------------+------------+-------------+----------------+------------------+---------------+-------------------

Processes (MySQL 5.1)

With MySQL 5.1 comes a boost to INFORMATION_SCHEMA. Among the new tables we can find the PROCESSLIST table, as well as GLOBAL_VARIABLES and GLOBAL_STATUS. Together with the new Event Scheduler, it seems the sky is the limit.

See which processes are active:

SELECT * FROM information_schema.PROCESSLIST WHERE COMMAND != 'Sleep'

Show slow queries:

SELECT * FROM information_schema.PROCESSLIST WHERE COMMAND != 'Sleep' AND TIME > 4

How many processes per user?

SELECT USER, COUNT(*) FROM information_schema.PROCESSLIST GROUP BY USER

How many processes per host?

SELECT SUBSTR(HOST, 1, LOCATE(':',HOST)-1) AS hostname, COUNT(*)
FROM information_schema.PROCESSLIST GROUP BY hostname

Along with the Event Scheduler, a stored procedure may decide to KILL processes executing for more than 10 minutes, KILL users who have too many connections, perform some logging on connections and more.

Conclusion

I have presented what I think is a set of useful queries. When I approach a new database I use these to get an overall understanding of what's in it. Finding duplicate indexes can explain a lot about how the designers or developers think the database should behave. Looking at the non-default character sets shows if textual columns have been carefully designed or not. For example, querying for non-default columns characters and getting no results may imply that many textual columns have improper character sets.

  • Hi! Nice post!

    Few comments though....

    In the table size query you do:

    WHERE TABLE_SCHEMA NOT IN ('mysql', 'INFORMATION_SCHEMA')
    AND ENGINE IS NOT NULL

    I am assuming you added ENGINE IS NOT NULL to get rid of views. Personally, I feel the WHERE clause would be better if written like so:

    WHERE TABLE_TYPE = 'BASE TABLE'

    This gets rid of all information schema tables (since those are all 'SYSTEM VIEW's). It does leave the tables in the mysql database in the result, but I consider that an advantage.

    Second, about your redundant indexes query:

    "See if some index is a prefix of another (in which case it is redundant):"

    This is not completely true. That is, you need to take the indexing algorithm in account too. If I have

    create table books (
    title varchar(50)
    , description varchar(255)
    , index(title)
    , fulltext index(title, description)
    )
    engine = MyISAM

    then the index on title will be a BTREE index which has radicaly different properties than the FULLTEXT index. So I would certainly not claim the index on title is redundant in this case.

    Another, interesting, example is when working with MEMORY tables. For MEMORY tables we can have BTREE and HASH indexes. Again, like we saw with FULLTEXT we should take the algorithm into account when comparing indexes. But another, more subtle thing, comes into play in this case too.

    If I have a HASH index on columns (A,B) and another HASH index on columns (B,A) then, according to your arguments, we should not consider either of these indexes redundant as they are not a prefix of each another. However, for HASH indexes, the order of the columns does not matter at all - HASH index can be used only if a value is known for all columns in the HASH index - the HASH function is applied to all column positions to compute a HASH code that is used to do the lookup.

    So a HASH index is redundant if there is another HASH index that has the same set of columns *in any order*.

    Here on mysqlforge you can find some queries on the information schema that i feel do a better job:

    http://forge.mysql.com/tools/tool.php?id=45

    kind regards,

    Roland Bouman

  • Roland,
    Thanks for your comments,

    1. Indeed you are right, I used the "ENGINE IS NOT NULL" to get rid of the views. I was considering reading the TABLE_TYPE, but decided to go for the NOT NULL. There are many column in the INFORMATION_SCHEMA whose values I don't feel comfortable with (values like 'YES' and 'NO', instead of 1, 0 etc.)

    2. Again, you are right. In this post I did not consider FULLTEXT and HASH keys at all. I don't get to cross them much; I'll review the queries.

    Thanks for the reference to your Redundant Index Finder. I'll take a closer look later on. Meanwhile it looks very interesting.

    Regards,
    Shlomi

  • Pingback: code.openark.org » Blog Archive » MySQL’s character sets and collations demystified()

 
Powered by Wordpress and MySQL. Theme by openark.org