Baffling 5.7 global/status variables issues, unclean migration path

MySQL 5.7 introduces a change in the way we query for global variables and status variables: the INFORMATION_SCHEMA.(GLOBAL|SESSION)_(VARIABLES|STATUS) tables are now deprecated and empty. Instead, we are to use the respective performance_schema.(global|session)_(variables|status) tables.

But the change goes further than that; there is also a security change. Oracle has created a pitfall by making two changes at the same time:

  1. Variables/status moved to a different table
  2. Privileges required on said table

As an example, my non-root user gets:

mysql> show session variables like 'tx_isolation';
ERROR 1142 (42000): SELECT command denied to user 'normal_user'@'my_host' for table 'session_variables'

Who gets affected by this? Nearly everyone and everything.

  • Your Nagios will not be able to read status variables
  • Your ORM will not be able to determine session variables
  • Your replication user will fail connecting (see this post by Giuseppe)
  • And most everyone else.

The problem with the above is that it involves two unrelated changes to your setup, which are not entirely simple to coordinate:

  1. Change your app code to choose the correct schema (information_schema vs. performance_schema)
  2. GRANT the permissions on your database

Perhaps at this point you still do not consider this to be a problem. You may be thinking: well, let’s first prepare by creating the GRANTs, and once that is in place, we can, at our leisure, modify the code.

Not so fast. Can you really that simply create those GRANTs?

Migration woes

How do you migrate to a new MySQL version? You do not reinstall all your servers. You want an easy migration path, and that path is: introduce one or two slaves of a newer version, see that everything works to your satisfaction, slowly upgrade all your other slaves, eventually switchover/upgrade your master.

This should not be any different for 5.7. We would like to provision a 5.7 slave in our topologies and just see that everything works. Well, we have, and things don’t just work. Our Nagios stopped working for that 5.7 slave. Orchestrator started complaining (by this time I had already fixed it to be more tolerant of the 5.7 problems, so no crashes here).

I hope you see the problem by now.

You cannot issue a GRANT SELECT ON performance_schema.global_variables TO ‘…’ on your 5.6 master.

The table simply does not exist there, which means the statement will not go to binary logs, which means it will not replicate on your 5.7 slave, which means you will not be able to SHOW GLOBAL VARIABLES on your slave, which means everything remains broken.

Yes, you can issue this directly on your 5.7 slaves. It’s doable, but undesired: it’s ugly in terms of automation (and will quite possibly break some assumptions and sanity checks your automation uses) and in terms of validity testing, and it’s unfriendly to GTID (make sure to SET SQL_LOG_BIN=0 before issuing it).
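If you do go down that road, here is roughly what it looks like (a sketch; the account name follows the earlier example, and which of the four tables you grant on depends on what your tooling actually reads):

-- on each 5.7 slave; kept out of the binary log so GTID sets stay consistent
SET SESSION SQL_LOG_BIN=0;
GRANT SELECT ON performance_schema.global_variables TO 'normal_user'@'my_host';
GRANT SELECT ON performance_schema.session_variables TO 'normal_user'@'my_host';
GRANT SELECT ON performance_schema.global_status TO 'normal_user'@'my_host';
GRANT SELECT ON performance_schema.session_status TO 'normal_user'@'my_host';
SET SESSION SQL_LOG_BIN=1;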

WHY in the first place?

It seems like a security thing. I’m not sure whether this was intended. So you prevent a SHOW GLOBAL VARIABLES for a normal user. Makes sense. And yet:

mysql> show global variables like 'hostname';
ERROR 1142 (42000): SELECT command denied to user 'normal_user'@'my_host' for table 'global_variables'

mysql> select @@global.hostname;
+---------------------+
| @@global.hostname   |
+---------------------+
| myhost.mydomain.com |
+---------------------+

mysql> select @@version;
+--------------+
| @@version    |
+--------------+
| 5.7.8-rc-log |
+--------------+

Seems like I’m allowed access to that info after all. So it’s not strictly a security design decision. For status variables, I admit, I don’t have a similar workaround.

Solutions?

The following are meant to be solutions, but do not really solve the problem:

  • SHOW commands. SHOW GLOBAL|SESSION VARIABLES|STATUS will work properly, and will implicitly know whether to provide the results via information_schema or performance_schema tables.
    • But, aren’t we meant to be happier with SELECT queries? So that I can really do stuff that is smarter than LIKE ‘variable_name%’?
    • And of course you cannot use SHOW in server side cursors. Your stored routines are in a mess now.
    • This does not solve the GRANTs problem.
  • show_compatibility_56: a boolean variable introduced in 5.7. It truly is a time-travel-paradox novel in disguise, in multiple respects.
    • Documentation introduces it, and says it is deprecated.
      • time-travel-paradox :O
    • But it actually works in 5.7.8 (latest)
      • time-travel-paradox plot thickens
    • Your automation scripts do not know in advance whether your MySQL has this variable (see the sketch after this list)
      • Hence SELECT @@global.show_compatibility_56 will produce an error on 5.6
      • But the “safe” way of SHOW GLOBAL VARIABLES LIKE ‘show_compatibility_56’ will fail on a privilege error on 5.7
      • time-travel-paradox :O
    • As advised by my colleague Simon J. Mudd, show_compatibility_56 now defaults to OFF. I support this line of thought; or else it’s old_passwords=1 all over again.
    • show_compatibility_56 doesn’t solve the GRANTs problem.
    • This does not solve any migration path. It just postpones the moment when I will hit the same problem. When I flip the variable from “1” to “0”, I’m back at square one.
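To illustrate the detection dilemma above: the only version-agnostic probe I can offer is to attempt the 5.7-only variable and catch the error (a sketch; 1193 is MySQL’s standard “unknown system variable” error):

SELECT @@global.show_compatibility_56;
-- on 5.6: ERROR 1193 (HY000): Unknown system variable 'show_compatibility_56'
-- on 5.7: returns the variable's current value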

Suggestion

I claim security is not the issue, as presented above. I claim Oracle will yet again fall into the same trap as 5.6’s no-easy-way-to-migrate-to-GTID, if the current solution remains unchanged. I claim that there have been too many changes at once. Therefore, I suggest one of the following two alternative flows:

  1. Flow 1: keep information_schema, later migration into performance_schema
    • In 5.7, information_schema tables should still produce the data.
    • No security constraints on information_schema
    • Generate WARNINGs on reading from information_schema (“…this will be deprecated…”)
    • performance_schema also available. With security constraints, whatever.
    • In 5.8 remove information_schema tables; we are left with performance_schema only.
  2. Flow 2: easy migration into performance_schema:
    • In 5.7, performance_schema tables should not require any special privileges. Any user can read from them.
    • Keep show_compatibility_56 as it is.
    • SHOW commands choose between information_schema or performance_schema on their own — just as things are done now.
    • In 5.8, performance_schema tables will require SELECT privileges.

As always, I love the work done by the engineers; and I love how they listen to the community.

Comments are most welcome. Have I missed the simple solution here? Are there even more complications to these features? Thoughts on my suggested two flows?

[UPDATE 2015-08-19]

Please see this followup by Morgan Tocker of Oracle.

common_schema & openark-kit in the media: #DBHangOps, OurSQL

#DBHangOps

I had the pleasure of joining @DBHangOps today and speaking about common_schema and openark-kit. What was meant to be a 15-minute session turned into 50 — sorry, people, I don’t talk as much at home, but when it comes to my pet projects…

I also realized I had been missing out on a great event: DBHangOps is a hangout where you can chat and discuss MySQL & related technologies with friends and colleagues, with whom you typically only meet at conferences. I will certainly want to attend future events.

Thanks to John Cesario and Geoffrey Anderson who invited me to talk, and to the friends and familiar faces who attended; I was happy to talk about my work, and very interested in hearing about how it’s being put to use. We also had time to discuss ps_helper with none other than Mark Leith!

The video is available on Twitter/YouTube.

OurSQL

openark-kit has also been featured on the OurSQL podcast by Sheeri & Gerry, who gave great coverage of some of the tools. I will disclose that more is to come; I’m happy this is in capable hands and look forward to hearing the next episode!

Hierarchical data in INFORMATION_SCHEMA and derivatives

Just how often do you encounter hierarchical data? Consider a table with some parent-child relation, like this classic employee table:

CREATE TABLE employee (
  employee_id INT UNSIGNED PRIMARY KEY,
  employee_name VARCHAR(100),
  manager_id INT UNSIGNED,
  CONSTRAINT `employee_manager_fk` FOREIGN KEY (manager_id) REFERENCES employee (employee_id)
) engine=innodb
;
+-------------+---------------+------------+
| employee_id | employee_name | manager_id |
+-------------+---------------+------------+
|           1 | Rachel        |       NULL |
|           2 | John          |          1 |
|           3 | Stan          |          1 |
|           4 | Naomi         |          2 |
+-------------+---------------+------------+

Questions like "What is the (nested) list of employees managed by Rachel?" or "Get Naomi’s chain of command" are classics. There are sometimes dependencies: if John leaves, does that mean Naomi loses her job, or does she get promoted, or something else? If John and Stan are on sick leave, does Rachel have any reason to come to work?

Hierarchical data is not limited to a single-table structure. Sometimes it takes a combination of a few tables (it’s just a JOIN) to make out the parent-child relation.

Hierarchical data is difficult to manage with SQL. This is especially true for MySQL, which does not support the WITH recursive query syntax.
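To make the pain concrete, here is the common workaround on the employee table above (a sketch): without recursion, one self-joins to a fixed, guessed depth, and the query must be rewritten whenever the hierarchy grows deeper.

SELECT
  e.employee_name,
  m1.employee_name AS manager,
  m2.employee_name AS managers_manager
FROM employee AS e
LEFT JOIN employee AS m1 ON m1.employee_id = e.manager_id
LEFT JOIN employee AS m2 ON m2.employee_id = m1.manager_id
WHERE e.employee_name = 'Naomi';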

Where can you find hierarchical data?

Even if you do not produce it yourself, MySQL’s INFORMATION_SCHEMA has some for you. Partly obvious, partly implicit, here are some examples:

Foreign Keys

This is probably the most obvious hierarchical dataset in INFORMATION_SCHEMA. The KEY_COLUMN_USAGE table lists the table dependencies based on foreign key constraints. It’s a bit confusing, as it also lists UNIQUE constraints, and the complete list of FKs is in REFERENTIAL_CONSTRAINTS — but the latter table does not list the dependencies. You typically want to join both tables to get the complete information.

So, taking sakila’s DVD rental shop sample, the film_actor table depends on film as well as on actor via foreign keys. film can depend on the category table, etc. The hierarchies can turn out to be very complex, with multiple roots and very deep branches.

Dependency questions are very clear when speaking of foreign keys: what happens when we delete some film record? Does it hold that we also delete all references to that film in film_actor? Or do we deny deletion of said film in such a case?

Redundant Keys

A KEY (col1, col2) makes KEY (col1) redundant. The latter is not strictly required (though you may wish to keep it for covering-index performance reasons). The list of dominant-redundant keys makes for hierarchical data. It is typically very shallow (keys can only be redundant within the scope of a single table — how deep can you get with such a small dataset?)

There is no immediate listing of redundant keys in INFORMATION_SCHEMA, but it can be inferred by joining the STATISTICS table onto itself. It is not immediate, since you need to do some aggregation and check for particular cases. For example, UNIQUE KEY (col1, col2) is actually made redundant by UNIQUE KEY (col1), which is just the opposite of our previous example.
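A rough sketch of that self-join (not common_schema’s actual implementation, and deliberately ignoring the UNIQUE special case just mentioned): concatenate each index’s columns in order, then look for indexes whose column list is a strict prefix of another index’s column list.

SELECT
  s1.TABLE_SCHEMA, s1.TABLE_NAME,
  s1.INDEX_NAME AS redundant_index_name,
  s2.INDEX_NAME AS dominant_index_name
FROM
  (SELECT TABLE_SCHEMA, TABLE_NAME, INDEX_NAME,
      GROUP_CONCAT(COLUMN_NAME ORDER BY SEQ_IN_INDEX) AS index_columns
    FROM INFORMATION_SCHEMA.STATISTICS
    GROUP BY TABLE_SCHEMA, TABLE_NAME, INDEX_NAME) AS s1
JOIN
  (SELECT TABLE_SCHEMA, TABLE_NAME, INDEX_NAME,
      GROUP_CONCAT(COLUMN_NAME ORDER BY SEQ_IN_INDEX) AS index_columns
    FROM INFORMATION_SCHEMA.STATISTICS
    GROUP BY TABLE_SCHEMA, TABLE_NAME, INDEX_NAME) AS s2
  USING (TABLE_SCHEMA, TABLE_NAME)
WHERE
  s1.INDEX_NAME != s2.INDEX_NAME
  AND s2.index_columns LIKE CONCAT(s1.index_columns, ',%');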

common_schema provides this implicit information, now turned explicit, in the form of the redundant_keys view: the redundant_index_name and dominant_index_name columns make for a parent-child relationship.

Dependency questions are a bit redundant here: the general objective is to not have dependencies. So get rid of redundant keys – and do so wisely. There’s a good discussion of index redundancies in the redundant_keys documentation.

Locked transactions

A transaction is locked because another transaction holds some locks it needs. But the locked transaction may itself hold locks needed by yet other transactions. And so we can get a hierarchy of locked transactions. The “parent” is the one blocking the “child”.

Combining InnoDB’s INNODB_TRX and INNODB_LOCK_WAITS tables (the latter joined back onto INNODB_TRX), we get this information. common_schema provides this inferred data in the form of innodb_locked_transactions.
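The underlying join looks roughly like this (a sketch over the InnoDB plugin tables; the actual view derives more than this):

SELECT
  w.requesting_trx_id AS locked_trx_id,
  blocked.trx_mysql_thread_id AS locked_thread_id,
  w.blocking_trx_id AS blocking_trx_id,
  blocking.trx_mysql_thread_id AS blocking_thread_id
FROM INFORMATION_SCHEMA.INNODB_LOCK_WAITS AS w
JOIN INFORMATION_SCHEMA.INNODB_TRX AS blocked ON blocked.trx_id = w.requesting_trx_id
JOIN INFORMATION_SCHEMA.INNODB_TRX AS blocking ON blocking.trx_id = w.blocking_trx_id;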

Hierarchies in locked-transactions could be deep, and they can most commonly be wide. A single transaction can lock dozens of others.

Dependency questions are "would killing this transaction release all other waiting transactions?"; "What is the one transaction I need to kill in order to release the bottleneck?"; "Why are these transactions related in the first place? Can I remove this dependency?"; etc.

Locks

You can look at the same dataset as above from a different angle. Instead of looking at transaction-lock-transaction, you can look at lock-transaction-lock. Which locks are causing other locks to be held? This is counter-intuitive to our understanding of how things work, but is valid nonetheless.

Views

A VIEW can query a table or another view. It can join multiple views. This hierarchy of view-reading-from-view can turn out to be complex; if not for the human mind then for the optimizer.

Surprisingly, there is no data in INFORMATION_SCHEMA, other than the CREATE VIEW clause, to help us out in resolving these dependencies. Even more surprising is MySQL’s own inability to reach clear conclusions itself. The view definition clause is parsed, re-parsed and re-evaluated whenever information on “parent” views is required. For example, try to DESCRIBE a view. How does MySQL deduce the data types of columns? It dives head first into the hierarchy, crawls and parses view definitions, till it resolves the answer. The mycheckpoint project uses view hierarchies intensively. It draws power from the hierarchy and produces some surprising data (charts from raw data, for example). But it also suffers from MySQL’s indirect inference of views. Checking up on a deeply nested mycheckpoint view in INFORMATION_SCHEMA makes for a heavyweight dive for MySQL into locks and table handles.

Dependency questions are "what is the type of this column?", "are there any TEMPTABLE views along the chain of execution? Or are they all MERGE views?" and more.

Killing InnoDB idle transactions

The issue of terminating long-idle open InnoDB transactions has been discussed recently by many. I wish to add my share, by proposing a quick and clean solution via common_schema.

common_schema 1.2 provides the innodb_transactions view, which relies on INNODB_TRX – one of the InnoDB Plugin views in INFORMATION_SCHEMA – as well as on PROCESSLIST, and so is able to determine with certainty that a transaction has been idle for a long time.

innodb_transactions offers a sql_kill_query column, which produces a ‘KILL QUERY 12345’ type of value. So we can:

SELECT sql_kill_query FROM innodb_transactions WHERE trx_idle_seconds >= 10; 
+-------------------+
| sql_kill_query    |
+-------------------+
| KILL QUERY 292509 |
| KILL QUERY 292475 |
+-------------------+ 

common_schema‘s useful eval() routine allows us to actually invoke those KILL statements, all in a one-liner:

call eval('SELECT sql_kill_query FROM innodb_transactions WHERE trx_idle_seconds >= 10');

Technical details

  • trx_idle_seconds notes the time, in seconds, the transaction has been idle, or 0 if the transaction is not idle at all.
  • sql_kill_query is a self-generated SQL query which kills the running query, e.g. ‘KILL QUERY 12345’.
  • eval() takes a query as text, retrieves the SQL resulting column, and executes it live.

Background details

The connection between INNODB_TRX and PROCESSLIST is not synchronous. It is possible that by the time one queries INNODB_TRX, the PROCESSLIST data has changed (e.g. the next query is already replacing the one you were considering in INNODB_TRX). But in our case it is of little consequence: we are interested in transactions that have been idle for quite some time. Say, 10 seconds. So we are not troubled by having 200 queries per second changing under our hands.

If the transaction has been asleep for 10 seconds, and we decide to kill it, well, it is possible that just as we kill it, it will turn active again. It’s a risk we take no matter what kind of solution we apply, since there’s no atomic “get-status-and-kill” operation on InnoDB transactions.

The above solution is manual: one must invoke the query which kills the idle transactions. This is as opposed to a built-in server feature which does the same. Events can be used to semi-automate this: one can invoke this query once every 10 seconds, for example, as sketched below.
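A minimal sketch of such an event (assuming common_schema is installed and the event scheduler is enabled; the event name is made up):

CREATE EVENT kill_idle_innodb_transactions
  ON SCHEDULE EVERY 10 SECOND
DO
  CALL common_schema.eval('SELECT sql_kill_query FROM common_schema.innodb_transactions WHERE trx_idle_seconds >= 10');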

See the many related and inspiring solutions below:

How common_schema split()s tables – internals

This post exposes some of the internals, and the SQL behind QueryScript’s split. common_schema/QueryScript 1.1 introduces the split statement, which auto-breaks a “large” query (one which operates on large tables as a whole or without keys) into smaller queries, and executes them in sequence.

This makes for easier transactions, fewer locks held, and potentially (at the user’s discretion) more idle time released back to the database. split has similar concepts to oak-chunk-update and pt-archiver, but works differently, and is implemented entirely in SQL, on the server side.

Take the following statement as example:

split (UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR)
  pass;

It yields (roughly) the following statements:

UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '1')) OR ((`inventory`.`inventory_id` = '1'))) AND (((`inventory`.`inventory_id` < '1000')) OR ((`inventory`.`inventory_id` = '1000'))));
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '1000'))) AND (((`inventory`.`inventory_id` < '2000')) OR ((`inventory`.`inventory_id` = '2000'))));
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '2000'))) AND (((`inventory`.`inventory_id` < '3000')) OR ((`inventory`.`inventory_id` = '3000'))));
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '3000'))) AND (((`inventory`.`inventory_id` < '4000')) OR ((`inventory`.`inventory_id` = '4000'))));
UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`inventory`.`inventory_id` > '4000'))) AND (((`inventory`.`inventory_id` < '4581')) OR ((`inventory`.`inventory_id` = '4581'))));

(I say "roughly" because internally there are user defined variables at play, but for convenience, I spell out the actual values as constants.)

How does that work?

common_schema works on server side. There is no Perl script or anything. It must therefore use server-side operations to:

  • Identify table to be split
  • Analyze the table in the first place, deciding how to split it
  • Analyze the query, deciding on how to rewrite it
  • Split the table (logically) into unique and distinct chunks
  • Work out the query on each such chunk

Following is an internal look at how common_schema does all the above.

Identifying the table

When the query operates on a single table, split is able to parse the query’s SQL and find out that table. When multiple tables are involved, split requires user instruction: which table is it that the query should be split by?

Analyzing the table

Table analysis is done via a similar method to candidate_keys_recommended. It is almost identical, only it uses INFORMATION_SCHEMA optimizations to make the query short and lightweight. Simulating the analysis using candidate_keys_recommended, we get:

mysql> select * from candidate_keys_recommended where table_name='inventory' \G
*************************** 1. row ***************************
          table_schema: sakila
            table_name: inventory
recommended_index_name: PRIMARY
          has_nullable: 0
            is_primary: 1
 count_column_in_index: 1
          column_names: inventory_id

This is cool, simple and very easy to work with: we choose to split the table via the inventory_id column, which is conveniently an integer. We’ll soon see split can handle complex cases as well.

Analyzing the query

This is done in part via Roland’s query_analysis_routines, and in part by just parsing the query, looking for WHERE, GROUP BY, LIMIT etc. clauses.

The nice part is injecting a WHERE condition, which didn’t appear in the original query. That WHERE condition is what limits the query to a distinct chunk of rows.

Splitting the table

With a single INTEGER PRIMARY KEY this sounds simple, right? Take rows 1..1,000, then 1,001..2,000, then 2,001..3,000 etc.

Wrong: even with this simple scenario, things are much more complex. Are the numbers successive? What if there are holes? What if there is a 1,000,000 gap between every two numbers? What if there are multiple holes of differing size and frequency?

And what if we have two columns in our UNIQUE KEY? What if one of them is textual and the other a TIMESTAMP, neither an INTEGER?

split doesn’t work in that naive way. It makes no assumptions on the density of values. It only requires:

  • some UNIQUE KEY to work with,
  • which has no NULL values.

Given the above, it uses User Defined Variables to set up the chunks. With our single INTEGER column, the minimum value is set like this:

select 
  inventory_id 
from 
  `sakila`.`inventory` 
order by 
  inventory_id ASC 
limit 1  
into @_split_column_variable_min_1
;

This sets the first value of the first chunk. What value terminates this chunk? It is calculated like this:

select 
  inventory_id 
from (
  select 
    inventory_id 
  from 
    `sakila`.`inventory` 
  where 
    (((`inventory`.`inventory_id` > @_split_column_variable_range_start_1)) OR ((`inventory`.`inventory_id` = @_split_column_variable_range_start_1))) and (((`inventory`.`inventory_id` < @_split_column_variable_max_1)) OR ((`inventory`.`inventory_id` = @_split_column_variable_max_1))) 
  order by 
    inventory_id ASC limit 1000 
  ) sel_split_range  
order by 
  inventory_id DESC 
limit 1  
into @_split_column_variable_range_end_1
;

Now there’s a query you wouldn’t want to work out by hand, now would you?

The cool part here is that the above works well for any type of column; this doesn’t have to be an INTEGER. Dates, strings etc. are all just fine.

The above also works well for multiple columns, where the query gets more complicated (see following).

Working out the query per chunk

This part is the easy one, now that all the hard work is done. We know how to manipulate the query, we know the lower and upper boundaries of the chunk, so we just fill in the values and execute.
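Conceptually (a sketch, not common_schema’s actual code), each step boils down to building the chunked statement and executing it via a prepared statement:

-- simplified range condition; the real generated WHERE is shown above
SET @chunk_query := CONCAT(
  'UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR',
  ' WHERE inventory_id >= ', @_split_column_variable_range_start_1,
  ' AND inventory_id <= ', @_split_column_variable_range_end_1);
PREPARE chunk_stmt FROM @chunk_query;
EXECUTE chunk_stmt;
DEALLOCATE PREPARE chunk_stmt;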

Multi-columns keys

Consider a similar query on sakila.film_actor, where the PRIMARY KEY is a compound of two columns:

split (UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR)
  throttle 2;

The chunked queries will look like this:

UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '1')) OR ((`film_actor`.`actor_id` = '1') AND (`film_actor`.`film_id` > '1')) OR ((`film_actor`.`actor_id` = '1') AND (`film_actor`.`film_id` = '1'))) AND (((`film_actor`.`actor_id` < '39')) OR ((`film_actor`.`actor_id` = '39') AND (`film_actor`.`film_id` < '293')) OR ((`film_actor`.`actor_id` = '39') AND (`film_actor`.`film_id` = '293'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '39')) OR ((`film_actor`.`actor_id` = '39') AND (`film_actor`.`film_id` > '293'))) AND (((`film_actor`.`actor_id` < '76')) OR ((`film_actor`.`actor_id` = '76') AND (`film_actor`.`film_id` < '234')) OR ((`film_actor`.`actor_id` = '76') AND (`film_actor`.`film_id` = '234'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '76')) OR ((`film_actor`.`actor_id` = '76') AND (`film_actor`.`film_id` > '234'))) AND (((`film_actor`.`actor_id` < '110')) OR ((`film_actor`.`actor_id` = '110') AND (`film_actor`.`film_id` < '513')) OR ((`film_actor`.`actor_id` = '110') AND (`film_actor`.`film_id` = '513'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '110')) OR ((`film_actor`.`actor_id` = '110') AND (`film_actor`.`film_id` > '513'))) AND (((`film_actor`.`actor_id` < '146')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` < '278')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` = '278'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '146')) OR ((`film_actor`.`actor_id` = '146') AND (`film_actor`.`film_id` > '278'))) AND (((`film_actor`.`actor_id` < '183')) OR ((`film_actor`.`actor_id` = '183') AND (`film_actor`.`film_id` < '862')) OR ((`film_actor`.`actor_id` = '183') AND (`film_actor`.`film_id` = '862'))));
UPDATE sakila.film_actor SET last_update = last_update + INTERVAL 6 HOUR WHERE ((((`film_actor`.`actor_id` > '183')) OR ((`film_actor`.`actor_id` = '183') AND (`film_actor`.`film_id` > '862'))) AND (((`film_actor`.`actor_id` < '200')) OR ((`film_actor`.`actor_id` = '200') AND (`film_actor`.`film_id` < '993')) OR ((`film_actor`.`actor_id` = '200') AND (`film_actor`.`film_id` = '993'))));

View the complete command to realize just how much more complex each query is, and how much more complex the chunking becomes. Here’s how I evaluate the chunk’s “next range end” variables:

select 
  actor_id, film_id 
from (
  select 
    actor_id, film_id 
  from 
    `sakila`.`film_actor` 
  where 
    (((`film_actor`.`actor_id` > @_split_column_variable_range_start_1)) OR ((`film_actor`.`actor_id` = @_split_column_variable_range_start_1) AND (`film_actor`.`film_id` > @_split_column_variable_range_start_2))) and (((`film_actor`.`actor_id` < @_split_column_variable_max_1)) OR ((`film_actor`.`actor_id` = @_split_column_variable_max_1) AND (`film_actor`.`film_id` < @_split_column_variable_max_2)) OR ((`film_actor`.`actor_id` = @_split_column_variable_max_1) AND (`film_actor`.`film_id` = @_split_column_variable_max_2)))
  order by 
    actor_id ASC, film_id ASC 
  limit 1000 
  ) sel_split_range  
order by 
  actor_id DESC, film_id DESC 
limit 1  
into @_split_column_variable_range_end_1, @_split_column_variable_range_end_2
;

By the way, you may recall that everything is done server side. The WHERE condition for the chunked queries is itself generated via an SQL statement, and not so much by programmatic logic. Here’s part of the query which computes the limiting condition:

  select
    group_concat('(', partial_comparison, ')' order by n separator ' OR ') as comparison
  from (
    select 
      n,
      group_concat('(', column_name, ' ', if(is_last, comparison_operator, '='), ' ', variable_name, ')' order by column_order separator ' AND ') as partial_comparison
    from (
      select 
        n, CONCAT(mysql_qualify(split_table_name), '.', mysql_qualify(column_name)) AS column_name,
        case split_variable_type
          when 'range_start' then range_start_variable_name
          when 'range_end' then range_end_variable_name
          when 'max' then max_variable_name
        end as variable_name,
        _split_column_names_table.column_order, _split_column_names_table.column_order = n as is_last 
      from 
        numbers, _split_column_names_table 
      where 
        n between _split_column_names_table.column_order and num_split_columns 
      order by n, _split_column_names_table.column_order
    ) s1
    group by n
  ) s2
  into return_value
  ;

There is a lot of complexity inside split, so as to provide the user with as clean a syntax as possible.

Table split(…) for the masses

(pun intended)

common_schema‘s new split statement (see release announcement) auto-splits complex queries over large tables into smaller ones: instead of issuing one huge query, split breaks one’s query into smaller queries, each working on a different set of rows (a chunk).

Thus, it is possible to avoid holding locks for long times, allowing for smaller transactions. It also makes for breathing space for the RDBMS, at times boosting operation speed, and at times prolonging operation time at will.

In this post I show how split exposes itself to the user, should the user so wish.

split can manage queries of the following forms:

  • DELETE FROM table_name [WHERE]…
  • DELETE FROM table_name USING <multi table syntax> [WHERE]…
  • UPDATE table_name SET … [WHERE]…
  • UPDATE <multiple tables> SET … [WHERE]…
  • INSERT INTO some_table SELECT … FROM <single or multiple tables> [WHERE]…
  • REPLACE INTO some_table SELECT … FROM <single or multiple tables> [WHERE]…
  • SELECT … FROM <multiple tables> [WHERE]…

The latter being a non-obvious one at first sight.

Basically, it’s automatic

You just say:

split (UPDATE sakila.inventory SET last_update = last_update + INTERVAL 6 HOUR)
  throttle 2;

And split identifies sakila.inventory as the table which needs to be split, and injects appropriate conditions so as to work on a subset of the rows, in multiple steps.

By the way, here’s how to execute a QueryScript code like the above.

But you can drive in manual mode

You can use the following syntax:

split (sakila.inventory)
{
  -- No action taken, but this block of code
  -- is executed per chunk of the table.
  -- I wonder what can be done here?
}

split provides magic variables, which you can use in the action block. These are:

  • $split_step: 1-based loop counter
  • $split_rowcount: number of rows affected in current chunk operation
  • $split_total_rowcount: total number of rows affected during this split statement
  • $split_total_elapsed_time: number of seconds elapsed since beginning of this split operation.
  • $split_clause: the magic variable: the filtering condition limiting rows to current chunk.
  • $split_table_schema: the explicit or inferred schema of split table
  • $split_table_name: the explicit or inferred table being split

To illustrate, consider the following script:

split (sakila.inventory)
{
  select $split_step as step, $split_clause as clause;
}

The output is this:

+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| step | clause                                                                                                                                                                    |
+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|    1 | ((((`inventory`.`inventory_id` > '1')) OR ((`inventory`.`inventory_id` = '1'))) AND (((`inventory`.`inventory_id` < '1000')) OR ((`inventory`.`inventory_id` = '1000')))) |
+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

+------+--------------------------------------------------------------------------------------------------------------------------------------+
| step | clause                                                                                                                               |
+------+--------------------------------------------------------------------------------------------------------------------------------------+
|    2 | ((((`inventory`.`inventory_id` > '1000'))) AND (((`inventory`.`inventory_id` < '2000')) OR ((`inventory`.`inventory_id` = '2000')))) |
+------+--------------------------------------------------------------------------------------------------------------------------------------+

+------+--------------------------------------------------------------------------------------------------------------------------------------+
| step | clause                                                                                                                               |
+------+--------------------------------------------------------------------------------------------------------------------------------------+
|    3 | ((((`inventory`.`inventory_id` > '2000'))) AND (((`inventory`.`inventory_id` < '3000')) OR ((`inventory`.`inventory_id` = '3000')))) |
+------+--------------------------------------------------------------------------------------------------------------------------------------+

+------+--------------------------------------------------------------------------------------------------------------------------------------+
| step | clause                                                                                                                               |
+------+--------------------------------------------------------------------------------------------------------------------------------------+
|    4 | ((((`inventory`.`inventory_id` > '3000'))) AND (((`inventory`.`inventory_id` < '4000')) OR ((`inventory`.`inventory_id` = '4000')))) |
+------+--------------------------------------------------------------------------------------------------------------------------------------+

+------+--------------------------------------------------------------------------------------------------------------------------------------+
| step | clause                                                                                                                               |
+------+--------------------------------------------------------------------------------------------------------------------------------------+
|    5 | ((((`inventory`.`inventory_id` > '4000'))) AND (((`inventory`.`inventory_id` < '4581')) OR ((`inventory`.`inventory_id` = '4581')))) |
+------+--------------------------------------------------------------------------------------------------------------------------------------+

So you can get yourself a nice present: the SQL clause which filters the distinct chunks.

A simple demo: what can the user do with “manual mode”?

Normally, I would expect the user to use the automated version of split. Let it do the hard work! But sometimes, you may wish to take control into your own hands.

Consider an example: I wish to export a table into a CSV file, but in chunks. pt-archiver does that. But it is also easily achievable with split:

split (sakila.inventory) {
  var $file_name := QUOTE(CONCAT('/tmp/inventory_chunk_', $split_step, '.csv'));
  select * from sakila.inventory WHERE :${split_clause} INTO OUTFILE :${file_name};
}

This script uses the powerful variable expansion feature of QueryScript: it extracts the text behind :${split_clause} and plants it as part of the query. It does the same for :${file_name}, making a variable possible where MySQL would normally disallow one (the INTO OUTFILE clause only accepts a constant string).

What do we get as result?

bash:/tmp$ ls -s1 inventory_chunk_*
32 inventory_chunk_1.csv
32 inventory_chunk_2.csv
32 inventory_chunk_3.csv
32 inventory_chunk_4.csv
20 inventory_chunk_5.csv

Conclusion

During the past months, and even as I developed split for QueryScript, I found myself using it more and more for my own purposes. As it evolved I realized how much simpler it makes these complex operations. Heck, it beats oak-chunk-update in its ease of use. They both have their place, but split is so much more intuitive and easy to write. And, no external scripts, no package dependencies.

I suggest that split is a major tool for server-side scripting, server maintenance, and developer operations. Check it out!

Auto caching INFORMATION_SCHEMA tables: seeking input

The short version

I have it all working. It’s kind of magic. But there are issues, and I’m not sure it should even exist, and am looking for input.

The long version

In Auto caching tables I presented a hack which allows getting cached or fresh results via simple SELECT queries.

The drive for the above hack was INFORMATION_SCHEMA tables. There are two major problems with INFORMATION_SCHEMA:

  1. Queries on schema-oriented tables such as TABLES, COLUMNS, STATISTICS, etc. are heavyweight. How heavyweight? Enough to lock down your database. Enough to crash your database in some cases.
  2. The data is always generated on-the-fly, as you request it. Query the COLUMNS table twice, and risk two lockdowns of your database.

The auto-cache mechanism solves issue #2. I have it working, time based. I have an auto-cache table for each of the INFORMATION_SCHEMA heavyweight tables. Say, every 30 minutes the cache is invalidated. Throughout those 30 minutes, you get a free pass!

The auto-cache mechanism also paves the road to solving issue #1: since it works by invoking a stored routine, I have better control of the way I read INFORMATION_SCHEMA. Thus, I can take advantage of INFORMATION_SCHEMA optimizations. It’s tedious, but not complicated.

For example, if I want to cache the TABLES table, I don’t necessarily read the entire TABLES data in one read. Instead, I can iterate the schemata, get a list of table names per schema, then read full row data for these, table by table. The result? Many, many more SELECTs, but more optimized, and no one-big-lock-it-all query.
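In terms of actual queries, that iteration looks roughly like this (a sketch; schema and table names are examples):

-- 1. list the schemata:
SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA;

-- 2. per schema, list table names only:
SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA='sakila';

-- 3. per table, read the full row:
SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA='sakila' AND TABLE_NAME='rental';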

And the problem is…

I have two burning problems.

  1. INFORMATION_SCHEMA optimization only goes so far. It sometimes does not work. In particular, I’ve noticed that if you have a view which relies on another view (possibly relying on yet another view), things get out of hand. I author a monitoring tool for MySQL called mycheckpoint. It uses some fancy techniques for generating aggregated data, HTML and charts, by means of nested views. There are a few views there I can never query for in COLUMNS. It just crashes my server. Repeatedly. And it’s a good machine with good configuration. Make that 5 machines. They all crash, repeatedly. I just can’t trust INFORMATION_SCHEMA!
  2. Replication: any caching table is bound to replicate. Does it make any sense to replicate a cache of internal metadata? Does it make sense to query the cached table on a slave, and have it answer for the master’s data? With plain old INFORMATION_SCHEMA, every server is on its own. Caching kind of works against this. Or is it fair enough, since we would usually expect master/slaves to reflect the same schema structure?

I would feel much better if I could read SHOW statements with a SELECT query. Though I’ve found this nice hack, it can’t work from a stored function, only via a stored procedure. So it can’t be used from within a SELECT query. I’ve been banging my head against this for months now; I think I’ve given up on this one.

Any insights are welcome!

INFORMATION_SCHEMA Optimizations: still crashing my servers

[Update: need to take more breaks: now NOT crashing my servers! See clarifications below]

INFORMATION_SCHEMA Optimizations are meant to make your INFORMATION_SCHEMA queries lighter and safer.

For example, if you’re going to query the COLUMNS table for just the columns of a single table, then the following:

SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA='sakila' AND TABLE_NAME='rental'

makes for an optimization: specifying a literal on TABLE_SCHEMA avoids scanning the directories of other schemata. Specifying a literal on TABLE_NAME avoids checking up on other tables. So it’s a one-schema-one-table read operation, as opposed to "first read every single column from every schema and table, then return only those I’m interested in".

Here’s the execution plan for the above query:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: COLUMNS
         type: ALL
possible_keys: NULL
          key: TABLE_SCHEMA,TABLE_NAME
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Using where; Open_frm_only; Scanned 0 databases

What I tried to do was to read the entire COLUMNS table, one schema at a time, one table at a time. I’m fine with this taking a longer time.

I have a production system on which reads from COLUMNS consistently crash the servers. Well, one read at a time can’t do harm, right?

Unfortunately, as the title of this post reveals, even a sequential read of COLUMNS using INFORMATION_SCHEMA optimizations does not help: a minute into the process, the client lost its connection. The server had crashed.

I was expecting that table locks would be released, buffers released, etc. One table at a time, there wouldn’t be a congestion of locks, reads, table cache suffocation, etc.

I was actually having high hopes for this to succeed. I have to find a way in which INFORMATION_SCHEMA tables are not dangerous.

A few hours later, and I have both conclusions and achievements.

There are indeed memory issues with querying from INFORMATION_SCHEMA tables. I’ve found that VARCHAR(64) columns can consume 64K each: I was reading from large tables of more than 1,000 columns each, while monitoring MySQL’s memory consumption. By dividing the increase in memory by the number of rows resulting from a query I sent, which selected one single column, I got an almost exact 64K value per row.

So a query on INFORMATION_SCHEMA consumes much more memory than it should. The good news is that this memory is released once the query terminates. So there is no leak into the session memory.

This was combined with a mistake of mine in the way I iterated the tables, such that the problem was amplified: I happened to query much more than I needed, and so got my query’s memory bloated. That is to say, I used the INFORMATION_SCHEMA optimizations only partly right, and so got only part of the savings they could offer me.

With better pinpointing I’m now actually able to read from COLUMNS, without crashing my servers, consistently.

I will look further into the 64K issue. That in itself still drains a lot of memory: on my mycheckpoint schema tables, a single table read means > 64MB of query memory down the drain.

More MySQL foreach()

In my previous post I’ve shown several generic use cases for foreach(), a new scripting functionality introduced in common_schema.

In this part I present a DBA’s handy syntax for schema and table operations and maintenance.

Confession: while I love INFORMATION_SCHEMA‘s power, I just hate writing queries against it. It’s just so much typing! Just getting the list of tables in a schema makes for this heavy duty query:

SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA='sakila' AND TABLE_TYPE='BASE TABLE';

When a join is involved this really becomes a nightmare. I think it’s cumbersome, and as a result, many do not remember the names and meanings of columns, making for "oh, I need to read the manual all over again just to get that query right". Anyway, that’s my opinion.

A SHOW TABLES statement is easier to type, but cannot be integrated into a SELECT query (though we have a partial solution for that, too), and besides, when filtering out the views, the SHOW statement becomes almost as cumbersome as the one on INFORMATION_SCHEMA.

Which is why foreach() offers handy shortcuts to common iterations on schemata and tables, as follows:

Use case: iterate all databases

call foreach('schema', 'CREATE TABLE ${schema}.event(event_id INT, msg VARCHAR(128))');

In the above we execute a query on each database. Hmmm, maybe not such a good idea to perform this operation on all databases? Let’s filter them:

Use case: iterate databases by name match

call foreach('schema like wordpress_%', 'ALTER TABLE ${schema}.wp_posts MODIFY COLUMN comment_author VARCHAR(96) NOT NULL');

The above will only iterate my WordPress databases (I have several of these), performing an ALTER on wp_posts for each of those databases.

I don’t have to quote the like expression, but I can, if I wish to.

I can also use a regular expression match:

call foreach('schema ~ /^wordpress_[0-9]+$/', 'ALTER TABLE ${schema}.wp_posts MODIFY COLUMN comment_author VARCHAR(96) NOT NULL');

Use case: iterate tables in a specific schema

Time to upgrade our sakila tables to InnoDB’s compressed format. We use $(), a synonym for foreach().

call $('table in sakila', 'ALTER TABLE ${schema}.${table} ENGINE=InnoDB ROW_FORMAT=COMPRESSED');

The above will iterate on tables in sakila. I say tables, since it will avoid iterating views (there is still no specific syntax for views iteration). This is done on purpose, as my experience shows there is very little in common between tables and views when it comes to maintenance and operations.

Use case: iterate tables by name match

Here’s an interesting scenario: you wish to work on all tables matching some name. The naive approach would be to:

SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'wp_posts' AND TABLE_TYPE = 'BASE TABLE'

Wait! Are you aware this may bring your server down? This query will open all databases at once, opening all .frm files (though thankfully not data files, since we only check for name and type).

Here’s a better approach:

call foreach('table like wp_posts', 'ALTER TABLE ${schema}.${table} ENGINE=InnoDB');

(FULLTEXT is coming to InnoDB, so the above can make sense in the near future!)

The good part is that foreach() will look for matching tables one database at a time. It will iterate the list of databases, then look for matching tables per database, thereby optimizing the query on INFORMATION_SCHEMA, roughly as sketched below.
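Per database, the lookup then becomes the cheap, literal-bound form (a sketch, not foreach()’s exact internal query; the schema name stands in for each database in turn):

SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA='wordpress_main' AND TABLE_NAME LIKE 'wp_posts' AND TABLE_TYPE='BASE TABLE';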

Here, too, I can use regular expressions:

call $('table ~ /^wp_.*$/', 'ALTER TABLE ${schema}.${table} ENGINE=InnoDB');

Conclusion

This is work in the making but, as someone who maintains a few production servers, I’ve already put it to work.

I’m hoping the syntax is easy to comprehend. I know that, since I developed it, it must be far more intuitive to me than to others. I’ve tried to keep close to common syntax and concepts from various programming languages.

I would like to get as much feedback as possible. I have further ideas and thoughts on the direction common_schema is taking, but wish to take it in small steps. Your feedback is appreciated!

Reading results of SHOW statements, on server side

SHOW statements are show stoppers on server side. While clients can get a SHOW statement as a result set just as any normal SELECT, things are not as such on server side.

On server side, that is, from within MySQL itself, one cannot:

SELECT `Database` FROM (SHOW DATABASES);

One cannot:

DECLARE show_cursor CURSOR FOR SHOW TABLES;

One cannot:

SHOW TABLES INTO OUTFILE '/tmp/my_file.txt';

So it is impossible to get the results with a query; impossible to get the results from a stored routine; impossible to get the results by file reading…

Bwahaha! A hack!

For some SHOW statements, there is a way around this. I’ve been banging my head against the wall for weeks now on this. Now I have a partial solution: I’m able to read SHOW output for several SHOW statements. Namely, those SHOW statements which allow a LIKE or a WHERE clause.

For example, most are familiar with the following syntax:

USE mysql;
SHOW TABLE STATUS LIKE 'user';

However, not so many know that any SHOW statement which accepts LIKE can also accept WHERE:

SHOW TABLE STATUS WHERE Name='user'\G
*************************** 1. row ***************************
           Name: user
         Engine: MyISAM
        Version: 10
     Row_format: Dynamic
           Rows: 17
 Avg_row_length: 69
    Data_length: 1184
Max_data_length: 281474976710655
   Index_length: 2048
      Data_free: 0
 Auto_increment: NULL
    Create_time: 2010-10-03 08:23:48
    Update_time: 2011-07-30 19:31:00
     Check_time: NULL
      Collation: utf8_bin
       Checksum: NULL
 Create_options:
        Comment: Users and global privileges

It’s not just about “Name“. I can filter using any column I like:

SHOW TABLE STATUS WHERE Rows > 1000;
SHOW TABLE STATUS WHERE Rows > 1000 AND Index_length > 65536;

etc.

If you’ve been to my talk on Programmatic Queries: things you can code with SQL, you have a good guess as to where I’m taking this.

Where there’s WHERE, there’s code

I can write code within the WHERE clause. Specifically, I can work with user defined variables. Shall we cut to the point and provide an example?

mysql> SET @databases := '';

mysql> SHOW DATABASES WHERE (@databases := CONCAT(@databases, `Database`, ',')) IS NULL;

mysql> SELECT @databases;
+-------------------------------------------------------------------+
| @databases                                                        |
+-------------------------------------------------------------------+
| information_schema,common_schema,mycheckpoint,mysql,sakila,world, |
+-------------------------------------------------------------------+

Let’s discuss the above. We:

  • Set a user variable called @databases to an empty string
  • Iterate through the SHOW DATABASES rowset. The WHERE clause is always false (the expression is in fact NOT NULL for all rows), so rows are not printed out, and we get an empty result set (we’re not really interested in a result set here, since there’s no way to read it anyhow).
  • However we do take care to “remember” each value we visit, by concatenating the `Database` column value.
  • We end up with a delimited string of database names. You’ll forgive the trailing ‘,’; this is just a simple example, it is of no importance.

Further notes

What can we do with the concatenated list of database names? Whatever we want to. We can parse it again, INSERT it INTO some table, save to file, iterate, what have you!

We can wrap the above in a stored routine. Alas, not in a stored function, since the SHOW command, although it returns an empty result set, does return a result set, which is not allowed within functions.
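For completeness, here is what the procedure wrapping might look like (a sketch; the procedure name is made up):

DELIMITER $$

CREATE PROCEDURE show_databases_concat()
BEGIN
  SET @databases := '';
  -- empty result set; the side effect accumulates names into @databases:
  SHOW DATABASES WHERE (@databases := CONCAT(@databases, `Database`, ',')) IS NULL;
  SELECT @databases AS databases;
END $$

DELIMITER ;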

Limitations

  • Sadly, SHOW SLAVE STATUS, SHOW MASTER LOGS etc., do not support LIKE or WHERE syntax. Bummer.
  • Stored functions, as just mentioned, cannot utilize this hack. Hey, I’m still working on this!

To what use?

Originally I wanted to avoid the time & locking it takes for INFORMATION_SCHEMA queries, such as on TABLES, COLUMNS, etc. Ironically, just a few days later I found another interesting solution (well, two, actually) for managing reads from INFORMATION_SCHEMA with less overhead than in normal use. I’ll talk about that another time; I’m about to use this in common_schema.

Further notes

I met Roland in London, and he liked the solution. As Baron joined, Roland said: “Baron, do you know Shlomi devised a method to read the output of SHOW commands?”

And Baron said: “Without using files? Then a SHOW statement can have a WHERE clause, in which case you can use a variable”, and went on looking for his wife.

And we remained speechless.

[UPDATE: I’ve manually changed the timestamp of this post due to a failure in its aggregation on planet.mysql, a major source of incoming traffic to this site]
