common_schema: 1.3: security goodies, parameterized split(), json-to-xml, query checksum

common_schema 1.3 is released and is available for download. New and noteworthy in this version:

  • Parameterized split(): take further control over huge transactions by breaking them down into smaller chunks, now manually tunable if needed
  • duplicate_grantee(): copy+paste existing accounts along with their full set of privileges
  • similar_grants: find which accounts share the exact same set of privileges (i.e. have the same role)
  • json_to_xml(): translate any valid JSON object into its equivalent XML form
  • extract_json_value(): use XPath notation to extract info from JSON data, just as you would from XML
  • query_checksum(): given a query, calculate a checksum on the result set
  • random_hash(): get a 40 hexadecimal digits random hash, using a reasonably large changing input

Let’s take a closer look at the above:

Parameterized split()

split takes your bulk query and automagically breaks it down into smaller pieces. So instead of one huge UPDATE or DELETE or INSERT..SELECT transaction, you get many smaller transactions, each with smaller impact on I/O, locks, CPU.

As of 1.3, split() gets more exposed: you can have some control on its execution, and you also get a lot of very interesting info during operation.

Here’s an example of split() control:

set @script := "
  split({start:7015, step:2000} : UPDATE sakila.rental SET return_date = return_date + INTERVAL 1 DAY) 
    throttle 1;
";
call common_schema.run(@script);

In the above we choose a split size of 2,000 rows at a time; but we also choose to only start with 7015, skipping all rows prior to that value. Just what is that value? It depends on the splitting key (and see next example for just that); but in this table we can safely assume this is the rental_id PRIMARY KEY of the table.

You don’t have to use these control parameters. But they can save you some time and effort.

And, look at some interesting info about the splitting process:

set @script := "
  split(sakila.film_actor) 
    select $split_columns as columns, $split_range_start as range_start, $split_range_end as range_end
";
call common_schema.run(@script);
+----------------------+-------------+------------+
| columns              | range_start | range_end  |
+----------------------+-------------+------------+
| `actor_id`,`film_id` | '1','1'     | '39','293' |
+----------------------+-------------+------------+

+----------------------+-------------+------------+
| columns              | range_start | range_end  |
+----------------------+-------------+------------+
| `actor_id`,`film_id` | '39','293'  | '76','234' |
+----------------------+-------------+------------+

+----------------------+-------------+-------------+
| columns              | range_start | range_end   |
+----------------------+-------------+-------------+
| `actor_id`,`film_id` | '76','234'  | '110','513' |
+----------------------+-------------+-------------+

+----------------------+-------------+-------------+
| columns              | range_start | range_end   |
+----------------------+-------------+-------------+
| `actor_id`,`film_id` | '110','513' | '146','278' |
+----------------------+-------------+-------------+

+----------------------+-------------+-------------+
| columns              | range_start | range_end   |
+----------------------+-------------+-------------+
| `actor_id`,`film_id` | '146','278' | '183','862' |
+----------------------+-------------+-------------+

+----------------------+-------------+-------------+
| columns              | range_start | range_end   |
+----------------------+-------------+-------------+
| `actor_id`,`film_id` | '183','862' | '200','993' |
+----------------------+-------------+-------------+

In the above you get to be told exactly how table splitting occurs: you are being told what columns are used to split the table, and what range of values is used in each step. There’s more to it: read the split() documentation.

similar_grants

Out of your 100 different grants, which ones share the exact same set of privileges? MySQL has non notion of roles, but that doesn’t mean the notion does not exist. Multiple accounts share the same restrictions and privileges. Use similar_grants to find out which. You might just realize there’s a few redundant accounts in your system.

mysql> SELECT * FROM similar_grants;
+-------------------------------+----------------+-------------------------------------------------------+
| sample_grantee                | count_grantees | similar_grantees                                      |
+-------------------------------+----------------+-------------------------------------------------------+
| 'root'@'127.0.0.1'            |              3 | 'root'@'127.0.0.1','root'@'myhost','root'@'localhost' |
| 'repl'@'10.%'                 |              2 | 'repl'@'10.%','replication'@'10.0.0.%'                |
| 'apps'@'%'                    |              1 | 'apps'@'%'                                            |
| 'gromit'@'localhost'          |              1 | 'gromit'@'localhost'                                  |
| 'monitoring_user'@'localhost' |              1 | 'monitoring_user'@'localhost'                         |
+-------------------------------+----------------+-------------------------------------------------------+

duplicate_grantee()

Provide an existing account, and name your new, exact duplicate account. The complete set of privileges is copied, and so is the password. duplicate_grantee() is your Copy+Paste of MySQL accounts.

Let’s begin with some pre-existing account and see how it duplicates:

mysql> show grants for 'world_user'@'localhost';
+------------------------------------------------------------------------------------------------------------------------+
| Grants for world_user@localhost                                                                                        |
+------------------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO 'world_user'@'localhost' IDENTIFIED BY PASSWORD '*6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9'      |
| GRANT ALL PRIVILEGES ON `world`.* TO 'world_user'@'localhost'                                                          |
| GRANT EXECUTE, ALTER ROUTINE ON FUNCTION `sakila`.`get_customer_balance` TO 'world_user'@'localhost' WITH GRANT OPTION |
+------------------------------------------------------------------------------------------------------------------------+

mysql> call duplicate_grantee('world_user@localhost', 'copied_user@10.0.0.%');
Query OK, 0 rows affected (0.06 sec)

mysql> show grants for 'copied_user'@'10.0.0.%';
+------------------------------------------------------------------------------------------------------------------------+
| Grants for copied_user@10.0.0.%                                                                                        |
+------------------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO 'copied_user'@'10.0.0.%' IDENTIFIED BY PASSWORD '*6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9'      |
| GRANT ALL PRIVILEGES ON `world`.* TO 'copied_user'@'10.0.0.%'                                                          |
| GRANT EXECUTE, ALTER ROUTINE ON FUNCTION `sakila`.`get_customer_balance` TO 'copied_user'@'10.0.0.%' WITH GRANT OPTION |
+------------------------------------------------------------------------------------------------------------------------+

The routine is quite relaxed in grantee format. copied_user@10.0.0.%, copied_user@’10.0.0.%’ and ‘copied_user’@’10.0.0.%’ are all just fine, and represent the same account. Saves trouble with all that quoting.

json_to_xml()

JSON is becoming increasingly popular in storing dynamically-structured data. XML’s tags overhead and its human unfriendliness make it less popular today. However, the two share similar concepts, and conversion between the two is possible. json_to_xml() will translate your valid JSON data into its equivalent XML format. The rules are simple (all-nodes-and-data, no attributes, arrays as repeating nodes, objects as subnodes) and the results are valid XML objects.

Sample data taken from json.org:

mysql> SET @json := '
{
  "menu": {
    "id": "file",
    "value": "File",
    "popup": {
      "menuitem": [
        {"value": "New", "onclick": "CreateNewDoc()"},
        {"value": "Open", "onclick": "OpenDoc()"},
        {"value": "Close", "onclick": "CloseDoc()"}
      ]
    }
  }
}
';

mysql> SELECT json_to_xml(@json) AS xml \G
*************************** 1. row ***************************
xml: <menu><id>file</id><value>File</value><popup><menuitem><value>New</value><onclick>CreateNewDoc()</onclick></menuitem><menuitem><value>Open</value><onclick>OpenDoc()</onclick></menuitem><menuitem><value>Close</value><onclick>CloseDoc()</onclick></menuitem></popup></menu>

Beautified form of the above result:

<menu>
  <id>file</id>
  <value>File</value>
  <popup>
    <menuitem>
      <value>New</value>
      <onclick>CreateNewDoc()</onclick>
    </menuitem>
    <menuitem>
      <value>Open</value>
      <onclick>OpenDoc()</onclick>
    </menuitem>
    <menuitem>
      <value>Close</value>
      <onclick>CloseDoc()</onclick>
    </menuitem>
  </popup>
</menu>

Note that linked examples page uses sporadically invented attributes; common_schema prefers using well-defined nodes.

extract_json_value()

Which means things you can do with XML can also be done with JSON. XPath is a popular extraction DSL, working not only for XML but also for Object Oriented structures (see Groovy’s nice integration of XPath into the language, or just commons-beans for conservative approach). JSON is a perfect data store for XPath expressions; by utilizing the translation between JSON and XML, one is now easily able to extract value from JSON (using same example as above):

mysql> SELECT extract_json_value(@json, '//id') AS result;
+--------+
| result |
+--------+
| file   |
+--------+

mysql> SELECT extract_json_value(@json, 'count(/menu/popup/menuitem)') AS count_items;
+-------------+
| count_items |
+-------------+
| 3           |
+-------------+

Implementations of json_to_xml() and extract_json_value() are CPU intensive. There is really just one justification for having these written in Stored Procedures: their lack in the standard MySQL function library. This is reason enough. Just be aware; test with BENCHMARK().

query_checksum()

It looks like this:

mysql> call query_checksum('select id from world.City where id in (select capital from world.Country) order by id');
+----------------------------------+
| checksum                         |
+----------------------------------+
| 5f35070b90b0c079ba692048c51a89fe |
+----------------------------------+

mysql> call query_checksum('select capital from world.Country where capital is not null order by capital');
+----------------------------------+
| checksum                         |
+----------------------------------+
| 5f35070b90b0c079ba692048c51a89fe |
+----------------------------------+

The two queries above yield with the same result set. As consequence, query_checksum() produces the same checksum value for both. The next query produces a different result set, hence a different checksum:

mysql> call query_checksum('select id from world.City where id in (select capital from world.Country) order by id limit 10');
+----------------------------------+
| checksum                         |
+----------------------------------+
| 997079c2dfca34ba87ae44ed8965276e |
+----------------------------------+

The routine actually invokes the given queries (modifying them a bit along the way) and uses a deterministic incremental checksum to get the final result.

Its use? As a handy built-in mechanism for comparing your table data. This is meant for relatively small result sets – not for your 20GB table. Inspired by Baron’s old trick, and works on server side (Windows/GUI/automated clients to benefit).

random_hash()

Random hashes come handy. The naive way to produce them is by executing something like SELECT SHA1(RAND()). However the RAND() function just doesn’t provide enough plaintext for the hash function. The SHA/MD5 functions expect a textual input, and produce a 160/128 bit long hash. The maximum char length of a RAND() result is 20 characters or so, and these are limited to the 0-9 digits. So at about 10^20 options for input, which is about 64 bit. Hmmmm. a 64 bit input to generate a 160 bit output? I don’t think so! random_hash() provides additional input in the form of your current status (at about 830 characters) as well as RAND(), SYSDATE() and server ID.

Bugfixes

Any bugfix adds at least one test; typically more. Currently with over 470 tests, common_schema is built to work.

Get common_schema

common_schema 1.3 is available under the permissive New BSD License. Find the latest download here.

If you like to support common_schema, I’m always open for ideas and contributions. Or you can just spread the word!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.