QueryScript – code.openark.org http://shlomi-noach.github.io/blog/ Blog by Shlomi Noach Tue, 27 Dec 2016 16:45:41 +0000 en-US hourly 1 https://wordpress.org/?v=5.3.3 32412571 Why delegating code to MySQL Stored Routines is poor engineering practice https://shlomi-noach.github.io/blog/mysql/why-delegating-code-to-mysql-stored-routines-is-poor-engineering-practice https://shlomi-noach.github.io/blog/mysql/why-delegating-code-to-mysql-stored-routines-is-poor-engineering-practice#comments Thu, 06 Feb 2014 08:32:17 +0000 https://shlomi-noach.github.io/blog/?p=6713 I happen to use stored routines with MySQL. In fact, my open source project common_schema heavily utilizes them. DBA-wise, I think they provide with a lot of power (alas, the ANSI:SQL 2003 syntax feels more like COBOL than a sane programming language, which is why I use QueryScript instead).

However I wish to discuss the use of stored routines as integral part of your application code, which I discourage.

The common discussion on whether to use or not use stored routines typically revolves around data transfer (with stored routines you transfer less data since it’s being processed on server side), security (with stored routines you can obfuscate/hide internal datasets, and provide with limited and expected API) and performance (with MySQL this is not what you would expect, as routines are interpreted & their queries re-evaluated, as opposed to other RDBMS you may be used to).

But I wish to discuss the use of stored routines from an engineering standpoint. The first couple of points I raise are cultural/behavioural.

2nd grade citizens

Your stored routines are not likely to integrate well with your IDE. While your Java/Scala/PHP/Ruby/whatnot code comfortably lies within your home directory, the stored routines live in their own space: a database container. They’re not as visible to you as your standard code. Your IDE is unaware of their existence and is unlikely to have the necessary plugin/state of mind to be able to view these.

This leads to difficulty in maintaining the code. People typically resort to using some SQL-oriented GUI tool such as MySQL Workbench, SequelPro or other, commercial tools. But these tools, while make it easy to edit your routine code, do not integrate (well?) with your source control. I can’t say I’ve used all GUI tools; but how many of them will have Git/SVN/Mercurial connectors? How many of them will keep local history changes once you edit a routine? I’m happy to get introduced to such a tool.

Even with such integration, you’re split between two IDEs. And if you’re the command line enthusiast, well, you can’t just svn ci -m “fixed my stored procedure bug”. Your code is simply not in your trunk directory.

It can be done. You could maintain the entire routine code from within your source tree, and hats off to all those who do it. Most will not. See later on about deployments for more on this.

Testing

While engineers are keen on writing unit tests for every class and method they create, they are less keen on doing the same for stored routines. This is an observation, having seen many instalments. And I can tell you why: your stored routine testing will not integrate well with your JUnit/PHPUnit/…

There are testing frameworks for databases, and indeed I hacked my own mini unit testing code with common_schema. But it’s a different testing framework. You might also have realized by now that testing databases is somewhat different. It can be done, and hats off again to those that implement it as common practice. Many don’t. Database are often more heavyweight to test. Not all operations done by routines are easily rolled back, which leads to having to rebuild the entire dataset before tests. This in itself leads to longer test periods and a need for multiple test databases so as to allow for concurrent builds.

How many companies practice both version control and unit testing over their routine code? I believe not many (and am happy to hear about those who do). To be more direct, of all the companies I ever consulted to: I have never seen one that does both.

Debugging

MySQL stored routines do not have built in debugging capabilities. To debug your routines, you will have to use one of two methods:

  • Simulate your routine code (ie mimic their execution on top of some interpreter). There are tools to do that. For me this is a complete NO GO and utterly untrustworthy. You can mimic what you think is how the routine should behave, but never they full behaviour. While developing common_schema I came upon plenty weird behaviour, some of it bugs, that you just can’t build into your emulation.
  • Inject debugging code into your routine code. I do that with RDebug. You can do breakpoints, step into, step out, most of the interesting stuff. Other tools do that as well. It is not the right way to go: you’re essentially modifying your code, placing more locks, communicating, and losing some functionality. It is a necessary evil solution for a necessary evil programming method… How good can that be?

The right way to go would be to have debugging API built into the MySQL server.

But, wait, that would still be next to worthless, since our discussion is over programming with stored routines: letting your application call upon stored routines in your database. Until the day where I could use my IntelliJ debugger to step from my java method which calls upon a stored procedure, and into the stored procedure itself, debugging your code is completely detached from your stored routine debugging.

Refactoring & deploying

Say you wanted to add a column to your table: you would go ahead and add it, and perhaps populate it. You would then modify your application code to support this new column, and deploy. Say you wanted to drop a table column. You would first deploy changes to your application code that ignore said column, and once the code is in place you would go and actually make the DROP.

How do you do the same with a stored routine? Support your routine accepts two parameters, and you wish to add a third?

There is no support for optional parameters. Your routine either accepts two parameters or three. Your application code will have to provide the exact number of parameters. You will have to deploy both your SQL changes and your application changes at the same time. This is by definition impossible, unless you are OK with a stop the world approach, which is unlikely in production.

Code constraints

One solution to the above is to create a new routines. Somehow “overload” it. But you can’t overload a stored routine; you’ll have to create a routine by a new name. This will allow you to slowly and smoothly migrate between the two.

Ahem, smoothly? How easy is it to find all invocations of a certain routines from your code? It will be typically lie in some String, or within some XML config file. There is no safe “find references to this procedure” IDE mechanism. There is no constraint in your IDE that will tell you “there is no such procedure” if you misspell the name.

Trash bin

Suppose you overcame the above. You now have two routines. You need to remember to DROP the old one, right? Will you?

When presenting common_schema, a common question I ask the audience is as follows:

Suppose I accessed your database and listed the entire set of stored functions and procedures. How many of them are you not even sure are in use anymore? How many of them you think you can DROP, but are too afraid to, and keep them in just in case?

I wouldn’t commonly ask that question had it not always provides a common nodding and smiling in the audience. People forget to drop their routines, and then forget about them, and are never sure whether they are used (your IDE doesn’t easily tell you that, remember? Sure, you can grep around; that’s not what most engineers would do). And those routines pile up to become trash.

Data or code?

Last but not least: a stored routine is a piece of code, right? Well, as far as the database is concerned, it’s really a piece of data. It’s located within a schema. It’s stored. It is an integral part of your data set: when you back up your data, you’re most likely to backup the code as well. When you restore, you’re likely to restore both. There are obvious advantages to that, DB-wise. Or should I say, DBA-wise. Engineering-wise? Does a database-restore operation count as code deployment? We can argue over beer.

Final notes

Having said all that: yes, I’m using an occasional stored routine. I see these occasions as a necessary evil, and sometimes it’s just the correct solution.

I’m happy to know what methods have been developed out there to overcome the above, please share; and please feel free to contradict the above.

]]>
https://shlomi-noach.github.io/blog/mysql/why-delegating-code-to-mysql-stored-routines-is-poor-engineering-practice/feed 11 6713
Converting an OLAP database to TokuDB, part 1 https://shlomi-noach.github.io/blog/mysql/converting-an-olap-database-to-tokudb-part-1 https://shlomi-noach.github.io/blog/mysql/converting-an-olap-database-to-tokudb-part-1#comments Tue, 03 Sep 2013 07:04:12 +0000 https://shlomi-noach.github.io/blog/?p=6473 This is the first in a series of posts describing my impressions of converting a large OLAP server to TokuDB. There’s a lot to tell, and the experiment is not yet complete, so this is an ongoing blogging. In this post I will describe the case at hand and out initial reasons for looking at TokuDB.

Disclosure: I have no personal interests and no company interests; we did get friendly, useful and free advice from Tokutek engineers. TokuDB is open source and free to use, though commercial license is also available.

The case at hand

We have a large and fast growing DWH MySQL setup. This data warehouse is but one component in a larger data setup, which includes Hadoop, Cassandra and more. For online dashboards and most reports, MySQL is our service. We populate this warehouse mainly via Hive/Hadoop. Thus, we have an hourly load of data from Hive, as well as a larger daily load.

There are some updates on the data, but the majority of writes are just mysqlimports of Hive queries.

Usage of this database is OLAP: no concurrency issues here; we have some should-be-fast-running queries issued by our dashboards, as well as ok-to-run-longer queries issued for reports.

Our initial and most burning trouble is with size. Today we use COMPRESSED InnoDB tables (KEY_BLOCK_SIZE is default, i.e. 8). Our data volume sums right now at about 2TB. I happen to know this translates as 4TB of uncompressed data.

However growth of data is accelerating. A year ago we would capture a dozen GB per month. Today it is a 100GB per month, and by the end of this year it may climb to 150GB per month or more.

Our data is not sharded. We have a simple replication topology of some 6 servers. Machines are quite generous as detailed following. And yet, we will be running out of resources shortly: disk space (total 2.7TB) is now running low and is expected to run out in about six months. One of my first tasks in Outbrain is to find a solution to our DWH growth problem. The solution could be sharding; it could be a commercial DWH product; anything that works.

The approach we experiment with

It was at my initial interview that I suggested TokuDB might be a good solution, with the primary reason of being so good with compression. And we decided to experiment with this simple (setup-wise) solution of compression. If we could compress the data even by 50%, that would buy us considerable time. And it’s the simplest approach as we would need to change nothing at the application side, nor add additional frameworks.

Of course, we were already using InnoDB COMPRESSED tables. How about just improving the compression? And here I thought to myself: we can try KEY_BLOCK_SIZE=4, which I know would generally compress by 50% as compared to KEY_BLOCK_SIZE=8 (not always, but in many use cases). We’re already using InnoDB so this isn’t a new beast; it will be “more of the same”. It would work.

I got myself a dedicated machine: a slave in our production topology I am free to play with. I installed TokuDB 7.0.1, later upgraded to 7.0.3, based on MySQL 5.5.30.

The machine is a Dell Inc. PowerEdge R510 machine, with 16 CPUs @ 2.1 GHz and 126 GiB RAM, 16 GiB Swap. OS is CentOS 5.7,  kernel 2.6.18. We have RAID 10 over local 10k RPM SAS disks (10x600GB disks)

How to compare InnoDB & TokuDB?

2TB of compressed data (for absolute measurement I consider it to be a 4TB worth of data) is quite a large setup. How do I do the comparison? I don’t even have too much disk space here…

We have tables of various size. Our largest is in itself 1TB (2TB uncompressed) – half of the entire volume. The rest ranging 330GB, 140GB, 120GB, 90GB, 50GB and below. We have MONTHly partitioning schemes on most tables and obviously on our larger tables.

For our smaller tables, we could just CREATE TABLE test_table LIKE small_table, populating it and comparing compression. However, the really interesting question (and perhaps the only interesting question compression-wise) is how well would our larger (and specifically largest) tables would compress.

Indeed, for our smaller tables we saw between 20% to 70% reduction in size when using stronger InnoDB compression: KEY_BLOCK_SIZE=4/2/1. How well would that work on our larger tables? How much slower would it be?

We know MySQL partitions are implemented by actual independent tables. Our testing approach was: let’s build a test_table from a one month worth of data (== one single partition) of our largest table. We tested:

  • The time it takes to load the entire partition (about 120M rows, 100GB COMPRESSED data as seen on .idb file)
  • The time it would take to load a single day’s worth of data from Hive/Hadoop (loading real data, as does our nightly import)
  • The time it would take for various important SELECT query to execute on this data.

InnoDB vs. TokuDB comparison

In this post I will only describe our impressions of compression size. I have a lot to say about TokuDB vs InnoDB partitioning and queries; this will wait till later post.

So here goes:

Engine Compression Time to Insert 1 month Table size (optimized) Time to import 1 day
InnoDB 8k 10.5h 58GB 32m
InnoDB 4k 48h 33GB unknown (too long)
TokuDB quicklz 14h 17GB 40m
TokuDB lzma (small/aggresive) 15h 7.5GB 42m

Some comments and insights:

  • Each test was performed 3-4 times. There were no significant differences on the various cycles.
  • The 1 month insert was done courtesy QueryScript split5,000 rows at a time, no throttling.
  • The 1 day import via mysqlimport. There were multiple files imported. Each file is sorted by PRIMARY KEY ASC.
  • Isn’t it nice to know that your 100GB InnoDB table actually fits within 58GB when rebuilt?
  • For InnoDB flush_logs_at_trx_commit=2, flush_method=O_DIRECT.
  • I used default configuration to TokuDB — touched nothing. More on this in later post.
  • InnoDB 4k was prohibitively slow to load data. It was so slow so as to be unacceptable. For the 1 day load it took 1 hour for a mere 20% of data to load. 1 hour was already marginal for our requirements; waiting for 5 hours was out of the question. I tested several times, never got to wait for completion. Did I say it would just be “more of the same”? 4k turned to be “not an option”.
  • I saw almost no difference in load time between the two TokuDB compression formats. Both somewhat (30%) longer than InnoDB to load, but comparable.
  • TokuDB compression: nothing short of amazing.

With InnoDB 4k being “not an option”, and with both TokuDB compressions being similar in load time yet so different in compression size, we are left with the following conclusion: if we want to compress more than our existing 8k (and we have to) – TokuDB’s agressive compression (aka small, aka lzma) is our only option.

Shameless plug

common_schema turned to be quite the “save the day” tool here. Not only did we use it to extract 100GB of data from a large dataset and load it onto our tables, it also helped out in the ALTER process for TokuDB: at this time (<= 7.0.4) TokuDB still has a bug with KEY_BLOCK_SIZE: when this option is found in table definition, it impacts TokuDB’s indexes by bloating them. This is how sql_alter_table_tokudb was born. Hopefully it will be redundant shortly.

More to come

Was our test fair? Should we have configure TokuDB differently? Is loading via small 5,000 row chunks the right way?

In the next post I will describe the process of migrating our 4TB worth of data to TokuDB, pitfalls, issues, party crushers, sport spoilers, configuration, recovery, cool behaviour and general advice you should probably want to embrace. At later stage I’ll describe how our DWH looks after migration. Finally I’ll share some (ongoing) insights on performance.

You’ll probably want to know “How much is (non compressed) 4TB of data worth in TokuDB?” Let’s keep the suspense 🙂

]]>
https://shlomi-noach.github.io/blog/mysql/converting-an-olap-database-to-tokudb-part-1/feed 8 6473
common_schema 2.2: better QueryScript isolation & cleanup; TokuDB; table_rotate, split params https://shlomi-noach.github.io/blog/mysql/common_schema-2-2-better-queryscript-isolation-tokudb-table_rotate-split-params https://shlomi-noach.github.io/blog/mysql/common_schema-2-2-better-queryscript-isolation-tokudb-table_rotate-split-params#comments Tue, 13 Aug 2013 03:39:12 +0000 https://shlomi-noach.github.io/blog/?p=6468 common_schema 2.2 is released. This is shortly after the 2.1 release; it was only meant as bug fixes release but some interesting things came up, leading to new functionality.

Highlights of the 2.2 release:

  • Better QueryScript isolation & cleanup: isolation improved across replication topology, cleanup done even on error
  • Added TokuDB related views
  • split with “index” hint (Ike, this is for you)
  • table_rotate(): a logrotate-like mechanism for tables
  • better throw()

Drill down:

Better QueryScript isolation & cleanup

common_schema 2.1 introduced persistent tables for QueryScript. This also introduced the problem of isolating concurrent scripts, all reading from and writing to shared tables. In 2.1 isolation was based on session id. However although unique per machine, collisions were possible across replication topology: a script could be issued on master, another on slave (I have such use cases) and both use same (local) session id.

With 2.2 isolation is based on server_id & session id combination; this is unique across a replication topology.

Until 2.1, QueryScript used temporary tables. This meant any error would just break the script, and the tables were left (isolated as they were, and auto-destroyed in time). With persistent tables a script throwing an error meant legacy code piling up. With common_schema 2.2 and on MySQL >= 5.5 all exceptions are caught, cleanup is made, leaving exceptions to be RESIGNALled.

TokuDB views

A couple TokuDB related views help out in converting to TokuDB and in figuring out tables status on disk:

  • sql_alter_table_tokudb will help you out to generate the complex ALTER statement to TokuDB engine if you happen to used COMPRESSED InnoDB tables with KEY_BLOCK_SIZE specified. The view generates a complex DROP KEYs & ADD KEYs statementl this is due to bug …
  • tokudb_file_map simplifies the INFORMATION_SCHEMA.Tokudb_file_map table: the original view is not normalized and is difficult to interpret and follow when your table had many indexes or is partitioned (I will write more on this shortly). with common_schema‘s tokudb_file_map you get, per table, the list of files representing that table, along with a couple Shell commands to tell you the thing you want to know most: “what is the size of my TokuDB table on disk?”

split

QueryScript’s split device now supports the index” parameter (or hint), which instructs the split() operation to use an explicitly named index. If used, the index must exist and must be UNIQUE.

table_rotate()

Rotate your tables a-la logrotate with table_rotate(): generate a new, identical, empty table, version your table, pushing older versions along the line; optionally drop older versions. You get the picture. Got some nice use case behind this on cleaning up a test database.

throw()

On MySQL >= 5.5 throw() uses SIGNAL. No more weird “table `Unknown column ‘$t’ in ‘field list’` does not exist” messages. Just plain old:

ERROR 1054 (42S22): Unknown column '$t' in 'field list'

Get it

common_schema is free and open source. It is licensed under GPL v2. This is where you can find and download latest common_schema release.

Your input is welcome! Please submit your bugs, or otherwise share your experience with common_schema.

]]>
https://shlomi-noach.github.io/blog/mysql/common_schema-2-2-better-queryscript-isolation-tokudb-table_rotate-split-params/feed 1 6468
common_schema roadmap thoughts https://shlomi-noach.github.io/blog/mysql/common_schema-roadmap-thoughts https://shlomi-noach.github.io/blog/mysql/common_schema-roadmap-thoughts#comments Mon, 22 Jul 2013 12:36:08 +0000 https://shlomi-noach.github.io/blog/?p=6435 I’m happy with common_schema; it is in fact a tool I use myself on an almost daily basis. I’m also happy to see that it gains traction; which is why I’m exposing a little bit of my thoughts on general future development. I’d love to get feedback.

Supported versions

At this moment, common_schema supports MySQL >= 5.1, all variants. This includes 5.5, 5.6, MySQL, Percona Server & MariaDB.

5.1 is today past end of line, and I’m really missing the SIGNAL/RESIGNAL syntax that I would like to use; I can do in the meanwhile with version-specific code such as /*!50500 … */. Nevertheless, I’m wondering whether I will eventually have to:

  • Support different branches of common_schema (one that supports 5.1, one that supports >= 5.5)
  • Stop support for 5.1

Of course community-wise, the former is preferred; but I have limited resources, so I would like to make a quick poll here:

Note: There is a poll embedded within this post, please visit the site to participate in this post's poll.

I’ll use the poll’s results as a vague idea of what people use and want. Or please use comments below to sound your voice!

rdebug

This was a crazy jump at providing a stored routine debugger and debugging API. From some talk I made I don’t see this getting traction. For the time being, I don’t see that I will concentrate my efforts on this. Actually it is almost complete. You can step-into, step-out, step-over, set breakpoints, read variables, modify variables — it’s pretty cool.

But someone will eventually have to write a GUI front-end for this (eclipse/IntelliJ/whatever); I know not many will use a command line approach for a debugger. I also know I’m not going to write the GUI front-end. So the API is there, let’s see how it rolls.

QueryScript

I will keep on improving QueryScript, and in particular split, error handling, and otherwise simplification of common tasks. I have no doubt QueryScript goes the right way: I just see how easy it is to solve complex problems with a QueryScript one-liner. Other bullets on my TODO for QueryScript:

  • Script tracking (a semi-debugging mechanism, which allows one to recognize status of script)
  • Message passing to scripts (again, a semi-debugger approach)
  • Error recovery; ability to resume script from point of failure or point of suspension. I have plenty use cases for that.

performance_schema

I will most probably include parts of Mark Leith’s ps_helper, which is released under a permissive license, and otherwise draw ideas from his work. I’m happy to learn parts of ps_helper were influenced by common_schema itself.

Hosting

common_schema will most probably move out of Google Code; by Jan 2014 there will no longer be a “Downloads” section, and I really, really, want there to be a “Downloads” section.

I could go LaunchPad, GitHub, BitBucket (they don’t have “Downloads” sections, either, do they?), other; any advice?

World domination

Yep. This is still common_schema‘s goal. More seriously, I would want to see it installed on every single MySQL server instance. Then I would control your fate. bwahahaha!

Even more seriously, if you are a happy user, please do pass the word. I can only blog so much and present so much; there are no financing resources for this project, and I need all the help I can get in promoting common_schema. Thank you!

]]>
https://shlomi-noach.github.io/blog/mysql/common_schema-roadmap-thoughts/feed 4 6435
common_schema 2.1 released: advanced & improved split(), persistent script tables, more schema analysis, and (ahem) charts! https://shlomi-noach.github.io/blog/mysql/common_schema-2-1-released-advanced-improved-split-persistent-script-tables-more-schema-analysis-and-ahem-charts https://shlomi-noach.github.io/blog/mysql/common_schema-2-1-released-advanced-improved-split-persistent-script-tables-more-schema-analysis-and-ahem-charts#comments Wed, 17 Jul 2013 18:57:06 +0000 https://shlomi-noach.github.io/blog/?p=6398 common_schema 2.1 is released! common_schema is your free & open source companion schema within your MySQL server, providing with a function library, scripting capabilities, powerful routines and ready-to-apply information and recommendations.

New and noteworthy in version 2.1:

  • Better QueryScript’s split() functionality
  • Persistent tables for QueryScript: no long held temporary tables
  • Index creation analysis, further range partition analysis
  • grant_access(): allow everyone to use common_schema
  • Ascii charts, google charts
  • debugged_routines: show routines with debug code

Other minor enhancements and bugfixes not listed.

Here’s a breakdown of the above:

split() enhancements

split is one of those parts of common_schema that (should) appeal to every DBA. Break a huge transaction automagically into smaller chunks, and don’t worry about how it’s done. If you like, throttle execution, or print progress, or…

split enhancements include:

  • A much better auto-detection-and-selection of the chunking index. split now consults all columns covered by the index, and uses realistic heuristics to decide which UNIQUE KEY on your table is best for the chunking process. A couple bugs are solved on the way; split is much smarter now.
  • Better support for multi-column chunking keys. You may now utilize the start/stop parameters even on multi column keys, passing a comma delimited of values for the split operation to start/end with, respectively. Also fixed issue for nonexistent start/stop values, which are now valid: split will just keep to the given range.
  • split no longer requires a temporary table open through the duration of its operation. See next section.

Persistent script tables

QueryScript used to use several temporary tables for its operation. Thus, a script could hold open two or three temporary tables for the entire execution duration. For long split operations, for example, this could mean hours and days.

Temporary tables are nice and quick to respond (well, MyISAM tables are, until MySQL 5.7 is out), but make for an inherent problem: stopped slaves must not shut down nor restart when replication has an open temporary table. Why? Well, because the slave forgets about the temporary tables. When it resumes operation, it will not recognize DML issued against those tables it has forgotten. That’s why oak-prepare-shutdown is so good for slaves.

When temporary tables are short-lived, this is typically not an issue. But if you are not allowed to restart your slave throughout a 24 hour operation, that’s a limitation.

As of 2.1, QueryScript does not require long held temporary tables. In fact, typical scripts do not create any temporary tables. A split operation creates and immediately drops a series of temporary tables. These are dropped even before actual split operation begins. All tables operated on are persistent InnoDB tables.

Result: safer script replication. There’s another nice side effect I may take advantage of in a later release: ability to monitor and control flow of concurrent scripts.

Schema analysis

Two noteworthy additions to schema analysis views:

  • sql_alter_table now includes the sql_drop_keys & sql_add_keys columns. For each table, you get the SQL statements to create and drop the existing indexes. I developed this when I hit this problem with TokuDB.
  • sql_range_partitions now includes the count_past_partitions & count_future_partitions; when your table is partitioned by some type of time range, these views tell you how many partitions are in the past, and how many are to be written to in the future. This turns useful when you want to rotate or otherwise set a retention policy for your range partitions.

grant_access()

The grant_access() routine GRANTs all accounts on your server with SELECT & EXECUTE privileges on common_schema. This is a quick complementary to the installation process (though you have to invoke it yourself; it’s up to you).

Ascii/google line charts

Laugh all you want! And find how cool it is to get (poor man’s) instant charting like this:

mysql> call line_chart('select ts, com_insert_psec, com_update_psec from mycheckpoint.sv_hour limit 100', 'insert per second, update per second') ;
+---------+------------------------------------------------------------------------------------------------------+
| y_scale | chart                                                                                                |
+---------+------------------------------------------------------------------------------------------------------+
| 162     | -#-------------------------------------------------------------------------------------------------- |
| 152     | ---------------------------------------------------------------------------------------------------- |
| 143     | ---------------------------------------------------------------------------------------------------- |
| 134     | ---------------------------------------------------------------------------------------------------- |
| 124     | ---------------------------------------------------------------------------------------------------- |
| 115     | ------------------------------------------------------------#--------------------------------------- |
| 106     | ---------------------------------------------------------------------------------------------------- |
| 96      | -*-------------------------------------------------------------------------------------------------- |
| 87      | ---------------------------------#-------------------------#---------------------------------------- |
| 77      | ---------------------------------------------------------------------------------#------------------ |
| 68      | ---------------------------------------------------------------------------#------------------------ |
| 59      | #-------------------------------#------------------------------------------------------------------- |
| 49      | ---##------#-#-##-#-#--#--###----------------##---------------------------------#-----#---###------- |
| 40      | --#------#--#-#--#-#-##-##----##--###########--######--------#############-*#-##--####-###---####### |
| 31      | *-**--#-#-*-**-**------**--**#-****-**-*****-*******-#---#-*------------**---#--*------------------- |
| 21      | ----*#*#*--*--*--******--**--**----*--*-----*-------**#*#**-************--#-****-******************* |
| 12      | -----*-*-*--------------------------------------------*-*-----------------*------------------------- |
|         | v::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::v |
|         | 2010-10-06 20:00:00                                                              2010-10-10 23:00:00 |
|         |     # insert per second                                                                              |
|         |     * update per second                                                                              |
+---------+------------------------------------------------------------------------------------------------------+

You can get the same in Google Image Charts format. Yes, it’s deprecated (and has been for a year — it’s still working)

mysql> call google_line_chart('select ts, com_insert_psec, com_update_psec from mycheckpoint.sv_hour limit 100', 'insert per second, update per second') \G

google_chart_url: http://chart.apis.google.com/chart?cht=lc&chs=800x350&chtt=SQL+chart+by+common_schema&chxt=x,y&chxr=1,11.9,161.7&chd=s:S9NOOGKFGKHQMONPONONNKNONNOOQINMRgLLNMMNNNNNNOONMNNNMHEFFJFfsLLMMMLLMNMMNNDVNIMKPaKLLMOMNNNONNNMMMMM,IiGGFCDBCBGFGGGGGFGGFEFGGGGGHDGGJGGGGGFGGGGGGGGGGGGGFCBCCCEHGGGFFFFFGGGFGGAKFDFFIFFFFFFFFFFFFFFFFFFF&chxs=0,505050,10,0,lt&chxl=0:|2010-10-06%2020:00:00||||||||||||||||||||||||2010-10-07%2020:00:00|||||||||||||||||||||||||2010-10-08%2021:00:00|||||||||||||||||||||||||2010-10-09%2022:00:00|||||||||||||||||||||||||2010-10-10%2023:00:00&chg=1.010101010,25,1,2,0,0&chco=ff8c00,4682b4&chdl=insert%20per%20second|update%20per%20second&chdlp=b

The above translates into the following image:

Throw you own query in. Make 1st column your ordering column, 2nd [, 3rd…] value columns. Provide your own legend. Watch it instantly. And laugh all you want.

Read more about common_schema’s charting routines.

debugged_routines

The new debugged_routines view shows you which routines are currently “compiled with debug mode“.

I will write more on the state of rdebug in a future post.

Try it, get it

  • common_schema 2.1 comes with over 500 tests and fast growing.
  • It supports MySQL 5.1, 5.5, 5.6, Percona Server and MariaDB.
  • It has superb documentation (may I say so?) with a lot of examples & drill down into edge cases.

You are free to download and use it.

Your feedback is welcome! Indeed, many of this version’s improvements originated with community feedback.

]]>
https://shlomi-noach.github.io/blog/mysql/common_schema-2-1-released-advanced-improved-split-persistent-script-tables-more-schema-analysis-and-ahem-charts/feed 4 6398
opeark-kit revision 196 released https://shlomi-noach.github.io/blog/mysql/opeark-kit-revision-196-released https://shlomi-noach.github.io/blog/mysql/opeark-kit-revision-196-released#respond Tue, 07 May 2013 05:47:22 +0000 https://shlomi-noach.github.io/blog/?p=6335 This is a long due maintenance release of openark-kit. This release includes bugfixes and some enhancements, mainly to oak-online-alter-table.

oak-online-alter-table Changes / bug fixes include:

  • Support for keyword-named columns
  • Use of FORCE INDEX due to lack of MySQL’s ability for figure out the chunking key at all times
  • –sleep-ratio option added; allows for sleep time proportional to execution time (as opposed to constant sleep time with –sleep)
  • Support for chunk-retry (in case of deadlock) via –max-lock-retries)
  • Fixed order of cleanup
  • Fixed bug with verbose messages with non-integer chunking key
  • Fixed bug with single-row tables (people, no need for this tool for single row tables :))
  • Friendly verbose messages to remind you what’s being executed

oak-chunk-update changes includes:

  • Verbosing query comment if exists (friendly printing of what’s being executed)

(Do check out QueryScript’s split(); it’s a simple, server side solution which works almost same way as oak-chunk-update)

More issues and changes not listed here.

Download

openark-kit is released under the new BSD license, and is freely available.

]]>
https://shlomi-noach.github.io/blog/mysql/opeark-kit-revision-196-released/feed 0 6335
common_schema: 1.3: security goodies, parameterized split(), json-to-xml, query checksum https://shlomi-noach.github.io/blog/mysql/common_schema-1-3-security-goodies-parameterized-split-json-to-xml-query-checksum https://shlomi-noach.github.io/blog/mysql/common_schema-1-3-security-goodies-parameterized-split-json-to-xml-query-checksum#respond Mon, 14 Jan 2013 06:25:07 +0000 https://shlomi-noach.github.io/blog/?p=5941 common_schema 1.3 is released and is available for download. New and noteworthy in this version:

  • Parameterized split(): take further control over huge transactions by breaking them down into smaller chunks, now manually tunable if needed
  • duplicate_grantee(): copy+paste existing accounts along with their full set of privileges
  • similar_grants: find which accounts share the exact same set of privileges (i.e. have the same role)
  • json_to_xml(): translate any valid JSON object into its equivalent XML form
  • extract_json_value(): use XPath notation to extract info from JSON data, just as you would from XML
  • query_checksum(): given a query, calculate a checksum on the result set
  • random_hash(): get a 40 hexadecimal digits random hash, using a reasonably large changing input

Let’s take a closer look at the above:

Parameterized split()

split takes your bulk query and automagically breaks it down into smaller pieces. So instead of one huge UPDATE or DELETE or INSERT..SELECT transaction, you get many smaller transactions, each with smaller impact on I/O, locks, CPU.

As of 1.3, split() gets more exposed: you can have some control on its execution, and you also get a lot of very interesting info during operation.

Here’s an example of split() control:

set @script := "
  split({start:7015, step:2000} : UPDATE sakila.rental SET return_date = return_date + INTERVAL 1 DAY) 
    throttle 1;
";
call common_schema.run(@script);

In the above we choose a split size of 2,000 rows at a time; but we also choose to only start with 7015, skipping all rows prior to that value. Just what is that value? It depends on the splitting key (and see next example for just that); but in this table we can safely assume this is the rental_id PRIMARY KEY of the table.

You don’t have to use these control parameters. But they can save you some time and effort.

And, look at some interesting info about the splitting process:

set @script := "
  split(sakila.film_actor) 
    select $split_columns as columns, $split_range_start as range_start, $split_range_end as range_end
";
call common_schema.run(@script);
+----------------------+-------------+------------+
| columns              | range_start | range_end  |
+----------------------+-------------+------------+
| `actor_id`,`film_id` | '1','1'     | '39','293' |
+----------------------+-------------+------------+

+----------------------+-------------+------------+
| columns              | range_start | range_end  |
+----------------------+-------------+------------+
| `actor_id`,`film_id` | '39','293'  | '76','234' |
+----------------------+-------------+------------+

+----------------------+-------------+-------------+
| columns              | range_start | range_end   |
+----------------------+-------------+-------------+
| `actor_id`,`film_id` | '76','234'  | '110','513' |
+----------------------+-------------+-------------+

+----------------------+-------------+-------------+
| columns              | range_start | range_end   |
+----------------------+-------------+-------------+
| `actor_id`,`film_id` | '110','513' | '146','278' |
+----------------------+-------------+-------------+

+----------------------+-------------+-------------+
| columns              | range_start | range_end   |
+----------------------+-------------+-------------+
| `actor_id`,`film_id` | '146','278' | '183','862' |
+----------------------+-------------+-------------+

+----------------------+-------------+-------------+
| columns              | range_start | range_end   |
+----------------------+-------------+-------------+
| `actor_id`,`film_id` | '183','862' | '200','993' |
+----------------------+-------------+-------------+

In the above you get to be told exactly how table splitting occurs: you are being told what columns are used to split the table, and what range of values is used in each step. There’s more to it: read the split() documentation.

similar_grants

Out of your 100 different grants, which ones share the exact same set of privileges? MySQL has non notion of roles, but that doesn’t mean the notion does not exist. Multiple accounts share the same restrictions and privileges. Use similar_grants to find out which. You might just realize there’s a few redundant accounts in your system.

mysql> SELECT * FROM similar_grants;
+-------------------------------+----------------+-------------------------------------------------------+
| sample_grantee                | count_grantees | similar_grantees                                      |
+-------------------------------+----------------+-------------------------------------------------------+
| 'root'@'127.0.0.1'            |              3 | 'root'@'127.0.0.1','root'@'myhost','root'@'localhost' |
| 'repl'@'10.%'                 |              2 | 'repl'@'10.%','replication'@'10.0.0.%'                |
| 'apps'@'%'                    |              1 | 'apps'@'%'                                            |
| 'gromit'@'localhost'          |              1 | 'gromit'@'localhost'                                  |
| 'monitoring_user'@'localhost' |              1 | 'monitoring_user'@'localhost'                         |
+-------------------------------+----------------+-------------------------------------------------------+

duplicate_grantee()

Provide an existing account, and name your new, exact duplicate account. The complete set of privileges is copied, and so is the password. duplicate_grantee() is your Copy+Paste of MySQL accounts.

Let’s begin with some pre-existing account and see how it duplicates:

mysql> show grants for 'world_user'@'localhost';
+------------------------------------------------------------------------------------------------------------------------+
| Grants for world_user@localhost                                                                                        |
+------------------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO 'world_user'@'localhost' IDENTIFIED BY PASSWORD '*6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9'      |
| GRANT ALL PRIVILEGES ON `world`.* TO 'world_user'@'localhost'                                                          |
| GRANT EXECUTE, ALTER ROUTINE ON FUNCTION `sakila`.`get_customer_balance` TO 'world_user'@'localhost' WITH GRANT OPTION |
+------------------------------------------------------------------------------------------------------------------------+

mysql> call duplicate_grantee('world_user@localhost', 'copied_user@10.0.0.%');
Query OK, 0 rows affected (0.06 sec)

mysql> show grants for 'copied_user'@'10.0.0.%';
+------------------------------------------------------------------------------------------------------------------------+
| Grants for copied_user@10.0.0.%                                                                                        |
+------------------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO 'copied_user'@'10.0.0.%' IDENTIFIED BY PASSWORD '*6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9'      |
| GRANT ALL PRIVILEGES ON `world`.* TO 'copied_user'@'10.0.0.%'                                                          |
| GRANT EXECUTE, ALTER ROUTINE ON FUNCTION `sakila`.`get_customer_balance` TO 'copied_user'@'10.0.0.%' WITH GRANT OPTION |
+------------------------------------------------------------------------------------------------------------------------+

The routine is quite relaxed in grantee format. copied_user@10.0.0.%, copied_user@’10.0.0.%’ and ‘copied_user’@’10.0.0.%’ are all just fine, and represent the same account. Saves trouble with all that quoting.

json_to_xml()

JSON is becoming increasingly popular in storing dynamically-structured data. XML’s tags overhead and its human unfriendliness make it less popular today. However, the two share similar concepts, and conversion between the two is possible. json_to_xml() will translate your valid JSON data into its equivalent XML format. The rules are simple (all-nodes-and-data, no attributes, arrays as repeating nodes, objects as subnodes) and the results are valid XML objects.

Sample data taken from json.org:

mysql> SET @json := '
{
  "menu": {
    "id": "file",
    "value": "File",
    "popup": {
      "menuitem": [
        {"value": "New", "onclick": "CreateNewDoc()"},
        {"value": "Open", "onclick": "OpenDoc()"},
        {"value": "Close", "onclick": "CloseDoc()"}
      ]
    }
  }
}
';

mysql> SELECT json_to_xml(@json) AS xml \G
*************************** 1. row ***************************
xml: <menu><id>file</id><value>File</value><popup><menuitem><value>New</value><onclick>CreateNewDoc()</onclick></menuitem><menuitem><value>Open</value><onclick>OpenDoc()</onclick></menuitem><menuitem><value>Close</value><onclick>CloseDoc()</onclick></menuitem></popup></menu>

Beautified form of the above result:

<menu>
  <id>file</id>
  <value>File</value>
  <popup>
    <menuitem>
      <value>New</value>
      <onclick>CreateNewDoc()</onclick>
    </menuitem>
    <menuitem>
      <value>Open</value>
      <onclick>OpenDoc()</onclick>
    </menuitem>
    <menuitem>
      <value>Close</value>
      <onclick>CloseDoc()</onclick>
    </menuitem>
  </popup>
</menu>

Note that linked examples page uses sporadically invented attributes; common_schema prefers using well-defined nodes.

extract_json_value()

Which means things you can do with XML can also be done with JSON. XPath is a popular extraction DSL, working not only for XML but also for Object Oriented structures (see Groovy’s nice integration of XPath into the language, or just commons-beans for conservative approach). JSON is a perfect data store for XPath expressions; by utilizing the translation between JSON and XML, one is now easily able to extract value from JSON (using same example as above):

mysql> SELECT extract_json_value(@json, '//id') AS result;
+--------+
| result |
+--------+
| file   |
+--------+

mysql> SELECT extract_json_value(@json, 'count(/menu/popup/menuitem)') AS count_items;
+-------------+
| count_items |
+-------------+
| 3           |
+-------------+

Implementations of json_to_xml() and extract_json_value() are CPU intensive. There is really just one justification for having these written in Stored Procedures: their lack in the standard MySQL function library. This is reason enough. Just be aware; test with BENCHMARK().

query_checksum()

It looks like this:

mysql> call query_checksum('select id from world.City where id in (select capital from world.Country) order by id');
+----------------------------------+
| checksum                         |
+----------------------------------+
| 5f35070b90b0c079ba692048c51a89fe |
+----------------------------------+

mysql> call query_checksum('select capital from world.Country where capital is not null order by capital');
+----------------------------------+
| checksum                         |
+----------------------------------+
| 5f35070b90b0c079ba692048c51a89fe |
+----------------------------------+

The two queries above yield with the same result set. As consequence, query_checksum() produces the same checksum value for both. The next query produces a different result set, hence a different checksum:

mysql> call query_checksum('select id from world.City where id in (select capital from world.Country) order by id limit 10');
+----------------------------------+
| checksum                         |
+----------------------------------+
| 997079c2dfca34ba87ae44ed8965276e |
+----------------------------------+

The routine actually invokes the given queries (modifying them a bit along the way) and uses a deterministic incremental checksum to get the final result.

Its use? As a handy built-in mechanism for comparing your table data. This is meant for relatively small result sets – not for your 20GB table. Inspired by Baron’s old trick, and works on server side (Windows/GUI/automated clients to benefit).

random_hash()

Random hashes come handy. The naive way to produce them is by executing something like SELECT SHA1(RAND()). However the RAND() function just doesn’t provide enough plaintext for the hash function. The SHA/MD5 functions expect a textual input, and produce a 160/128 bit long hash. The maximum char length of a RAND() result is 20 characters or so, and these are limited to the 0-9 digits. So at about 10^20 options for input, which is about 64 bit. Hmmmm. a 64 bit input to generate a 160 bit output? I don’t think so! random_hash() provides additional input in the form of your current status (at about 830 characters) as well as RAND(), SYSDATE() and server ID.

Bugfixes

Any bugfix adds at least one test; typically more. Currently with over 470 tests, common_schema is built to work.

Get common_schema

common_schema 1.3 is available under the permissive New BSD License. Find the latest download here.

If you like to support common_schema, I’m always open for ideas and contributions. Or you can just spread the word!

]]>
https://shlomi-noach.github.io/blog/mysql/common_schema-1-3-security-goodies-parameterized-split-json-to-xml-query-checksum/feed 0 5941
common_schema over traditional scripts https://shlomi-noach.github.io/blog/mysql/common_schema-over-traditional-scripts https://shlomi-noach.github.io/blog/mysql/common_schema-over-traditional-scripts#comments Wed, 12 Dec 2012 11:55:44 +0000 https://shlomi-noach.github.io/blog/?p=5509 If you are familiar with both openark kit and common_schema, you’ll notice I’ve incorporated some functionality already working in openark kit into common_schema, essentially rewriting what used to be a Python script into SQL/QueryScript.

What was my reasoning for rewriting good code? I wish to explain that, and provide with a couple examples.

I’m generally interested in pushing as much functionality into the MySQL server. When using an external script, one:

  • Needs the right dependencies (OS, Perl/Python version, Perl/Python modules).
  • Needs to provide with connection params,
  • Needs to get acquainted with a lot of command line options,
  • Is limited by whatever command line options are provided.
  • Has to invoke that script (duh!) to get the work done.

This last bullet is not so trivial: it means you can’t work some operation with your favorite GUI client, because it has no notion of your Perl script; does not run on the same machine where your Python code resides; simply can’t run those scripts for you.

With server-side code, functionality is accessible via any client. You run your operation via a query (e.g. CALL some_procedure). That can be done from your GUI client, your command line client, your event scheduler, your cronjob, all equally. You only need access to your MySQL server, which is trivial.

Of course, server side scripting is limited. Some stuff simply can’t be written solely on server side. If you want to consult your replicating slave; gracefully take action on user’s Ctrl+C, send data over the web, you’ll have to do it with an external tool. There are actually a lot of surprising limitations to things one would assume are possible on server side. You may already know how frustrated I am by the fact one can hardly get info from SHOW commands.

But, when it works, it shines

Let’s review a couple examples. The first one is nearly trivial. The second less so.

Example: getting AUTO_INCREMENT “free space”

openark kit offers oak-show-limits. It’s a tool that tells you if any of your AUTO_INCREMENT columns are running out of space (and so you might want to ALTER that INT to BIGINT).

It’s a very simple Python script. It gets your MAX(auto_increment_column) FROM tables_with_auto_increment, and compares that MAX value to the column type. It pre-computes:

max_values['tinyint'] = 2**8
max_values['smallint'] = 2**16
max_values['mediumint'] = 2**24
max_values['int'] = 2**32
max_values['bigint'] = 2**64

takes care of SIGNED/UNSIGNED, and does the math. Why is this tool such a perfect candidate for replacement on server side? For two reasons.

First, It turns out it takes very little effort to build a query which does the same. In which case it is also easy to build a view which provides the same.

Second, there’s this thing with command line arguments. The openark tool provides with –threshold (only output those columns where capacity is larger than x%), –database (only scan given database), –table (only for tables matching name), –column (only for columns matching name).

I don’t like this. See, the above is essentially an extra layer for saying:

  • WHERE auto_increment_ratio >= x
  • WHERE table_schema = …
  • WHERE table_name = …
  • WHERE column_name = …

The command line arguments each take the role of some WHERE/AND condition.Wow, what a 1-1 mapping. How about if I wanted the results sorted in some specific order? I would have to add a command line argument for that! How about only listing the SIGNED columns? I would have to add a command line argument for that, too! How about showing top 10? Yes, another command line argument!

Some of the above can be solved via shell scripting (sort -k 3 -n, head -n 10, etc.). But, hey, we’re OK with SQL, aren’t we? Why add now these two extra layers? Get to know all the command line options, get to script it? I love scripting, but this is an abuse.

So it makes much more sense, in my opinion, to SELECT * FROM auto_increment_columns WHERE table_schema=’my_db’ AND auto_increment_ratio >= 0.8 ORDER BY auto_increment_ratio DESC LIMIT 10. It doesn’t require SQL-fu skills, just basic SQL skills which every DBA and DB user are expected to have. And it allows one to work from whatever environment one feels comfortable with. Heck, with your GUI editor you can probably get off with it by right-clicking and left-clicking your mouse buttons, never typing one character.

Example: blocking user accounts

The above mapped very easily to a query, and was just a read-only query. What if we had to modify data? oak-block-accounts is a tool which allows one to block grantees from logging in, then releasing them later on. common_schema offers sql_accounts and eval().

Let’s skip the command line arguments issue, as it is identical to the above. How should we best provide with “taking action” interface? A script would have no problem to first SELECT stuff, then UPDATE, or SET PASSWORD, or DROP etc. How easy is it to do the same on server side?

The immediate solution is to write a stored procedure to do that. I reject the idea. Why? Because the procedure would look like this:

PROCEDURE block_account(user VARCHAR(64), host VARCHAR(64), only_if_empty_password BOOL, ...);

Can you see where I’m getting at? Doing the above re-introduces command line options, this time disguised as procedure parameters. We would again have to list all available filtering methods, only this time things are worse: since stored procedures have no such notion as overloading, and change to the params will break compatibility. Once we introduce this routine, we’re stuck with it.

common_schema tries to stay away as far as it can from this pitfall. It presents another solution: the view solution. Just as with auto_increment_columns, SELECT your way to get the right rows. But this time, the result is a SQL query:

mysql> SELECT sql_block_account FROM sql_accounts WHERE USER = 'gromit';
+-------------------------------------------------------------------------------------+
| sql_block_account                                                                   |
+-------------------------------------------------------------------------------------+
| SET PASSWORD FOR 'gromit'@'localhost' = '752AA50E562A6B40DE87DF0FA69FACADD908EA32*' |
+-------------------------------------------------------------------------------------+

Do your own WHERE/AND combination in SQL. But, how to take action? Our view cannot take the actual action for us!

eval() is at the core of many common_schema operations, like this one:

CALL eval("SELECT sql_block_account FROM sql_accounts WHERE USER = 'gromit'");

The SET PASSWORD query just got evaluated. Meaning it was executed. eval() is a very powerful solution.

Conclusion

I prefer stuff on server side. It requires basic SQL skills (or a smart GUI editor), and allows you easy access to a lot of functionality, removing dependency requirements. It is not always possible, and external scripts can do miracles not possible on server side, but server side scripting has its own miracles.

]]>
https://shlomi-noach.github.io/blog/mysql/common_schema-over-traditional-scripts/feed 2 5509
Purging old rows with QueryScript: three use cases https://shlomi-noach.github.io/blog/mysql/purging-old-rows-with-queryscript-three-use-cases https://shlomi-noach.github.io/blog/mysql/purging-old-rows-with-queryscript-three-use-cases#respond Wed, 14 Nov 2012 09:15:35 +0000 https://shlomi-noach.github.io/blog/?p=5157 Problem: you need to purge old rows from a table. This may be your weekly/monthly cleanup task. The table is large, the amount of rows to be deleted is large, and doing so in one big DELETE is too heavy.

You can use oak-chunk-update or pt-archiver to accomplish the task. You can also use server side scripting with QueryScript, offering a very simple syntax with no external scripting, dependencies and command line options.

I wish to present three cases of row deletion, with three different solutions. In all cases we assume some TIMESTAMP column exists in table, by which we choose to purge the row. In all cases we assume we wish to purge rows older than 1 month.

We assume the naive query is this:

DELETE FROM my_schema.my_table WHERE row_timestamp < CURDATE() - INTERVAL 1 MONTH

Case 1: TIMESTAMP column is indexed

I almost always index a timestamp column, if only for being able to quickly purge data (but usually also to slice data by date). In this case where the column is indexed, it’s very easy to figure out which rows are older than 1 month.

We break the naive query into smaller parts, and execute these in sequence:

while (DELETE FROM my_schema.my_table WHERE row_timestamp < CURDATE() - INTERVAL 1 MONTH ORDER BY row_timestamp LIMIT 1000)
  throttle 1;

How does the above work?

QueryScript accepts a DELETE statement as a conditional expression in a while loop. The expression evaluates to TRUE when the DELETE affects rows. Once the DELETE ceases to affect rows (when no more rows match the WHERE condition), the while loop terminates.

The throttle command allows us to play nice: by throttling we increase the total runtime through sleeping in between loop iterations.

Case 2: TIMESTAMP column is not indexed, and there is no heuristic for matching rows

This case is hardest to tackle by means of optimization: there is no index, and we cannot assume or predict anything about the distribution of old rows. We must therefore scan the entire table so as to be able to purge old rows.

This does not mean we have to do one huge full table scan. As long as we have some way to split the table, we are still good. We can utilize the PRIMARY KEY or another UNIQUE KEY so as to break the table into smaller, distinct parts, and work our way on these smaller chunks:

split (DELETE FROM my_schema.my_table WHERE row_timestamp < CURDATE() - INTERVAL 1 MONTH)
  throttle 1;

The split statement will automagically calculate the chunks and inject filtering conditions onto the query, such that each execution of the query relates to a distinct set of rows.

Case 3: TIMESTAMP column not indexed, but known to be monotonic

This is true for many tables. Rows with AUTO_INCREMENT columns and TIMESTAMP columns are created with CURRENT_TIMESTAMP values. This makes for a monotonic function: as the AUTO_INCREMENT grows, so does the TIMESTAMP.

This makes for the following observation: it we iterate the table row by row, and reach a point where the current row is not old, then we can stop looking. Timestamps will only increase by value, which means further rows only turn to be newer.

With this special case at hand, we can:

split (DELETE FROM my_schema.my_table WHERE row_timestamp < CURDATE() - INTERVAL 1 MONTH) {
  if ($split_rowcount = 0)
    break;
  throttle 1;
}

split is a looping device, and a break statement works on split just as on a while statement.

split provides with magic variables which describe current chunk status. $split_rowcount relates to the number of rows affected by last chunk query. No more rows affected? This means we’ve hit recent rows, and we do not expect to find old rows any further. We can stop looking.

]]>
https://shlomi-noach.github.io/blog/mysql/purging-old-rows-with-queryscript-three-use-cases/feed 0 5157
common_schema 1.2: security, partition management, processes, QueryScript goodies https://shlomi-noach.github.io/blog/mysql/common_schema-1-2-security-partition-management-processes-queryscript-goodies https://shlomi-noach.github.io/blog/mysql/common_schema-1-2-security-partition-management-processes-queryscript-goodies#comments Tue, 13 Nov 2012 12:25:38 +0000 https://shlomi-noach.github.io/blog/?p=5539 common_schema 1.2 is released! This version comes shortly after 1.1, yet contains quite a few interesting goodies:

  • Account blocking
  • Security audit
  • RANGE partition management
  • Slave status
  • Better blocking and idle transaction management
  • QueryScript goodies:
    • echo, report
    • while-otherwise statement; foreach-otherwise statement
    • Better variable scope handling
    • Complete support for variable expansion
    • Transaction support within QueryScript
  • More summary info and SQL statements in processlist-related views

A closer look at these follows:

Account blocking

A new view called sql_accounts, inspired by oak-block-account (also see here and here) provides with the means of blocking use accounts (and releasing them, of course) without revoking their privileges. It offers the SQL statements to block an account (by modifying its password in a symmetric way) and to release an account (by modifying its password back to normal). It really works like a charm. Together with killall() and sql_accounts this gives the administrator great control over accounts.

Security audit

Imported from openark kit, and implemented via QueryScript, the security_audit() procedure will audit your accounts, passwords and general settings to find problems, pitfalls and security hazards. I will write more on this later.

RANGE partition management

The sql_range_partitions view manages your RANGE and RANGE COLUMNS partitioned tables by providing with the SQL statements to drop oldest partitions and to create the next (in sequence) partitions. See my earlier post.

Slave status

This is a hack providing a minified version of SHOW SLAVE STATUS, but as a view (slave_status). It only provides with 5 columns:

mysql> SELECT * FROM slave_status \G
*************************** 1. row ***************************
 Slave_Connected_time: 82077
     Slave_IO_Running: 1
    Slave_SQL_Running: 1
        Slave_Running: 1
Seconds_Behind_Master: 5

For me, the Seconds_Behind_Master is one critical value I am interested in getting using a query. So here it is.

mysql> SELECT (Seconds_Behind_Master < 10) IS TRUE AS slave_is_up_to_date FROM slave_status;
+---------------------+
| slave_is_up_to_date |
+---------------------+
|                   1 |
+---------------------+

QueryScript goodies

  • while-otherwise statement: while (some_condition) { … } otherwise { /* this gets executed if the while never performs a single iteration */ }
  • foreach-otherwise statement, likewise
  • echo, report statements: echo your statements before applying them, or just echo your comments along the code. Generate a (beautified) report at the end of script execution (which is how security_audit() works).
  • Better variable scopes: now allowing variables of same name to be declared when their scopes do not overlap. This makes for the expected behavior a programmer would expect.
  • Complete variable expansion handling: expanded variables are now recognized anywhere within the script, including inside a while or foreach expression.
  • Transactions are now handled by QueryScript and immediately delegated to MySQL. This completes the transaction management in QueryScript. Just start transaction, commit or rollback at will.

InnoDB idle transactions, blocking transactions

The innodb_transactions view now lists idle transactions, as well as their idle time. It also provides with the SQL statements to kill the query or connection for each transaction. This allows for a quick track or track-and-kill of idle transactions.

The innodb_locked_transactions view now offers the wait time and SQL statements for killing the query or connection of a blocking  transaction. This allows for a quick track or track-and-kill long time blocking transactions.

I will write more in depth on both in a future post.

Processlist-related views

The new processlist_states view aggregates processlist by thread state. This view, and all other processlist views now provide with median or 95% median runtime for processes, in addition to the less informative AVG provided earlier.

Get it!

common_schema is free and licensed under the New BSD License. It is nothing but a SQL file, so you simply import it into your MySQL server. common_schema installs on any MySQL >= 5.1 server, including Percona Server and MariaDB, tested on 5.1, 5.5 and 5.6 RC.

Go to common_schema download page.

]]>
https://shlomi-noach.github.io/blog/mysql/common_schema-1-2-security-partition-management-processes-queryscript-goodies/feed 2 5539