We are used to issue queries with an IN clause of the form:
... where state in ('started', 'completed', 'failed') ...
However I’ve had a few cases where I used an inverted format. Here’s one use case followed by an inverted IN clause.
Dynamic query building
Say we have this function:
GetLaggingSlaves(clusterName string)
Which, based on whether given clusterName is empty or not, would return list of all lagging slaves, or only those in the given cluster, respectively:
SELECT hostname FROM database_instance WHERE slave_lag_seconds > 60
or
SELECT hostname FROM database_instance WHERE cluster_name = 'mycluster:3306' AND slave_lag_seconds > 60
To avoid SQL injection you would create a prepared statement, but you don’t want to copy+paste everything, and so you build your query dynamically based on the value of clusterName. You want to end up with either:
SELECT hostname FROM database_instance WHERE slave_lag_seconds > ?, acceptableLag
or
SELECT hostname FROM database_instance WHERE cluster_name = ? AND slave_lag_seconds > ?, clusterName, acceptableLag
Dynamic query building is good practice, but a little pain (BTW I’m designing a new, simple & non intrusive query builder for golang). Is there a way to just get away with one query that has it all?
This is one option:
SELECT hostname FROM database_instance WHERE (cluster_name = ? OR ? = '') AND slave_lag_seconds > ?,clusterName, clusterName, acceptableLag
and it is somewhat painful to list clusterName twice in the arguments list. This is where the inverted IN clause kicks in. It will negate usage of an index, and may look strange at first glance, but as you get used to it it just becomes another pattern:
SELECT hostname FROM database_instance WHERE ? IN (cluster_name, '') AND slave_lag_seconds > ?, clusterName, acceptableLag
So when clusterName is empty, all rows where slave_lag_seconds > acceptableLag are fetched; when non empty, only those where cluster_name equals our value.
Very interesting.
One question: how does the optimizer react to this type of SQL statements (index on cluster_name) ?
Thanks for sharing.
As I suggested within the post, this negates the use of index:
This may, at first glance, make you want to avoid this trick altogether, but it really becomes more viable in your common website “filter by multiple options” type of queries (where multiple columns are involved, and there’s no way for you to index any combination of them), or when you know you’re already filtering by some other column that is indexed.
X000 above is to obscure production number of servers.