I’ve encountered the same problem twice for different customers, so I guess it’s worth a discussion.
A common task for web applications is to find out the country/region of a user, based on her IP address, as can be detected in the HTTP request. Depending on the country of origin, the website can translate dates for different time zones, can change locale settings, and, perhaps most commonly, show advertisements in her native language.
To start with, there’s a table which lists the IP ranges per country/region. Let’s assume we’re only dealing with IPv4:
CREATE TABLE regions_ip_range ( regions_ip_range_id INT UNSIGNED AUTO_INCREMENT, country VARCHAR(64) CHARSET utf8, region VARCHAR(64) CHARSET utf8, start_ip INT UNSIGNED, end_ip INT UNSIGNED, … PRIMARY KEY(regions_ip_range_id), ... );
The table is fixed, and is populated. Now the question arises: how do we query this table, and which indexes should be created?
The wrong way
The form I’ve encountered is as follows: an index is declared on regions_ip_range:
KEY ip_range_idx (start_ip, end_ip)
And the query goes like this:
SELECT * FROM regions_ip_range WHERE my_ip BETWEEN start_ip AND end_ip
It takes a grasp of indexes to understand that this is wrong. I’m not saying the results are wrong, just that the query performance is bad. Let’s rewrite the query to understand why. The following query is the exact equal of the above:
SELECT * FROM regions_ip_range WHERE my_ip >= start_ip AND my_ip <= end_ip
Can you see the problem?
There’s a range condition on the first indexed column (start_ip). This automatically negates the use of the second column (end_ip). Reversing the order won’t do, since there’s also a range condition on end_ip.
Effectively, if this were the only query we were executing, we would get the same performance had we defined the following index:
KEY ip_range_idx (start_ip)
Now that doesn’t look promising. It’s fair to guess (as happens in reality) that for the vast majority of ip addresses, MySQL would rather perform a full table scan than use the index.
Another wrong way
When pointing this to people, the natural response is: “OK, then, let’s index like this:”
KEY start_ip_idx (start_ip) KEY end_ip_idx (end_ip)
Now we have two indexes, one on each address. But that won’t do at all. Even if we assume MySQL will use both indexes for our query, and do an index_merge, we won’t have good performance. Consider: you can’t have both indexes be selective for any given IP. Either the IP is close to the beginning of the global range (in which case the ‘my_ip >= start_ip‘ part is not selective) or it is nearer the upper bound (in which case the ‘my_ip <= end_ip‘ part is not selective), or is somewhere in the middle, in which case none is selective.
In fact, I cannot imagine MySQL would choose to use index_merge at all, and so at most one index is used, if not full table scan again.
A solution
An important step towards a solution is the realization that the IP ranges are mutually exclusive. No IP can lie in any two ranges, just one (at least, this is the data I’ve worked with. If you have hierarchical ranges, you’ll need to make adjustments). This means I don’t really need to index both columns. One would suffice. Say I was to put the following index:
KEY start_ip_idx (start_ip)
We’ve seen that the presented query won’t run well on this. Can we rewrite the query as well? Sure! Here’s one that will work:
SELECT * FROM regions_ip_range WHERE start_ip <= my_ip ORDER BY start_ip DESC LIMIT 1
What we’re asking for, now, is the first range for which our IP is larger than the range’s start, reading backwards. What the optimizer needs to do is find the first entry for which start_ip <= my_ip, using the index, and then… oh, there’s no need to go on, as we have LIMIT 1.
If this seems confusing, you can do the opposite. Define this key:
KEY end_ip_idx (end_ip)
And use this query, instead:
SELECT * FROM regions_ip_range WHERE my_ip <= end_ip ORDER BY end_ip ASC LIMIT 1
It’s interesting that EXPLAIN would still claim it’s going to scan a large number of rows, since it does not take the LIMIT 1 into account.
I’ve written before about the differences between storage engines in the way they recommend the optimizer to use (or not to use) an index. So you may need to end up with a FORCE_INDEX after all.
Assumptions
I’ve made a few assumptions here:
- The table lists ranges are covering: they start with 0.0.0.0 and end with 255.255.255.255.
- There are no ‘holes’ in the table. Meaning there’s bound to be a range for any given IP.
- IP ranges are mutually exclusive (no hierarchical IP ranges)
If the first two assumptions are not met, it should be checked, once the query returns, that my_ip is indeed between start_ip and end_ip.
If assumption #3 is not met, the data can be split to two tables: one must hold the mutually exlusive data; the second one may contain whatever data you have, possibly utilizing some hierarchial algorithm such as nested sets etc.
FYI – the geo_ip_blocks table has 4M rows.
@PB – #9
suspiciously identical comment to Jason #3??
Anyway, see my comment, #4
@Michael,
very nice!
Out of curiosity: is there no way a range consists of more than one classB?
I didn’t consider the case of a no-match when I attacked this problem myself way back when. Although the SQL starts to get a little ugly, using a sub-query should fix the performance hit there too.
SELECT * FROM (
SELECT * FROM regions_ip_range
WHERE my_ip >= start_ip
ORDER BY start_ip LIMIT 1
) AS t
WHERE my_ip <= end_ip;
I haven’t tested this performance-wise but it seems right theoretically… Either way, your spot on both of your “wrong ways”. 🙂
@Jason,
This seems right to me, and was what I meant… 🙂