SQL: finding a user's country/region based on IP

May 26, 2009

I've encountered the same problem twice for different customers, so I guess it's worth a discussion.

A common task for web applications is to find out the country/region of a user, based on her IP address, as can be detected in the HTTP request. Depending on the country of origin, the website can translate dates for different time zones, can change locale settings, and, perhaps most commonly, show advertisements in her native language.

To start with, there's a table which lists the IP ranges per country/region. Let's assume we're only dealing with IPv4:

CREATE TABLE regions_ip_range (
  regions_ip_range_id INT UNSIGNED AUTO_INCREMENT,
  country VARCHAR(64) CHARSET utf8,
  region VARCHAR(64) CHARSET utf8,
  start_ip INT UNSIGNED,
  end_ip INT UNSIGNED,
  …
  PRIMARY KEY(regions_ip_range_id),
  ...
);

The table is fixed, and is populated. Now the question arises: how do we query this table, and which indexes should be created?

The wrong way

The form I've encountered is as follows: an index is declared on regions_ip_range:

KEY ip_range_idx (start_ip, end_ip)

And the query goes like this:

SELECT * FROM regions_ip_range
WHERE my_ip BETWEEN start_ip AND end_ip

It takes a grasp of indexes to understand that this is wrong. I'm not saying the results are wrong, just that the query performance is bad. Let's rewrite the query to understand why. The following query is the exact equal of the above:

SELECT * FROM regions_ip_range
WHERE my_ip >= start_ip AND my_ip <= end_ip

Can you see the problem?

There's a range condition on the first indexed column (start_ip). This automatically negates the use of the second column (end_ip). Reversing the order won't do, since there's also a range condition on end_ip.

Effectively, if this were the only query we were executing, we would get the same performance had we defined the following index:

KEY ip_range_idx (start_ip)

Now that doesn't look promising. It's fair to guess (as happens in reality) that for the vast majority of ip addresses, MySQL would rather perform a full table scan than use the index.

Another wrong way

When pointing this to people, the natural response is: "OK, then, let's index like this:"

KEY start_ip_idx (start_ip)
KEY end_ip_idx (end_ip)

Now we have two indexes, one on each address. But that won't do at all. Even if we assume MySQL will use both indexes for our query, and do an index_merge, we won't have good performance. Consider: you can't have both indexes be selective for any given IP. Either the IP is close to the beginning of the global range (in which case the 'my_ip >= start_ip' part is not selective) or it is nearer the upper bound (in which case the 'my_ip <= end_ip' part is not selective), or is somewhere in the middle, in which case none is selective.

In fact, I cannot imagine MySQL would choose to use index_merge at all, and so at most one index is used, if not full table scan again.

A solution

An important step towards a solution is the realization that the IP ranges are mutually exclusive. No IP can lie in any two ranges, just one (at least, this is the data I've worked with. If you have hierarchical ranges, you'll need to make adjustments). This means I don't really need to index both columns. One would suffice. Say I was to put the following index:

KEY start_ip_idx (start_ip)

We've seen that the presented query won't run well on this. Can we rewrite the query as well? Sure! Here's one that will work:

SELECT * FROM regions_ip_range
WHERE start_ip <= my_ip
ORDER BY start_ip DESC LIMIT 1

What we're asking for, now, is the first range for which our IP is larger than the range's start, reading backwards. What the optimizer needs to do is find the first entry for which start_ip <= my_ip, using the index, and then... oh, there's no need to go on, as we have LIMIT 1.

If this seems confusing, you can do the opposite. Define this key:

KEY end_ip_idx (end_ip)

And use this query, instead:

SELECT * FROM regions_ip_range
WHERE my_ip <= end_ip
ORDER BY end_ip ASC LIMIT 1

It's interesting that EXPLAIN would still claim it's going to scan a large number of rows, since it does not take the LIMIT 1 into account.

I've written before about the differences between storage engines in the way they recommend the optimizer to use (or not to use) an index. So you may need to end up with a FORCE_INDEX after all.

Assumptions

I've made a few assumptions here:

  1. The table lists ranges are covering: they start with 0.0.0.0 and end with 255.255.255.255.
  2. There are no 'holes' in the table. Meaning there's bound to be a range for any given IP.
  3. IP ranges are mutually exclusive (no hierarchical IP ranges)

If the first two assumptions are not met, it should be checked, once the query returns, that my_ip is indeed between start_ip and end_ip.

If assumption #3 is not met, the data can be split to two tables: one must hold the mutually exlusive data; the second one may contain whatever data you have, possibly utilizing some hierarchial algorithm such as nested sets etc.

tags: , ,
posted in MySQL by shlomi

« | »

Follow comments via the RSS Feed | Leave a comment | Trackback URL

20 Comments to "SQL: finding a user's country/region based on IP"

  1. Rich wrote:

    Smart stuff.

    On a tangential note: as an English-speaker living in Switzerland, I find the most common use case of IP geolocation (serving ads in the user's "native" language) to be extremely frustrating.

    There are three (some say four) official languages in Switzerland: German, French, Italian (and Romantsch) and the English-speaking expat community is also fairly large (around 5% of the population).

    I'm currently in Basel, but thanks to dynamic IP allocation, geolocation has been known to place me in Geneva, Zurich, Aarau and even Bellinzona. As such, I see ads served in German, French or Italian. The most frustrating thing about this is that the biggest offender is Facebook, which is well aware of my native language from my profile!

    Sorry about the tangent, but that aside, this is a great post - interesting indeed that EXPLAIN ignores the LIMIT clause.

  2. shlomi wrote:

    @Rich,

    Thanks.
    What an annoyance this must be with all those languages being mixed up!
    I'm a Hebrew speaker myself, living in Israel, and still I find it shocking to read Hebrew advertisements when I surf through English/American websites.

    To be honest, I just can't stand it that in some websites I get "job offers" in my province.

    Worse yet, of course, are "lonely women who want to meet me" in my province, though a funny thing here: for some reason, and I suspect a smart and cunning Israeli mind behind it, much of the information about the provinces in Israel is inaccurate. So that the "lonely woman who wants to meet me" lives in the village of Nehalim. That's so funny, because Nehalim has maybe 300 families, all religious, and yet, so many pretty lonely women live there!

    With regard to EXPLAIN ignoring the LIMIT: the LIMIT is not ignored in the query plan (so, without LIMIT you may get a full table scan, with LIMIT you get indexed search), but the number of estimated rows, as reported in EXPLAIN, does not seem to be affected by LIMIT.

    Regards

  3. Jason Stubbs wrote:

    You are completely right about only needing an index on one of the fields. However, I don't completely agree with your final query. Taking the original query and just tacking on the ORDER BY and LIMIT to it will get you the performance increase without decreasing readability of the query through hidden assumptions. (Non-hierarchical data is more a design decision.)

  4. shlomi wrote:

    @Jason,
    If I understand correctly, you're suggesting an index on a single column (say, on end_ip), and querying:

    SELECT * FROM regions_ip_range
    WHERE my_ip BETWEEN start_ip AND end_ip
    ORDER BY end_ip ASC LIMIT 1

    If this is what you meant, then there's an issue when the contiguity assumption fails. In this case, and when my_ip is inside one of the "holes" this query keeps searching for rows where the end_ip >= start_ip, until it ultimately fails. So it searches a lot of rows. Whereas the query I suggested ultimately searches one row, albeit it needs to be verified that the result is valid, when said assumption does not hold.

    The non-hierarchical data is an altogether different assumption which cannot be overcome using the above technique, hence the mentioned split.

  5. Teg wrote:

    In my opinion using a database for this is wrong. Use libgeoip instead. ;)

  6. Jeremy Cole wrote:

    Hi Schlomi,

    I've also written about this before:

    http://jcole.us/blog/archives/2007/11/24/on-efficiently-geo-referencing-ips-with-maxmind-geoip-and-mysql-gis/

    For the most part, this is an interesting case of MySQL's optimizer being *incredibly* hard to please. I tested quite a few variants of the query in the course of playing with GIS for it. It's quite interesting how differently a very similar query can perform.

    Regards,

    Jeremy

  7. shlomi wrote:

    @Teg,

    Thanks, maybe I'll take the time to see how they implemented it in libgeoip. Do you know exactly how they did it?

  8. shlomi wrote:

    @Jeremy,

    Thanks for the link to your post, good read! Interesting conversion to GIS.

    As another optimization, we had to run a one time long conversion and matching of all ranges at once, so the solution was still too slow. What I did was to make the table an InnoDB with the PRIMARY KEY on from_ip.
    With this enhancement, total runtime dropped by hundreds of percents. I don't have numbers or measures though.

    Regards

  9. PB wrote:

    You are completely right about only needing an index on one of the fields. However, I don't completely agree with your final query. Taking the original query and just tacking on the ORDER BY and LIMIT to it will get you the performance increase without decreasing readability of the query through hidden assumptions. (Non-hierarchical data is more a design decision.)

  10. Michael wrote:

    Hi,

    My solution to trying to speed up IP lookups was to group the start IP addresses into their Class B network group (i.e. the upper 16-bits) and using that value and the ip_start as a key pair.

    There's 27K+ unique class B values which can be used to group the IP addreses effectively from an index-pair perspective.

    To construct the class B value:

    classb is declared as INTEGER NOT NULL
    The index is (classb, ip_start)

    UPDATE geo_ip_blocks SET classb = (ip_start >> 16)

    Here's an example query to look up "157.166.226.26" (aka http://www.cnn.com):

    SELECT * FROM `geo_ip_blocks` WHERE classb=40358 and (2644959770 >= ip_start and 2644959770 <= ip_end) LIMIT 1;

    40358 = 157.166.226.26 shifted by 16 bits
    2644959770 = 157.166.226.26

    Here's the EXPLAIN:

    *************************** 1. row ***************************
    id: 1
    select_type: SIMPLE
    table: geo_ip_blocks
    type: range
    possible_keys: ip_start,ip_end,index_on_ip_start,classb_begin,index_on_classb_ip_start
    key: index_on_classb_ip_start
    key_len: 8
    ref: NULL
    rows: 14
    Extra: Using where

    Notice how the estimated rows to scan is only 14 for this particular query?

  11. Michael wrote:

    FYI - the geo_ip_blocks table has 4M rows.

  12. shlomi wrote:

    @PB - #9
    suspiciously identical comment to Jason #3??
    Anyway, see my comment, #4

  13. shlomi wrote:

    @Michael,
    very nice!
    Out of curiosity: is there no way a range consists of more than one classB?

  14. Jason Stubbs wrote:

    I didn't consider the case of a no-match when I attacked this problem myself way back when. Although the SQL starts to get a little ugly, using a sub-query should fix the performance hit there too.

    SELECT * FROM (
    SELECT * FROM regions_ip_range
    WHERE my_ip >= start_ip
    ORDER BY start_ip LIMIT 1
    ) AS t
    WHERE my_ip <= end_ip;

    I haven't tested this performance-wise but it seems right theoretically... Either way, your spot on both of your "wrong ways". :)

  15. shlomi wrote:

    @Jason,

    This seems right to me, and was what I meant... :)

  16. Michael wrote:

    @jason

    Unfortunately, that expression is worse. The sub-select will grab ALL rows above the starting IP, create a derived table, and then scan through that derived table which *does not* have an index. The lower the IP number is, the more rows that will have been dumped into the derived table.

    Under MySQL, derived tables do not inherit any indexes from the original table, or tables, it was created from. It would be awesome if it did, but it doesn't.

    The MySQL optimizer ain't too happy about it:

    explain SELECT * FROM (select * from geo_ip_blocks WHERE 412306898 >= ip_begin ORDER BY ip_begin limit 1) AS T where 412306898 <= ip_end\G

    *************************** 1. row ***************************
    id: 1
    select_type: PRIMARY
    table: NULL
    type: NULL
    possible_keys: NULL
    key: NULL
    key_len: NULL
    ref: NULL
    rows: NULL
    Extra: Impossible WHERE noticed after reading const tables
    *************************** 2. row ***************************
    id: 2
    select_type: DERIVED
    table: geo_ip_blocks
    type: range
    possible_keys: ip_begin,index_on_ip_begin,classb_begin
    key: ip_begin
    key_len: 4
    ref: NULL
    rows: 229512
    Extra: Using where

    Now look at the following SQL which is what we all naturally want to try first when attacking this problem:

    explain SELECT * FROM geo_ip_blocks WHERE 412306898 BETWEEN ip_begin AND ip_end\G
    *************************** 1. row ***************************
    id: 1
    select_type: SIMPLE
    table: geo_ip_blocks
    type: range
    possible_keys: ip_begin,ip_end,index_on_ip_begin,classb_begin
    key: ip_begin
    key_len: 4
    ref: NULL
    rows: 229512
    Extra: Using where

    In both cases the estimated number of rows to scan is the same: 229512. The first SQL statement is going to be worse since there's a second step involved with having to create a derived table.

    The second SQL only has one action involved, which is exactly the same as the one in the first SQL.

  17. shlomi wrote:

    @Michael,

    I disagree. The '229512' rows reported in first version's subquery does not really hold. That's what's peculiar about EXPLAIN: it does not (maybe can not?) take the LIMIT into consideration. I assure you (by testing it and seeing the performance, that is) that the number reported is not actually the number of rows to be scanned in this plan. The real number is 1.

    Please notice that this is regardless of the subquery: see the EXPLAIN plan discussion within my post.

    The second step involved is really nothing, since it only needs to scan a table of a single row. So no real impact on memory or CPU here.

    Whereas in the second version you've provided, and as I explained in my post, 229512 may actually be a reasonable estimation for number of rows to scan.

    Regards

  18. Michael wrote:

    @shlomi,

    D'oh! Yes, you're completely right. I wasn't reading sql statement correctly, and there was a typo in it (using >= instead of <=) in my test cases. (Note to self, do not comment until after coffee has been had.)

    I did do some benchmarking on some stuff here.

    The winner is definitely the "SELECT * (SELECT * .. WHERE my_ip <= ip_begin..) AS T ..." at least on my box configuration. - approx 5K queries/sec (single client)

    The worse is of course, the "ip BETWEEN start and end" at a whole whopping 3 queries/sec, and that's with the optimizer claiming to use start ip index.

    My version using classb was able to crank out 2K queries/sec. Not bad, but only half as good as the SELECT(SELECT *) AS T.. one.

  19. shlomi wrote:

    @Michael,

    Great! Thanks for the benchmarks!

  20. Joshua K Roberson wrote:

    Here are different ways to store an IP or an IP range as well as query for the IPs.
    http://strictcoder.blogspot.com/2009/08/different-ways-to-query-for-ip-in-your.html

Leave Your Comment

 
Powered by Wordpress and MySQL. Theme by openark.org