Comments on: SQL: finding a user’s country/region based on IP

By: Joshua K Roberson

Joshua K Roberson — Sun, 02 Aug 2009 04:32:26 +0000

Here are different ways to store an IP or an IP range as well as query for the IPs. http://strictcoder.blogspot.com/2009/08/different-ways-to-query-for-ip-in-your.html

By: shlomi

shlomi — Thu, 28 May 2009 03:13:33 +0000

@Michael,

Great! Thanks for the benchmarks!

By: Michael

Michael — Wed, 27 May 2009 21:02:58 +0000

@shlomi,

D’oh! Yes, you’re completely right. I wasn’t reading sql statement correctly, and there was a typo in it (using >= instead of <=) in my test cases. (Note to self, do not comment until after coffee has been had.)

I did do some benchmarking on some stuff here.

The winner is definitely the “SELECT * (SELECT * .. WHERE my_ip <= ip_begin..) AS T …” at least on my box configuration. – approx 5K queries/sec (single client)

The worse is of course, the “ip BETWEEN start and end” at a whole whopping 3 queries/sec, and that’s with the optimizer claiming to use start ip index.

My version using classb was able to crank out 2K queries/sec. Not bad, but only half as good as the SELECT(SELECT *) AS T.. one.

By: shlomi

shlomi — Wed, 27 May 2009 19:29:45 +0000

@Michael,

I disagree. The ‘229512’ rows reported in first version’s subquery does not really hold. That’s what’s peculiar about EXPLAIN: it does not (maybe can not?) take the LIMIT into consideration. I assure you (by testing it and seeing the performance, that is) that the number reported is not actually the number of rows to be scanned in this plan. The real number is 1.

Please notice that this is regardless of the subquery: see the EXPLAIN plan discussion within my post.

The second step involved is really nothing, since it only needs to scan a table of a single row. So no real impact on memory or CPU here.

Whereas in the second version you’ve provided, and as I explained in my post, 229512 may actually be a reasonable estimation for number of rows to scan.

Regards

By: Michael

Michael — Wed, 27 May 2009 17:15:12 +0000

@jason

Unfortunately, that expression is worse. The sub-select will grab ALL rows above the starting IP, create a derived table, and then scan through that derived table which *does not* have an index. The lower the IP number is, the more rows that will have been dumped into the derived table.

Under MySQL, derived tables do not inherit any indexes from the original table, or tables, it was created from. It would be awesome if it did, but it doesn’t.

The MySQL optimizer ain’t too happy about it:

explain SELECT * FROM (select * from geo_ip_blocks WHERE 412306898 >= ip_begin ORDER BY ip_begin limit 1) AS T where 412306898 <= ip_end\G

*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: NULL
type: NULL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: NULL
Extra: Impossible WHERE noticed after reading const tables
*************************** 2. row ***************************
id: 2
select_type: DERIVED
table: geo_ip_blocks
type: range
possible_keys: ip_begin,index_on_ip_begin,classb_begin
key: ip_begin
key_len: 4
ref: NULL
rows: 229512
Extra: Using where

Now look at the following SQL which is what we all naturally want to try first when attacking this problem:

explain SELECT * FROM geo_ip_blocks WHERE 412306898 BETWEEN ip_begin AND ip_end\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: geo_ip_blocks
type: range
possible_keys: ip_begin,ip_end,index_on_ip_begin,classb_begin
key: ip_begin
key_len: 4
ref: NULL
rows: 229512
Extra: Using where

In both cases the estimated number of rows to scan is the same: 229512. The first SQL statement is going to be worse since there’s a second step involved with having to create a derived table.

The second SQL only has one action involved, which is exactly the same as the one in the first SQL.

By: shlomi

shlomi — Wed, 27 May 2009 08:40:00 +0000

@Jason,

This seems right to me, and was what I meant… 🙂

By: Jason Stubbs

Jason Stubbs — Wed, 27 May 2009 06:52:10 +0000

I didn’t consider the case of a no-match when I attacked this problem myself way back when. Although the SQL starts to get a little ugly, using a sub-query should fix the performance hit there too.

SELECT * FROM (
SELECT * FROM regions_ip_range
WHERE my_ip >= start_ip
ORDER BY start_ip LIMIT 1
) AS t
WHERE my_ip <= end_ip;

I haven’t tested this performance-wise but it seems right theoretically… Either way, your spot on both of your “wrong ways”. 🙂

By: shlomi

shlomi — Wed, 27 May 2009 06:23:08 +0000

@Michael,
very nice!
Out of curiosity: is there no way a range consists of more than one classB?

By: shlomi

shlomi — Wed, 27 May 2009 06:20:39 +0000

@PB – #9
suspiciously identical comment to Jason #3??
Anyway, see my comment, #4

By: Michael

Michael — Wed, 27 May 2009 06:14:44 +0000

FYI – the geo_ip_blocks table has 4M rows.