SQL: Ranking without self join

September 14, 2009

The common way of solving the classic SQL problem of ranking, involves a  self join. I wish to present a different solution, which only iterates the table once, and provides the same output.

The ranking problem

Given a table with names and scores (e.g. students exams scores), add rank for each row, such that the rank identifies her position among other rows. Rows with identical scores should receive the same rank (e.g. both contenders got the silver medal).

Consider the following table (download score.sql):

mysql> select * from score;
+----------+--------------+-------+
| score_id | student_name | score |
+----------+--------------+-------+
|        1 | Wallace      |    95 |
|        2 | Gromit       |    97 |
|        3 | Shaun        |    85 |
|        4 | McGraw       |    92 |
|        5 | Preston      |    92 |
+----------+--------------+-------+
5 rows in set (0.00 sec)

We wish to present ranks in some way similar to:

+----------+--------------+-------+------+
| score_id | student_name | score | rank |
+----------+--------------+-------+------+
|        2 | Gromit       |    97 |    1 |
|        1 | Wallace      |    95 |    2 |
|        4 | McGraw       |    92 |    3 |
|        5 | Preston      |    92 |    3 |
|        3 | Shaun        |    85 |    4 |
+----------+--------------+-------+------+

Note that McGraw and Preston got same scores, therefore both share 3rd rank, whereas Shaun gets ranked #4.

The common solution

Following is the SQL for the common solution:

SELECT
  s1.score_id, s1.student_name, s1.score, COUNT(DISTINCT s2.score) AS rank
FROM
  score s1 JOIN score s2 ON (s1.score <= s2.score)
GROUP BY s1.score_id
;
+----------+--------------+-------+------+
| score_id | student_name | score | rank |
+----------+--------------+-------+------+
|        1 | Wallace      |    95 |    2 |
|        2 | Gromit       |    97 |    1 |
|        3 | Shaun        |    85 |    4 |
|        4 | McGraw       |    92 |    3 |
|        5 | Preston      |    92 |    3 |
+----------+--------------+-------+------+

(The above can by ORDERed at will, more on this later)

What I'm suggesting is that this self join is an overkill. It recalculates over and over what we knew a second before: to get the Preston's rank, we need to count how many students got score >=92. But when we need to find Shaun's rank, we re-iterate these, and in addition count those with grades 85..91. We're reading, re-reading, then re-re-reading (you get the point) the same data all over again. It's a waste of energy.

Offering a new solution

I propose a simpler solution: do a one-time sorting of rows according to score (descending). The first row in the sorted set should obviously get the score "1". Now we iterate the rows one by one, and keep a rank variable. Whenever the score remains the same, we just keep on iterating. Whenever the score changes (it can only change in the direction of "downwards", since we sorted by score. descending), we increment the rank.

Actually, the above explanation makes it sound as if we do this with multiple steps. This is not so. We do this all in one step:

SELECT
  score_id, student_name, score,
  @prev := @curr,
  @curr := score,
  @rank := IF(@prev = @curr, @rank, @rank+1) AS rank
FROM
  score,
  (SELECT @curr := null, @prev := null, @rank := 0) sel1
ORDER BY score DESC
;
+----------+--------------+-------+----------------+----------------+------+
| score_id | student_name | score | @prev := @curr | @curr := score | rank |
+----------+--------------+-------+----------------+----------------+------+
|        2 | Gromit       |    97 |           NULL |             97 |    1 |
|        1 | Wallace      |    95 |             97 |             95 |    2 |
|        4 | McGraw       |    92 |             95 |             92 |    3 |
|        5 | Preston      |    92 |             92 |             92 |    3 |
|        3 | Shaun        |    85 |             92 |             85 |    4 |
+----------+--------------+-------+----------------+----------------+------+

Execution plan comparison

Do we have an index on the score column?

If not, I clearly win. The self join (when there's more than mere 5 rows, of course) will make for repeated full table scans, thereby making for O(n²).

+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
|  1 | SIMPLE      | s1    | ALL  | NULL          | NULL | NULL    | NULL |    5 | Using temporary; Using filesort |
|  1 | SIMPLE      | s2    | ALL  | NULL          | NULL | NULL    | NULL |    5 | Using where                     |
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+

Whereas the suggested solution requires one filesort. This still means table data can be re-read, but significantly less so: it only takes O(n*log(n)), where the log(n) part is usually very small (and it all depend on sort_buffer_size).

+----+-------------+------------+--------+---------------+------+---------+------+------+----------------+
| id | select_type | table      | type   | possible_keys | key  | key_len | ref  | rows | Extra          |
+----+-------------+------------+--------+---------------+------+---------+------+------+----------------+
|  1 | PRIMARY     | <derived2> | system | NULL          | NULL | NULL    | NULL |    1 | Using filesort |
|  1 | PRIMARY     | score      | ALL    | NULL          | NULL | NULL    | NULL |    5 |                |
|  2 | DERIVED     | NULL       | NULL   | NULL          | NULL | NULL    | NULL | NULL | No tables used |
+----+-------------+------------+--------+---------------+------+---------+------+------+----------------+

Testing on the somewhat larger sakila.film table (1000 rows is all) on my laptop, it takes 47 seconds for the common query to complete, 0.01 seconds to presented solution (repeatedly, no cache issues).

What happens when we do have an index? Again, I win. Testing on sakila.film now, having added an index on location column:

+----+-------------+-------+-------+---------------+--------+---------+------+------+----------------------------------------------+
| id | select_type | table | type  | possible_keys | key    | key_len | ref  | rows | Extra                                        |
+----+-------------+-------+-------+---------------+--------+---------+------+------+----------------------------------------------+
|  1 | SIMPLE      | s2    | index | length        | length | 3       | NULL | 1140 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | s1    | ALL   | length        | NULL   | NULL    | NULL | 1140 | Using where                                  |
+----+-------------+-------+-------+---------------+--------+---------+------+------+----------------------------------------------+

The above still needs to do index scan. The GROUP BY requires either sorting or utilizing the PRIMARY KEY, and the execution plan with reversed tables ordering does not improve.

Presented solution utilized index for a single pass, with O(n) complexity:

+----+-------------+------------+--------+---------------+--------+---------+------+------+----------------+
| id | select_type | table      | type   | possible_keys | key    | key_len | ref  | rows | Extra          |
+----+-------------+------------+--------+---------------+--------+---------+------+------+----------------+
|  1 | PRIMARY     | <derived2> | system | NULL          | NULL   | NULL    | NULL |    1 |                |
|  1 | PRIMARY     | film       | index  | NULL          | length | 3       | NULL | 1140 |                |
|  2 | DERIVED     | NULL       | NULL   | NULL          | NULL   | NULL    | NULL | NULL | No tables used |
+----+-------------+------------+--------+---------------+--------+---------+------+------+----------------+

This is done with a single index pass. Again, on my laptop, I get ~41 seconds for 1st query, 0.01 seconds for the proposed solution.

I also argue that in addition, the results would often be required by order of rank, in which case the common solution must eventually sort anyhow.

Conclusion

To be honest, I've seen the self-join solution in so many places: textbooks, training material, online tutorials... Maybe it's just a silly exercise, perhaps not your daily real-world task; but it's one of those classic SQL problems. The so-often-repeated common solution is ANSI SQL, for sure, but at what cost?

tags: ,
posted in MySQL by shlomi

« | »

Follow comments via the RSS Feed | Leave a comment | Trackback URL

33 Comments to "SQL: Ranking without self join"

  1. GSnyder wrote:

    This is very interesting, but I wonder if it's technically correct. According to the MySQL 5.1 manual:

    "The order of evaluation for user variables is undefined and may change based on the elements contained within a given query. In SELECT @a, @a := @a+1 ..., you might think that MySQL will evaluate @a first and then do an assignment second, but changing the query (for example, by adding a GROUP BY, HAVING, or ORDER BY clause) may change the order of evaluation...The general rule is never to assign a value to a user variable in one part of a statement and use the same variable in some other part of the same statement. You might get the results you expect, but this is not guaranteed."

    Reference: http://dev.mysql.com/doc/refman/5.1/en/user-variables.html

    Also, the cross join to set the initial values of the variables seems unnecessarily complex - does this really have a defined behavior? You could just precede the SELECT with a separate SET to initialize the variables. True, it's no longer a single step, but in what context would this really matter?

  2. Toby wrote:

    Many self joins are fundamentally quadratic, so these results should not be very surprising. What would be interesting would be an optimiser that could convert the self join expression into a less-than-quadratic plan.

  3. Jay Pipes wrote:

    :) Now try your trick when you need a GROUP BY ;)

  4. Rob Wultsch wrote:

    Now if MySQL had windowing functions...

  5. shlomi wrote:

    @GSnyder

    Thanks for the reference. A problem indeed.
    As for the cross join: it actually does have a defined behavior: the evaluation of "tables" happens first. which means the assignment of rank := 0, for example, happens before any row is iterated,

  6. shlomi wrote:

    @Jay,
    Am prepared to abandon the trick altogether due to the undefined variables order behavior. Why, oh why?

  7. Roland Bouman wrote:

    Hi!

    it is a nice trick, and it reminds me of what I did about a yearr ago on calculating Percentiles:

    http://rpbouman.blogspot.com/2008/07/calculating-percentiles-with-mysql.html

    Unfortunately, it is indeed unreliable.

    But there is another way, that is still better than the cartesian self join. This alternative solution relies on GROUP_CONCAT and can be written using either a scalar subquery or a derived table (which still is kind of a self join in this case)

    You can read more about it here:
    http://rpbouman.blogspot.com/2009/09/mysql-another-ranking-trick.html

  8. Giuseppe Maxia wrote:

    Shlomi,
    Nice post. Jay wrote a nice article about a similar matter some time ago.
    http://dev.mysql.com/tech-resources/articles/rolling_sums_in_mysql_followup.html

    Cheers

    Giuseppe

  9. shlomi wrote:

    @Roland:
    Thanks, and have commented in your post as well!

    @Giuseppe,
    Once again, thanks for the reference!

  10. On user variables evaluation order | code.openark.org wrote:

    [...] SQL: Ranking without self join [...]

  11. Log Buffer #162: a Carnival of the Vanities for DBAs | Pythian Group Blog wrote:

    [...] Noach shared some of his thoughts on ranking without self join in SQL, which he introduces thus: “The common way of solving the classic SQL problem of ranking, [...]

  12. Log Buffer #162: a Carnival of the Vanities for DBAs « PlanetMysql.ru – информация о СУБД MySQL wrote:

    [...] Noach shared some of his thoughts on ranking without self join in SQL, which he introduces thus: “The common way of solving the classic SQL problem of ranking, [...]

  13. shlomi wrote:

    Yikes!!

    Browsing through High Performance MySQL (2nd Ed.), I suddenly notice this has all been discussed at length, at the end of chapter 4.
    Seems like I've mostly repeated things that have been said there; I apparently offer little new.

    Shlomi

  14. PC wrote:

    I am using SQLite and I seem to need to use the Self-Join method. If you had a "rank" column, is there a way to set this field for all rows with one statement?

    For example:

    UPDATE score SET rank = (SELECT COUNT(DISTINCT s2.score) AS rank
    FROM score s1 JOIN score s2 ON (s1.score <= s2.score)
    GROUP BY s1.score_id)

    does not work as it always returns the same value.

  15. shlomi wrote:

    @PC

    Does SQLite support the following syntax?

    UPDATE t1 USING t1, t2 WHERE t1.c1 = t2.c1 ...

    If so, that's the easy solution, I believe

  16. Roland Bouman wrote:

    @shlomi: going by http://www.sqlite.org/lang_update.html, it seems this is not supported (it is after all non-standard SQL)

    @PC: I think your statement should read:

    UPDATE score s
    SET rank = (
    SELECT COUNT(DISTINCT r.score)
    FROM score r
    WHERE s.score <= r.score
    )

  17. sergio wrote:

    Hello I read your post SQL: Ranking without self join .

    This was great but I have a question for you; lets say the table held an added column called test_id denoting seperate tests scores; is it possible to have your solution rank the scores based on the different tests; so it would rank the top scores in one test_id and then rank scores for a different test_id where you would have over lapping ranks in the table for different tests.

    Sergio

  18. shlomi wrote:

    @sergio
    Yes, it is possible; you would need to maintain two sets of variables ; the primary being the test_id, the secondary being for the score.
    Whenever the test id changes, you reset the score variables.
    you first ORDER BY test_id DESC, SCORE DESC.

  19. sergio wrote:

    Hello again, I have tried what you said but it doesnt seem to work, can you provide a little more insigth really getting fustrated. Also as an added question to add to the previous question on top of the test_ids for multiple tests stored in the same table, what if another column was added with bonus_score which rank is determined first by the regular score and if equal, then the bonus score is used to determine the ranks? I really appricate any insight you can provide as this is very frustrating to me.

  20. Toby wrote:

    Sergio,
    You could always just use the self join approach, which makes your scenarios simple.

  21. Roland Bouman wrote:

    Hi Sergio,

    post your table definition, some 20 rows of sample data that illustrate the desired result.

  22. Pau Sanchez wrote:

    Fantastic solution!

    I was looking for a solution and it come up after looking to this post ;)

    My problem was having a sort_index field used to sort some elements in a particular order.

    The thing is that the indexes were not consecutive, and that caused some trouble for some tasks I'm not going to talk about.

    Thanks to your solution I came up with:

    SET @pos := 0;
    UPDATE sort_example
    SET sort_index = (SELECT @pos := @pos+1)
    ORDER BY sort_index ASC

    So if before running this code the sort_indexes of several rows were:
    1, 32, 34, 58, 99, 1028, 9982

    After running the update they turned into:
    1, 2, 3, 4, 5, 6, 7

    Just wanted to share, maybe it can be useful to someone else.

    Thanks!

  23. shlomi wrote:

    @Pau

    Very nice! Thanks for sharing

  24. MySQL: Another Ranking trick « JZ Talk Blogger wrote:

    [...] Another Ranking trick I just read SQL: Ranking without self join, in which Shlomi Noach shares a nice MySQL-specific trick based on user-defined variables to [...]

  25. George wrote:

    Hi Shlomi,

    This piece of gold has already helped me immensely, thanks.

    As per #18, how would I reset the other variable? I cannot seem to find anything on mysql.com

  26. shlomi wrote:

    George,

    Reset as in @c := 0, nothing fancy...

  27. links for 2010-06-25 | Digitalistic - Mashup or die trying wrote:

    [...] SQL: Ranking without self join | code.openark.org (tags: sql join ranking optimization mysql) [...]

  28. Mark Plutowski wrote:

    @shlomi: thanks for the neat trick of using cross join to initialize the variables.

    @GSnyder : this matters when using a tool (e.g., BI or reporting tool) that only allows entering a single SQL statement.

    @Jay: as you say: "Now try your trick when you need a GROUP BY" -- Indeedy, unfort'y this is the showstopper for me too. I can't get it to work with a GROUP BY.

    But hey, I still got one neat trick out of this, so thanks again for that Shlomi !

  29. shlomi wrote:

    @Mark,

    Was taught that a couple of years ago here:
    http://code.openark.org/blog/mysql/dynamic-sequencing-with-a-single-query#comment-63

  30. bigcode wrote:

    I don't have an example of doing a rank without the self join but my example does allow for grouping.
    http://bigcode.wordpress.com/2010/12/11/5/

  31. Subrata wrote:

    Try this one

    SELECT
    s1.score_id, s1.student_name, s1.score, COUNT(DISTINCT s2.score) AS rank
    FROM
    score s1 JOIN score s2 ON (s1.score <= s2.score)
    GROUP BY s1.score_id ORDER BY rank

  32. Nandhagopal wrote:

    I am new to the world of sql, so I apologise if my doubt sounds childish... In your examples, if two people are tied for the second spot then the one following them is 3rd. In actual conventional ranking systems, we place that fellow 4th. So, how would you change your query to incorporate it such that if m people are tied for the n'th rank, then the next person after them must be ranked (m+n).

  33. shlomi wrote:

    @Nandhagopal,
    You would use something like this:
    @step := IF(@prev = @curr, @step+1, 1) AS step
    @rank := IF(@prev = @curr, @rank, @rank+@step) AS rank
    ...
    SELECT @step := 1

Leave Your Comment

 
Powered by Wordpress and MySQL. Theme by openark.org