{"id":1293,"date":"2009-09-14T18:16:11","date_gmt":"2009-09-14T16:16:11","guid":{"rendered":"http:\/\/code.openark.org\/blog\/?p=1293"},"modified":"2009-09-14T18:16:11","modified_gmt":"2009-09-14T16:16:11","slug":"sql-ranking-without-self-join","status":"publish","type":"post","link":"https:\/\/code.openark.org\/blog\/mysql\/sql-ranking-without-self-join","title":{"rendered":"SQL: Ranking without self join"},"content":{"rendered":"<p>The common way of solving the classic SQL problem of ranking, involves a\u00a0 self join. I wish to present a different solution, which only iterates the table once, and provides the same output.<\/p>\n<h4>The ranking problem<\/h4>\n<p>Given a table with names and scores (e.g. students exams scores), add rank for each row, such that the rank identifies her position among other rows. Rows with identical scores should receive the same rank (e.g. both contenders got the silver medal).<\/p>\n<p>Consider the following table (download <a href=\"http:\/\/code.openark.org\/blog\/wp-content\/uploads\/2009\/09\/score.sql\">score.sql<\/a>):<\/p>\n<blockquote>\n<pre>mysql&gt; select * from score;\r\n+----------+--------------+-------+\r\n| score_id | student_name | score |\r\n+----------+--------------+-------+\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 1 | Wallace\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 95 |\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2 | Gromit\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 97 |\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 3 | Shaun\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 85 |\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 4 | McGraw\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 92 |\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 5 | Preston\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 92 |\r\n+----------+--------------+-------+\r\n5 rows in set (0.00 sec)<\/pre>\n<\/blockquote>\n<p>We wish to present ranks in some way similar to:<\/p>\n<blockquote>\n<pre>+----------+--------------+-------+------+\r\n| score_id | student_name | score | rank |\r\n+----------+--------------+-------+------+\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2 | Gromit\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 97 |\u00a0\u00a0\u00a0 1 |\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 1 | Wallace\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 95 |\u00a0\u00a0\u00a0 2 |\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 4 | McGraw\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 92 |\u00a0\u00a0\u00a0 3 |\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 5 | Preston\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 92 |\u00a0\u00a0\u00a0 3 |\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 3 | Shaun\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 85 |\u00a0\u00a0\u00a0 4 |\r\n+----------+--------------+-------+------+<\/pre>\n<\/blockquote>\n<p><!--more-->Note that McGraw and Preston got same scores, therefore both share 3rd rank, whereas Shaun gets ranked #4.<\/p>\n<h4>The common solution<\/h4>\n<p>Following is the SQL for the common solution:<\/p>\n<blockquote>\n<pre>SELECT\r\n  s1.score_id, s1.student_name, s1.score, COUNT(DISTINCT s2.score) AS rank\r\nFROM\r\n  score s1 JOIN score s2 ON (s1.score &lt;= s2.score)\r\nGROUP BY s1.score_id\r\n;\r\n+----------+--------------+-------+------+\r\n| score_id | student_name | score | rank |\r\n+----------+--------------+-------+------+\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 1 | Wallace\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 95 |\u00a0\u00a0\u00a0 2 |\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2 | Gromit\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 97 |\u00a0\u00a0\u00a0 1 |\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 3 | Shaun\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 85 |\u00a0\u00a0\u00a0 4 |\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 4 | McGraw\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 92 |\u00a0\u00a0\u00a0 3 |\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 5 | Preston\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 92 |\u00a0\u00a0\u00a0 3 |\r\n+----------+--------------+-------+------+<\/pre>\n<\/blockquote>\n<p>(The above can by ORDERed at will, more on this later)<\/p>\n<p>What I&#8217;m suggesting is that this self join is an overkill. It recalculates over and over what we knew a second before: to get the Preston&#8217;s rank, we need to count how many students got score &gt;=92. But when we need to find Shaun&#8217;s rank, we re-iterate these, and in addition count those with grades 85..91. We&#8217;re reading, re-reading, then re-re-reading (you get the point) the same data all over again. It&#8217;s a waste of energy.<\/p>\n<h4>Offering a new solution<\/h4>\n<p>I propose a simpler solution: do a one-time sorting of rows according to score (descending). The first row in the sorted set should obviously get the score &#8220;1&#8221;. Now we iterate the rows one by one, and keep a <em>rank<\/em> variable. Whenever the score remains the same, we just keep on iterating. Whenever the score changes (it can only change in the direction of &#8220;downwards&#8221;, since we sorted by score. descending), we increment the rank.<\/p>\n<p>Actually, the above explanation makes it sound as if we do this with multiple steps. This is not so. We do this all in one step:<\/p>\n<blockquote>\n<pre>SELECT\r\n  score_id, student_name, score,\r\n  @prev := @curr,\r\n  @curr := score,\r\n  @rank := IF(@prev = @curr, @rank, @rank+1) AS rank\r\nFROM\r\n  score,\r\n  (SELECT @curr := null, @prev := null, @rank := 0) sel1\r\nORDER BY score DESC\r\n;\r\n+----------+--------------+-------+----------------+----------------+------+\r\n| score_id | student_name | score | @prev := @curr | @curr := score | rank |\r\n+----------+--------------+-------+----------------+----------------+------+\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2 | Gromit\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 97 |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 NULL |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 97 |\u00a0\u00a0\u00a0 1 |\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 1 | Wallace\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 95 |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 97 |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 95 |\u00a0\u00a0\u00a0 2 |\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 4 | McGraw\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 92 |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 95 |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 92 |\u00a0\u00a0\u00a0 3 |\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 5 | Preston\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 92 |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 92 |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 92 |\u00a0\u00a0\u00a0 3 |\r\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 3 | Shaun\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 85 |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 92 |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 85 |\u00a0\u00a0\u00a0 4 |\r\n+----------+--------------+-------+----------------+----------------+------+<\/pre>\n<\/blockquote>\n<h4>Execution plan comparison<\/h4>\n<p>Do we have an index on the <strong>score<\/strong> column?<\/p>\n<p>If not, I clearly win. The self join (when there&#8217;s more than mere 5 rows, of course) will make for repeated full table scans, thereby making for O(n\u00b2).<\/p>\n<blockquote>\n<pre>+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+\r\n| id | select_type | table | type | possible_keys | key\u00a0 | key_len | ref\u00a0 | rows | Extra\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\r\n+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+\r\n|\u00a0 1 | SIMPLE\u00a0\u00a0\u00a0\u00a0\u00a0 | s1\u00a0\u00a0\u00a0 | ALL\u00a0 | NULL\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | NULL | NULL\u00a0\u00a0\u00a0 | NULL |\u00a0\u00a0\u00a0 5 | Using temporary; Using filesort |\r\n|\u00a0 1 | SIMPLE\u00a0\u00a0\u00a0\u00a0\u00a0 | s2\u00a0\u00a0\u00a0 | ALL\u00a0 | NULL\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | NULL | NULL\u00a0\u00a0\u00a0 | NULL |\u00a0\u00a0\u00a0 5 | Using where\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\r\n+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+<\/pre>\n<\/blockquote>\n<p>Whereas the suggested solution requires one filesort. This still means table data can be re-read, but significantly less so: it only takes O(n*log(n)), where the log(n) part is usually very small (and it all depend on <em>sort_buffer_size<\/em>).<\/p>\n<blockquote>\n<pre>+----+-------------+------------+--------+---------------+------+---------+------+------+----------------+\r\n| id | select_type | table\u00a0\u00a0\u00a0\u00a0\u00a0 | type\u00a0\u00a0 | possible_keys | key\u00a0 | key_len | ref\u00a0 | rows | Extra\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\r\n+----+-------------+------------+--------+---------------+------+---------+------+------+----------------+\r\n|\u00a0 1 | PRIMARY\u00a0\u00a0\u00a0\u00a0 | &lt;derived2&gt; | system | NULL\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | NULL | NULL\u00a0\u00a0\u00a0 | NULL |\u00a0\u00a0\u00a0 1 | Using filesort |\r\n|\u00a0 1 | PRIMARY\u00a0\u00a0\u00a0\u00a0 | score\u00a0\u00a0\u00a0\u00a0\u00a0 | ALL\u00a0\u00a0\u00a0 | NULL\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | NULL | NULL\u00a0\u00a0\u00a0 | NULL |\u00a0\u00a0\u00a0 5 |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\r\n|\u00a0 2 | DERIVED\u00a0\u00a0\u00a0\u00a0 | NULL\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | NULL\u00a0\u00a0 | NULL\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | NULL | NULL\u00a0\u00a0\u00a0 | NULL | NULL | No tables used |\r\n+----+-------------+------------+--------+---------------+------+---------+------+------+----------------+<\/pre>\n<\/blockquote>\n<p>Testing on the somewhat larger <em><a href=\"http:\/\/dev.mysql.com\/doc\/sakila\/en\/sakila.html\">sakila<\/a>.film<\/em> table (1000 rows is all) on my laptop, it takes 47 seconds for the common query to complete, 0.01 seconds to presented solution (repeatedly, no cache issues).<\/p>\n<p>What happens when we do have an index? Again, I win. Testing on <em>sakila.film<\/em> now, having added an index on <strong>location<\/strong> column:<\/p>\n<blockquote>\n<pre>+----+-------------+-------+-------+---------------+--------+---------+------+------+----------------------------------------------+\r\n| id | select_type | table | type\u00a0 | possible_keys | key\u00a0\u00a0\u00a0 | key_len | ref\u00a0 | rows | Extra\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\r\n+----+-------------+-------+-------+---------------+--------+---------+------+------+----------------------------------------------+\r\n|\u00a0 1 | SIMPLE\u00a0\u00a0\u00a0\u00a0\u00a0 | s2\u00a0\u00a0\u00a0 | index | length\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | length | 3\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | NULL | 1140 | Using index; Using temporary; Using filesort |\r\n|\u00a0 1 | SIMPLE\u00a0\u00a0\u00a0\u00a0\u00a0 | s1\u00a0\u00a0\u00a0 | ALL\u00a0\u00a0 | length\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | NULL\u00a0\u00a0 | NULL\u00a0\u00a0\u00a0 | NULL | 1140 | Using where\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\r\n+----+-------------+-------+-------+---------------+--------+---------+------+------+----------------------------------------------+<\/pre>\n<\/blockquote>\n<p>The above still needs to do index scan. The <strong>GROUP BY<\/strong> requires either sorting or utilizing the <strong>PRIMARY KEY<\/strong>, and the execution plan with reversed tables ordering does not improve.<\/p>\n<p>Presented solution utilized index for a single pass, with O(n) complexity:<\/p>\n<blockquote>\n<pre>+----+-------------+------------+--------+---------------+--------+---------+------+------+----------------+\r\n| id | select_type | table\u00a0\u00a0\u00a0\u00a0\u00a0 | type\u00a0\u00a0 | possible_keys | key\u00a0\u00a0\u00a0 | key_len | ref\u00a0 | rows | Extra\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\r\n+----+-------------+------------+--------+---------------+--------+---------+------+------+----------------+\r\n|\u00a0 1 | PRIMARY\u00a0\u00a0\u00a0\u00a0 | &lt;derived2&gt; | system | NULL\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | NULL\u00a0\u00a0 | NULL\u00a0\u00a0\u00a0 | NULL |\u00a0\u00a0\u00a0 1 |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\r\n|\u00a0 1 | PRIMARY\u00a0\u00a0\u00a0\u00a0 | film\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | index\u00a0 | NULL\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | length | 3\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | NULL | 1140 |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |\r\n|\u00a0 2 | DERIVED\u00a0\u00a0\u00a0\u00a0 | NULL\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | NULL\u00a0\u00a0 | NULL\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | NULL\u00a0\u00a0 | NULL\u00a0\u00a0\u00a0 | NULL | NULL | No tables used |\r\n+----+-------------+------------+--------+---------------+--------+---------+------+------+----------------+<\/pre>\n<\/blockquote>\n<p>This is done with a single index pass. Again, on my laptop, I get ~41 seconds for 1st query, 0.01 seconds for the proposed solution.<\/p>\n<p>I also argue that in addition, the results would often be required by order of rank, in which case the common solution must eventually sort anyhow.<\/p>\n<h4>Conclusion<\/h4>\n<p>To be honest, I&#8217;ve seen the self-join solution in so many places: textbooks, training material, online tutorials&#8230; Maybe it&#8217;s just a silly exercise, perhaps not your daily real-world task; but it&#8217;s one of those classic SQL problems. The so-often-repeated common solution is ANSI SQL, for sure, but at what cost?<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The common way of solving the classic SQL problem of ranking, involves a\u00a0 self join. I wish to present a different solution, which only iterates the table once, and provides the same output. The ranking problem Given a table with names and scores (e.g. students exams scores), add rank for each row, such that the [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"enabled":false},"version":2}},"categories":[5],"tags":[15,21],"class_list":["post-1293","post","type-post","status-publish","format-standard","hentry","category-mysql","tag-execution-plan","tag-sql"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p2bZZp-kR","_links":{"self":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/1293","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/comments?post=1293"}],"version-history":[{"count":20,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/1293\/revisions"}],"predecessor-version":[{"id":1314,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/1293\/revisions\/1314"}],"wp:attachment":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/media?parent=1293"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/categories?post=1293"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/tags?post=1293"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}