{"id":1330,"date":"2009-09-25T04:02:08","date_gmt":"2009-09-25T02:02:08","guid":{"rendered":"http:\/\/code.openark.org\/blog\/?p=1330"},"modified":"2009-09-25T15:20:52","modified_gmt":"2009-09-25T13:20:52","slug":"sql-ranking-without-self-join-revisited","status":"publish","type":"post","link":"https:\/\/code.openark.org\/blog\/mysql\/sql-ranking-without-self-join-revisited","title":{"rendered":"SQL: ranking without self join, revisited"},"content":{"rendered":"<p>This post follows <a title=\"Link to SQL: Ranking without self join\" rel=\"bookmark\" href=\"http:\/\/code.openark.org\/blog\/mysql\/sql-ranking-without-self-join\">SQL: Ranking without self join<\/a> and <a title=\"Link to On user variables evaluation order\" rel=\"bookmark\" href=\"http:\/\/code.openark.org\/blog\/mysql\/on-user-variables-evaluation-order\">On user variables evaluation order<\/a>. I wish to share some insights with regard to user variables evaluation, as well as provide <em>yet another<\/em> ranking solution, which attempts to overcome the uncertainty factor with user variables.<\/p>\n<p>There will be hand waving in this post (albeit empirical hand waving). Stop here if you don&#8217;t like hand waving. Continue if you feel curious or wish to contradict my assumptions.<\/p>\n<h4>Recap<\/h4>\n<p>The order of evaluation of user variables is undefined. The documentation has some contradicting example (bug <a href=\"http:\/\/bugs.mysql.com\/bug.php?id=47514\">47514<\/a>), but states that variables should not be read and assigned in different parts of the same statement (just <em>what is<\/em> a different part? Bug <a href=\"http:\/\/bugs.mysql.com\/bug.php?id=47516\">47516<\/a>).<\/p>\n<h4>Looking for a solution<\/h4>\n<p>There doesn&#8217;t seem to be a problem with reading and assigning variables in the very same part of the statement. For example, <strong>SELECT @a := @a+1<\/strong>&#8230; is exactly such a case.<\/p>\n<p><!--more-->So I was looking for means to setup all the variables in one single &#8220;part&#8221; of the statement, which is to say, a single column. I&#8217;ve found several solutions, here&#8217;s one:<\/p>\n<h4>COALESCE()<\/h4>\n<p>The <strong>COALESCE()<\/strong> function accepts any number of parameters (&gt;=1), and returns the first non-NULL parameter. Tests show that it iterates the parameters and <em>stops iterating<\/em> at the first non-NULL parameter. So, parameter #3 is not even evaluated if parameter #2 is proven to be non-NULL.<\/p>\n<p>This makes for a very interesting property: there is a known order of evaluation of parameters in this function.<\/p>\n<p>So, for example:<\/p>\n<blockquote>\n<pre>SET @a := 0, @b := 0;\r\nSELECT COALESCE(@a := @a+1, @b := @a+3);\r\n+----------------------------------+\r\n| COALESCE(@a := @a+1, @b := @a+3) |\r\n+----------------------------------+\r\n|                                1 |\r\n+----------------------------------+\r\n\r\nSELECT @a, @b;\r\n+------+------+\r\n| @a   | @b   |\r\n+------+------+\r\n| 1    | 0    |\r\n+------+------+<\/pre>\n<\/blockquote>\n<p>We know that <strong>@a+1<\/strong> is first evaluated, and only if it turns to be NULL, <strong>@a+3<\/strong> is evaluated (which isn&#8217;t the case here).<\/p>\n<p><em>PS: <strong>CASE<\/strong> statements also work this way, but <strong>COALESCE()<\/strong> is more compact, so I&#8217;ll use it instead.<\/em><\/p>\n<p>We still can&#8217;t be sure that the <em>assignment<\/em> takes place at that time. For example, is it possible that <strong>@a+1<\/strong> is first evaluated, then (assuming it&#8217;s NULL) <strong>@a+3<\/strong> is evaluated, and only then <strong>@a<\/strong> and <strong>@b<\/strong> get the assignment?<\/p>\n<p>I find this extremely unlikely. That would mean, code-wise, creating a temporary map which caches every internal, intermediate result, until column value is finally evaluated, only then to be iterated and mapped to the variables. To give an extreme example, look at:<\/p>\n<blockquote>\n<pre>SELECT @a := (@b := 3 + (@c := @c + 2*(@d := SIN(id)))) FROM world.City<\/pre>\n<\/blockquote>\n<p>Would <strong>SIN(id)<\/strong>, <strong>2*SIN(id)<\/strong>, <strong>@c + 2*SIN(id)<\/strong>, <strong>3 + (@c + 2*SIN(id))<\/strong> all be cached even while the result is being evaluated? Not only is this unlikely, user variables common usage indicates the opposite. I think it&#8217;s safe to say that <strong>@d<\/strong> is first assigned, then <strong>@c<\/strong>, then <strong>@b<\/strong>, and <strong>@a<\/strong> being the last.<\/p>\n<h4>ORDER BY<\/h4>\n<p>Yet another issue is: if q query has an <strong>ORDER BY<\/strong>, how can I be certain that user variables are iterated <em>after<\/em> ordering takes place, and not <em>during<\/em> sort merge passes?<\/p>\n<p>Apart from:<\/p>\n<ul>\n<li> Promising I&#8217;ll try and read the source code<\/li>\n<li>Wait for qualified answers from the programmers<\/li>\n<li>Empirically testing this is not the case<\/li>\n<\/ul>\n<p>Let me have one last hand waving: if that were the case, we wouldn&#8217;t be using user variables anytime, anywhere; everything would be too fragile.<\/p>\n<h4>Ranking without self join<\/h4>\n<p><a href=\"http:\/\/rpbouman.blogspot.com\/\">Roland Bouman<\/a> presented <a href=\"http:\/\/rpbouman.blogspot.com\/2009\/09\/mysql-another-ranking-trick.html\">another ranking trick<\/a>. A very cool one! Why not settle with that solution?<\/p>\n<p>The thing which bothers me with it is the memory required to create the concatenated score values. We need to build an in-memory string representation of all scores. So the memory requirements for his algorithm are O(n), whereas I strive to use an O(1) amount of memory. The problem with O(n) is that it may not work well (or not at all) for very large datasets. Even for tables indexed by the score column, that amount of memory must be allocated. Then parsed; and then parsed again; and again; and each time the parsing should take longer, since each time we&#8217;re looking at a later token.<\/p>\n<p>So I wish to present yet another solution, which relies on the <strong>COALESCE()<\/strong> properties discussed above. I&#8217;ll force all <strong>COALESCE()<\/strong> parameters but last one to be NULL, thereby forcing <strong>COALESCE()<\/strong> to keep iterating and evaluate all parameters, in an expected order. Again, I&#8217;ll be using a simple score table (download <a href=\"..\/wp-content\/uploads\/2009\/09\/score.sql\">score.sql<\/a>):<\/p>\n<blockquote>\n<pre>SELECT\r\n  score_id, student_name, score,\r\n  @rank := COALESCE(\r\n    NULL + (@prev := @curr),\r\n    NULL + (@curr := score),\r\n    IF(@prev = @curr, @rank, @rank+1)\r\n  ) AS rank\r\nFROM\r\n score,\r\n (SELECT @curr := null, @prev := null, @rank := 0) sel1\r\nORDER BY score DESC\r\n;\r\n+----------+--------------+-------+------+\r\n| score_id | student_name | score | rank |\r\n+----------+--------------+-------+------+\r\n|        2 | Gromit       |    97 |    1 |\r\n|        1 | Wallace      |    95 |    2 |\r\n|        4 | McGraw       |    92 |    3 |\r\n|        5 | Preston      |    92 |    3 |\r\n|        3 | Shaun        |    85 |    4 |\r\n+----------+--------------+-------+------+<\/pre>\n<\/blockquote>\n<p>I&#8217;ve tried the above with several variations, with <strong>GROUP BY<\/strong> on <em>score_id<\/em>; on <em>score_id, other columns<\/em>; with <strong>GROUP BY<\/strong> mixed with <strong>ORDER BY<\/strong>; to always reach the same result set.<\/p>\n<div id=\"_mcePaste\" style=\"overflow: hidden; position: absolute; left: -10000px; top: 0px; width: 1px; height: 1px;\">SET @a := 0, @b := 0;<br \/>\nSELECT COALESCE(@a := @a+1, @b := @b+1);<br \/>\n+&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-+<br \/>\n| COALESCE(@a := @a+1, @b := @b+1) |<br \/>\n+&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-+<br \/>\n|\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 1 |<br \/>\n+&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-+<\/p>\n<p>SELECT @a, @b;<br \/>\n+&#8212;&#8212;+&#8212;&#8212;+<br \/>\n| @a\u00a0\u00a0 | @b\u00a0\u00a0 |<br \/>\n+&#8212;&#8212;+&#8212;&#8212;+<br \/>\n| 1\u00a0\u00a0\u00a0 | 0\u00a0\u00a0\u00a0 |<br \/>\n+&#8212;&#8212;+&#8212;&#8212;+<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>This post follows SQL: Ranking without self join and On user variables evaluation order. I wish to share some insights with regard to user variables evaluation, as well as provide yet another ranking solution, which attempts to overcome the uncertainty factor with user variables. There will be hand waving in this post (albeit empirical hand [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"enabled":false},"version":2}},"categories":[5],"tags":[21],"class_list":["post-1330","post","type-post","status-publish","format-standard","hentry","category-mysql","tag-sql"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p2bZZp-ls","_links":{"self":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/1330","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/comments?post=1330"}],"version-history":[{"count":15,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/1330\/revisions"}],"predecessor-version":[{"id":1345,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/posts\/1330\/revisions\/1345"}],"wp:attachment":[{"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/media?parent=1330"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/categories?post=1330"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/code.openark.org\/blog\/wp-json\/wp\/v2\/tags?post=1330"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}