On user variables evaluation order

There is something very unclear about what is defined and is undefined with regard to the order by which user variables are evaluated, even within the MySQL documentation itself.

I wish to present some examples and draw a conclusion. Since I will following state there’s missing information, I would greatly appreciate any educated comments.

The trivial “reordering problem” case

Let’s look at the following query:

mysql> SET @a := 0;
Query OK, 0 rows affected (0.00 sec)

mysql> SET @b := 0;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT @a:=@b+1, @b:=@a FROM mysql.user;
+----------+--------+
| @a:=@b+1 | @b:=@a |
+----------+--------+
|        1 |      1 |
|        2 |      2 |
|        3 |      3 |
|        4 |      4 |
|        5 |      5 |
+----------+--------+
5 rows in set (0.04 sec)

According to the docs, I can’t rely on the fact that the @b:=@a part reads the value of @a after is has incremented, So I may eventually end up with:

+----------+--------+
| @a:=@b+1 | @b:=@a |
+----------+--------+
|        1 |      0 |
|        2 |      1 |
|        3 |      2 |
|        4 |      3 |
|        5 |      4 |
+----------+--------+

I wish to follow the documentation and see about other cases.

The trivial “trivial” case

Quoting the MySQL documentation (8.4. User-Defined Variables):

…The general rule is never to assign a value to a user variable in one part of a statement and use the same variable in some other part of the same statement. …

How does the above quote (focus on bold text) from the manual apply to the trivial:

SET @a := 0;
SELECT *, @a := @a+1 FROM mysql.user;

We’ve just used/read the value of @a, and have assigned it a new value (@a+1) at the same statement. What’s an other part? Is this considered to be the same part just because it’s the same column in the result set?

Let’s complicate this further:

A complicated case

Quoting the manual again, we see:

mysql> SET @t1=0, @t2=0, @t3=0;
mysql> SELECT @t1:=(@t2:=1)+@t3:=4,@t1,@t2,@t3;
+----------------------+------+------+------+
| @t1:=(@t2:=1)+@t3:=4 | @t1  | @t2  | @t3  |
+----------------------+------+------+------+
|                    5 |    5 |    1 |    4 |
+----------------------+------+------+------+

What’s going on here? Haven’t we been told not to assign and use the same variable in the same statement? @t1 has been assigned in the first part, then has been used in the second (according to the documentation itself, we could have got ‘0’ for @t1, second column). But I’m not keen on ranting, look further:

Looking at the first part of the statement: @t1:=(@t2:=1)+@t3:=4, what promises do I get here?

  • Does the use of brackets ensure that @t2 is evaluated before @t1?
  • Does the use of brackets ensure that @t3 is evaluated before @t1?
  • Does the order of expression tokens ensure that @t2 is evaluated before @t3 is assigned?

I think the answer to the above is ‘yes’ to all. But I can’t verify this using the text. The reasons I think all answer ‘yes’ are:

  • This all happens in only one part of the statement (ignoring the use of @t1, @t2, @t3 later on).
  • Experience verifies it is so.
  • I don’t see how ORDER BY, GROUP BY, HAVING (choosing examples from the docs) will change this: my understanding is that it all relates to column computation, and here this is a single column.

Drawing a conclusion

If the above is correct, and within the same part of the statement the use of brackets can dictate ordering of assignments, then we can workaround many problems arising from the danger of assigning and reading variables in the same statement.

If, in addition, the ordering of the tokens in the expression guarantees execution order within that part of statement, then the workaround does not need to be too ugly.

Using such expressions as COALESCE(@a := 1, @b := @a+1, …) or otherwise using IF/CASE/other tricks, we can perform as many assignments as we wish within the same part of the statement.

Looking back at SQL: Ranking without self join, the solution can be fixed, if the answers to the above are correct.

The use of user variables is not a child’s play; it can help solving problems with great speed, and small memory footprint. I think this issue is a little too obscurred. I call for clarifications.

5 thoughts on “On user variables evaluation order

  1. Hi Shlomi,

    I think you raise a few valid points. For instance I think that it is indeed correct to assume that one could force the order of evaluation with parenthesis if it is all contained within the same expression.

    However, the way I see it, this will only force the per-row evaluation order. Major problems can still arise when your expression relies on assignments that happened during the processing of some row other than the current row. Since the semantics of the SELECT statement is to deliver a set of rows, there is now way to force the order of row processing before constructing the resultset.

    A simple example:

    SET @a:=0, @b:=0;
    SELECT IF(@a:=@b+1, @b:=@a, NULL)
    FROM sakila.film_text;

    In this case, no columns are accessed and it really does not matter in what order the rows are processed. Most likely, they are scanned in the order that is convenient for the underlying storage engine. So this is safe.

    Now let’s add a column:

    SET @a:=0, @b:=0;
    SELECT film_id, IF(@a:=@b+1, @b:=@a, NULL)
    FROM sakila.film_text;

    In this case, there is a covering index for film_id, and this is likely to be used. So, rows are processed in order of the index on film_id. Still, this does not matter for the result.

    Now we change it still some:

    SET @a:=0, @b:=0;
    SELECT film_id, IF(@a:=@b+1, @b:=@a, NULL)
    FROM sakila.film
    ORDER BY title;

    The problem is, are the variables evaluated before the ORDER BY, or at the same time as the ORDER BY? I see different plans being used, and I get different results as compared to the previous query (beyond row order)

    It gets worse when we add a where. We would like the evaluations to be done after applying the WHERE, but what guarantee do we have?

    just my 2 cts.

  2. Hi Roland,

    Thanks very much for this detailed review.
    I see your point with you last example.
    This is exactly why I’m saying nothing is really defined. The docs only talk about “parts of the statement”, but have no discussion of the time at which variables are actually evaluated (e.g. before or after ORDER BY).

    For all I know, there *could be* a well defined behavior that we are simply unaware of. This is where I’d like to get more definitive answers. I realize this post may not be the best place to look for these answers, but I’m not sure where is.

    I’ll be perfectly content if told there’s a defined behavior in 1,2,3 simplest cases (e.g. full table scan, scan+ORDER BY, etc.). I can work from there to achieve results.

  3. Shlomi, I’ve actually tried to read the mysqld source code to see if I could find out how variables are used, and came to the conclusion that it’s buried deep in the optimizer and I don’t understand it well enough to draw conclusions.

  4. Baron,

    Thanks very much for this information! I suspected as much; I’ll also take the time to browse the source code.
    Meanwhile I’ve submitted a documentation bug and a feature request for more elaborate documentation.

    I made some experiments, and have reached what I think is a safe (and ugly) use of variables, where order of evaluation can be expected. But it’s only based on “empirical study”… not on sound knowledge.

    Regards

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.