MySQL joins: ON vs. USING vs. Theta-style

July 11, 2012

What is the difference between the following three syntaxes?

SELECT * FROM film JOIN film_actor ON (film.film_id = film_actor.film_id)
SELECT * FROM film JOIN film_actor USING (film_id)
SELECT * FROM film, film_actor WHERE film.film_id = film_actor.film_id

The difference is mostly syntactic sugar, but with a couple interesting notes.

To put names, the first two are called "ANSI-style" while the third is called "Theta-style".

Theta style

On the FROM clause, tables are listed as if with Cartesian products, and the WHERE clause specifies how the join should take place.

This is considered to be the "old" style. It is somewhat confusing to read. Consider the following query:

SELECT * FROM film, film_actor WHERE film.film_id = film_actor.film_id AND actor_id = 17 AND film.length > 120

The above lists films over 120 minutes in length, in which actor #17 plays. Never mind the results; what about the query? Being just one part of the WHERE clause, a one out of three elements in the AND expression, the join equation gets lost. It is difficult to find and isolate the terms which make for table joins as opposed to terms which filter out rows. In the above example it is still relatively easy to point out. How about a query with 5 tables and a 20 terms WHERE clause?

ANSI style: ON

With JOIN ... ON, one separates the join terms from the filtering terms. Rewriting the previous example:

SELECT * FROM film JOIN film_actor ON (film.film_id = film_actor.film_id) WHERE actor_id = 17 AND film.length > 120

It is quite clear now what belongs to what.

Note: the parenthesis are not strictly required in the ON clause. I personally like to use them: it makes for an even greater distinction between query parts. SQL syntax is such a mess!

ANSI style: USING

Is the special case where we join tables on columns of the same name, we can make a shortcut and use USING:

SELECT * FROM film JOIN film_actor USING (film_id) WHERE actor_id = 17 AND film.length > 120

This time the parenthesis are required (I'm not sure why the difference on that part).

This is mainly a nicety, less words to type, and a resulting prettified query. But also note a couple differences:

USING vs. ON

The following is valid:

SELECT film.title, film_id FROM film JOIN film_actor USING (film_id) WHERE actor_id = 17 AND film.length > 120;

But the following is not:

SELECT film.title, film_id FROM film JOIN film_actor ON (film.film_id = film_actor.film_id) WHERE actor_id = 17 AND film.length > 120;
ERROR 1052 (23000): Column 'film_id' in field list is ambiguous

Since USING "knows" the film_id column is shared between both tables, it doesn't mind if we ask it without specifying an exact table. It would be the same value anyway!

ON is not as smart and requires further clarifications: which table exactly do you want?

And the above is actually the result of this interesting phenomena: when using USING, the column only appears once in the result set:

SELECT * FROM film JOIN film_actor USING (film_id) WHERE actor_id = 17 AND film.length > 120 LIMIT 1\G
*************************** 1. row ***************************
             film_id: 96
               title: BREAKING HOME
         description: A Beautiful Display of a Secret Agent And a Monkey who must Battle a Sumo Wrestler in An Abandoned Mine Shaft
        release_year: 2006
         language_id: 1
original_language_id: NULL
     rental_duration: 4
         rental_rate: 2.99
              length: 169
    replacement_cost: 21.99
              rating: PG-13
    special_features: Trailers,Commentaries
         last_update: 2006-02-15 05:03:42
            actor_id: 17
         last_update: 2006-02-15 05:05:03

But joining on ON, we get this column twice:

SELECT * FROM film JOIN film_actor ON film.film_id = film_actor.film_id WHERE actor_id = 17 AND film.length > 120 LIMIT 1\G
*************************** 1. row ***************************
             film_id: 96
               title: BREAKING HOME
         description: A Beautiful Display of a Secret Agent And a Monkey who must Battle a Sumo Wrestler in An Abandoned Mine Shaft
        release_year: 2006
         language_id: 1
original_language_id: NULL
     rental_duration: 4
         rental_rate: 2.99
              length: 169
    replacement_cost: 21.99
              rating: PG-13
    special_features: Trailers,Commentaries
         last_update: 2006-02-15 05:03:42
            actor_id: 17
             film_id: 96
         last_update: 2006-02-15 05:05:03

Behind the scenes

The news is that MySQL treats all in the exact same way. With the kind help of EXPLAIN EXTENDED, we see that:

EXPLAIN EXTENDED SELECT film.title, film_id FROM film JOIN film_actor USING (film_id) WHERE actor_id = 17 AND film.length > 120\G
*************************** 1. row ***************************
...
2 rows in set, 1 warning (0.00 sec)

root@mysql-5.1.51> SHOW WARNINGS\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: select `sakila`.`film`.`title` AS `title`,`sakila`.`film`.`film_id` AS `film_id` 
         from `sakila`.`film` join `sakila`.`film_actor` 
         where (
                 (`sakila`.`film`.`film_id` = `sakila`.`film_actor`.`film_id`) 
                 and (`sakila`.`film_actor`.`actor_id` = 17) 
                 and (`sakila`.`film`.`length` > 120)
               )

All queries are translated internally to theta-style.

This post only discusses inner joins. With outer joins the situation is somewhat different. Read this post for more insight.

tags: ,
posted in MySQL by shlomi

« | »

Follow comments via the RSS Feed | Leave a comment | Trackback URL

14 Comments to "MySQL joins: ON vs. USING vs. Theta-style"

  1. Justin Swanhart wrote:

    Perhaps you forgot the special NATURAL JOIN which is kind of like USING , but with different semantics about which columns are returned from the query.

  2. shlomi wrote:

    @Justin,
    I did not forget - I just ignore its existence :)

  3. Ike Walker wrote:

    Great post! It begs the question, which syntax do you prefer?

    Personally, I strongly prefer ANSI-style over Theta-style because I find it much less error prone and more readable.

    As for USING vs. ON, I prefer ON. I find USING to be overly prescriptive in terms of column naming. Unless you never use self-referencing foreign keys or multiple foreign keys from a child table to the same parent table, then there will be cases where you need to use ON instead of USING.

    I'm not sure if you are advertising the "SELECT *" differences and the ability to omit the table alias as benefits of USING, but I do not see those as advantages. I try to avoid "SELECT *" as much as possible, especially when selecting from more than one table. And I also believe in always including the table alias for all columns in the SELECT clause when a join is involved. That way I protect myself from errors if I add a column in the future that would make a column reference in an existing query ambiguous.

  4. shlomi wrote:

    @Ike,
    When I started programming with SQL, I wan't even aware of the "JOIN ... ON" syntax. I was just doing Theta style.

    When I came to learn of the "JOIN ... ON" I realized it was a much cleaner syntax. I now use it regularly.

    I don't use USING much. Not because I don't like to, but because you need to use a very specific naming convention to be able to use it. When possible, I typically use it.

    There can be, at rare times, an optimization issue, where you want to explicitly use the film.film_id in some equation.

    I'm not advertising the SELECT * and rarely use it. I use it here because it simplifies the query and allows me to focus on other parts. Also, it is an essential case for the ince/twice issue discussed.

  5. Roland Bouman wrote:

    Hi!

    Nice write-up.

    My guess is, JOIN syntax will never cease to be an object of discussion. It's clear to me there will always be a need to specify very specific join conditions, and we can't get by with only "smart" join syntax like NATURAL and USING. (By "smart" I mean a declarative syntax, that does not require explicit comparisons and logical operators)

    That said, I think that at least 80% of all join syntax conforms to foreign keys, probably more in typical normalized OLTP data base schemas. So I think there certainly is room for a "smart" join syntax.

    Desite SQL being a declarative language, no RDBMS that I know of allows you to reference a foreign key constraint as a proxy for the join condition. (I wrote about it in detail here http://rpbouman.blogspot.nl/2006/04/intelligent-sql-join-syntax_09.html).

    TL;DR: if we have a table film_actor and it has a foreign key fk_film_actor_film to film, why can't we write:

    FROM film RELATE fk_film_actor_film

    or if you insist on keeping table names, why can't we write:

    FROM film JOIN film_actor VIA fk_film_actor_film

    The "named columns join" (aka the USING-syntax) looks like a step in the right direction, but it has a number of drawbacks that make me shun it, always.

    Shlomi pointed out that it requires a very strict naming convention. That's true, but I'd argue that no naming convention is ever strict enough to make it generically usable. Simple example from the sakila database: consider film and language. There are 2 foreign keys between them, both from film pointing to language. Since the foreign key columns in film must have unique names, we can only use the named columns join for one of them. So if we'd have a query that gets both the language and the original language for a film, we'd get:

    FROM film
    JOIN language USING (language_id)
    LEFT JOIN language original_language ON film.original_language_id = original_language.language_id

    So, basically we're doing the exact same thing in two different ways.

    Personally I prefer a uniform syntax over a more compact one, but I admit that this is ultimately a matter of taste. Some may have good reasons to prefer it the other way around.

    There is a more serious problem with the named columns join. It's ok if you know how it works, but it's not clear from the syntax itself. Consider this query:

    select address.address_id, customer.address_id, staff.address_id, store.address_id
    from address
    left join customer using(address_id)
    left join staff using (address_id)
    left join store using (address_id)

    what does this query mean? The first join is clear, take the rows from address and outer join to customer.

    But then, how should we join staff? does the second USING (address_id) refer to the shared column between customer and staff, or between address and staff? Same problem for the join to store.

    Don't blame yourself if you didn't know the answer, the syntax is such that it is unclear what is joined to what.

    My conclusion is that column names are simply not a good basis for defining declarative join syntax.

  6. shlomi wrote:

    Roland,

    Great examples on multiple choices (multiple FKs, multiple columns names address_id).

    With those examples it becomes very clear how being declarative SQL becomes ambiguous or undefined.

    I agree to the notion of RELATE using a FK.

    To be honest, I think SQL is a mess. It also uses ancient syntax rules. This doesn't mean I don't enjoy it, though :)

    But it is my belief that ANSI SQL could have been much better if redesigned, in many aspects.

  7. Roland Bouman wrote:

    Hi Shlomi - yeah I feel the same, it's a mess. But I love to tinker away and write queries.

  8. ypercube wrote:

    I think that the ANSIstyle UISNG syntax requires parenthesis to distinguish the anbiguity:

    FROM film JOIN film_actor USING (film_id, x)

    (where x is a column)

    vs

    FROM (film JOIN film_actor USING film_id) , x

    (where x is a table)

  9. shlomi wrote:

    @ypercube,
    In MySQL USING requires parenthesis regardless, so no ambiguities here.

  10. ypercube wrote:

    Sorry, I wasn't clear. I meant that without parenthesis, this would be ambiguous:

    FROM film JOIN film_actor USING film_id, x

    as to which of the two parenthesized versions above it should be parsed.

  11. shlomi wrote:

    @ypercube: you were very clear :)

    I'm saying this syntax is not allowed in MySQL anyway: you HAVE to put parenthesis with USING.

  12. Jayesh wrote:

    Hi,

    I had a doubt with respect to the ANSI syntax for join. How would the execution of the query change if i use the two queries given below -

    query 1:
    table_1 join table_2 using(col 1) where table_1.date_1 >= table_2.date_2 and table_1.date_1 >= table_2.date_3 and table_1.date_1 = table_2.date_2 and table_1.date_1 >= table_2.date_3 and table_1.date_1 < table_2.date_4);

    When i used query 1, my total records were reduced from 3 million to 600k. But it was not the case when i used the query 2.

    I am unable to understand what went wrong in query 1.

  13. Jayesh wrote:

    My apologies. I might have posted incomplete question.

    query 1:

    table_1 join table_2 using(col 1) where table_1.date_1 >= table_2.date_2 and table_1.date_1 >= table_2.date_3 and table_1.date_1 = table_2.date_2 and table_1.date_1 >= table_2.date_3 and table_1.date_1 < table_2.date_4);

    What is the difference between the execution and expected results from the 2 queries?

    Query 1 returned on 10% of records from query_1 while query 2 returned the desired result.

  14. shlomi wrote:

    What's query number two?

Leave Your Comment

 
Powered by Wordpress and MySQL. Theme by openark.org