Join vs. IN Not IN

Experienced DBAs and Developers…might as well pass on by this one 🙂

Today, I want to turn SQL into English for (mostly) Developers and (some) DBAs that are asked to develop or tune queries.

No matter where I go, I always run into a Dev that loves to use NOT IN.   I get it.  Its simple, it makes sense, and it works.  You will get back the correct rows.

BUT!!!   Lets talk through a simple example as to why a JOIN is going to be far more efficient and make your DBAs and managers very happy….

Scenario:

You are a book lover, and want something new to read.  So you go to the local Narnes and Boble.   When you get there, you find something that sounds familiar, but you can’t remember if it is in your collection at home or not.

The Setup:

I created a table (AllBooks) with a thousand rows and another table (MyBooks) with 49.  Both have a BookID (pretend its the ISBN number so the analogy works).

My two choices for determining what books I do not own…look for books NOT IN my current collection, or JOIN the two lists (yeah…the analogy breaks down a bit here, but stick with me).

The Queries:

Basic stuff, right?   Both will return 951 records (books) that I do not own.  And, very quickly…because the tables are tiny.   Sub-1 second is fast.

The issue here is HOW the rows are compared.

English version now, techy stuff later:

In the first query, this is equivalent to you standing at the bookstore and calling home to have someone check to see if the book in your hand is already in your collection.  EVERY time.  One by one.

In the second, you got really smart and brought a list with you, which you are comparing to the books on the shelf at the store.   You’ve got both “lists” in one place, so it is far more efficient.

Techy stuff/Execution Plans:

Disclaimer…I’m not trying to teach you to read execution plans or even know the deeper meaning of the various operators…just going to show the plans behind the queries above.  If you DO want to know more than you can ever use, go download Grant Fritchey‘s e-book.   And Follow/Stalk Grant on Twitter.  He loves that.

The first query uses Nested Loop operators to work through the two lists:
“For each row in the top (outer) input, scan the bottom (inner) input, and output matching rows”

The second uses a Hash:
“Use each row from the top input (MyBooks) to build a hash table, and each row from the bottom input (AllBooks) to probe into the hash table outputting all matching rows”

These were run at the same time.  85% of the time spent on the NOT IN, compared to 15% on the JOIN.   This was for a very small sample of data…1000 rows and 49 rows.   Take this up tens of thousands, or millions and you’ll start to see significant delays in getting results using NOT IN.

Short story long…if you are writing or tuning queries, this is an easy win.

Go get some fries.  And some books….49 is a pitifully small number for someone of your intelligence and ability 🙂

Kevin3NF

Leave a Comment

Sign up for our Newsletter

%d bloggers like this: