Experienced DBAs and Developers…might as well pass on by this one ๐
Today, I want to turn SQL into English for (mostly) Developers and (some) DBAs that are asked to develop or tune queries.
No matter where I go, I always run into a Dev that loves to use NOT IN. I get it. Its simple, it makes sense, and it works. You will get back the correct rows.
BUT!!! Lets talk through a simple example as to why a JOIN is going to be far more efficient and make your DBAs and managers very happy….
Scenario:
You are a book lover, and want something new to read. So you go to the local Narnes and Boble. When you get there, you find something that sounds familiar, but you can’t remember if it is in your collection at home or not.
The Setup:
I created a table (AllBooks) with a thousand rows and another table (MyBooks) with 49. Both have a BookID (pretend its the ISBN number so the analogy works).
My two choices for determining what books I do not own…look for books NOT IN my current collection, or JOIN the two lists (yeah…the analogy breaks down a bit here, but stick with me).
The Queries:
Basic stuff, right? Both will return 951 records (books) that I do not own. And, very quickly…because the tables are tiny. Sub-1 second is fast.
The issue here is HOW the rows are compared.
English version now, techy stuff later:
In the first query, this is equivalent to you standing at the bookstore and calling home to have someone check to see if the book in your hand is already in your collection. EVERY time. One by one.
In the second, you got really smart and brought a list with you, which you are comparing to the books on the shelf at the store. You’ve got both “lists” in one place, so it is far more efficient.
Techy stuff/Execution Plans:
Disclaimer…I’m not trying to teach you to read execution plans or even know the deeper meaning of the various operators…just going to show the plans behind the queries above. If you DO want to know more than you can ever use, go download Grant Fritchey‘s e-book. And Follow/Stalk Grant on Twitter. He loves that.
The first query uses Nested Loop operators to work through the two lists:
“For each row in the top (outer) input, scan the bottom (inner) input, and output matching rows”
The second uses a Hash:
“Use each row from the top input (MyBooks) to build a hash table, and each row from the bottom input (AllBooks) to probe into the hash table outputting all matching rows”
These were run at the same time. 85% of the time spent on the NOT IN, compared to 15% on the JOIN. This was for a very small sample of data…1000 rows and 49 rows. Take this up tens of thousands, or millions and you’ll start to see significant delays in getting results using NOT IN.
Short story long…if you are writing or tuning queries, this is an easy win.
Go get some fries. And some books….49 is a pitifully small number for someone of your intelligence and ability ๐
Kevin3NF
It depends. With small datasets you get one plan. Try it on some Millon row tables with appropriate indexes. You'll get different plans
there are use cases, where in and not in are faster: consider you are having a column with many Null values, e.g. having a CoverImageID column and an Images table with ImageID, but most of your books don't have the CoverImageID populated. if you are searching for a book with a specific cover image, it might be more efficient to query SELECT * FROM All_books WHERE CoverImageID IN (SELECT ImageID FROM Images WHERE fileName Like @fn).
Thanks for the feedback!
Both good points. The point of the post was to explain to Junior DBAs, Analysts, etc how different syntax that does the same thing can be wildly different in performance, in simple, easy to relate to terms ๐
This actually came out of a coffee room conversation with an Analyst that only queries data…
The NOT IN will work much better if it is correlated:
SELECT
AB.*
FROM
AllBooks AS AB
WHERE
BookID NOT IN
(SELECT BookID from MyBooks WHERE BookID = AB.BookID);
The resulting plan gives the best of both: the efficiency of a left anti semi join, but minus the filter of the NOT NULL query.
A third way would be an NOT EXISTS query:
SELECT AB.*
FROM AllBooks AS AB
WHERE NOT EXISTS (SELECT BookID from MyBooks WHERE BookID = AB.BookID);
A EXISTS query (without NOT) is particularly useful, if it is a 1:n query, e.g. because MyBooks not only contains your own books, but also the books of your wife, kids, parents, best friends and some books are very populare, so that there are multiple entries in your table. If you would use an INNER JOIN in this case, you would receive duplicates (three times the same AB.*-record), with the EXISTS (or the IN) clause you would get it only once.