I need to know which performs better in hive. EXISTS or IN?
My understanding is it depends on the result size for the sub-query.
You know I don't know. The fastest way would be to try both, and see which is faster. In general you could use explain to compare the two but you do need to understand how to read 'explain' it's not that intuitive. (I would skip trying to look at the syntax tree and focus on the dependencies.)
If you are concerned about optimizing you query read over this article which highlights some quick wins for speeding up hive.
exists keyword it's intended as a way to avoid counting of the entire collection or table. Something like:
--this statement needs to check the entire table select count(*) from [table] where ... --this statement is true as soon as one match is found exists ( select * from [table] where ... )
The in is best used where you have a static list to pass:select * from [table] where [field] in (1, 2, 3)
Assuming that they are used to achieve the exact same functionality, which I find it as a big stretch,
EXISTS is much faster than
IN. Some may say that IN is faster for smaller data set, but that could be an absolute random result when maybe the list was already cached or the field used was maybe an integer and the table had one record :)
Let's understand how they work.
EXISTS will return a boolean TRUE as soon as the condition is met and no further table scan is performed. While, IN will still do a full table scan. I hope that clarifies it why in a large table EXISTS will be always faster. If the table has one record :) then the chance is 50-50%.
If any of the responses to your question addressed the problem don't forget to vote and accept the answer. If you fix the issue on your own, don't forget to post the answer to your own question. A moderator will review it and accept it.
Can you explain in a little more detail?
'Exists' does have to do 2 table scans and join the results. (And yes it can shortcut the boolean logic)
'In' only does one table scan, and no join.
Is it because the work is better distributed via reducers with 'Exists'?
I wished I could see the SQL in question. If you are talking about a query where you want to select some fields from one table where a field value EXISTS in other table, I get it. It is two full table scans, most likely a small one (the lookup table) and a bigger one, the transactions table. That is useful when you have a lookup-like table which you don't know the values. That is a traditional dynamic list filtering problem. IN is useful for a static list. That's why I don't get it why we have to compare EXISTS with IN. Use IN if you have a static list. That is one table scan.
My theory was about the general use of EXISTS to determine whether a specific record field value from the transactions table exists in the lookup table. That will not be necessarily a full table scan.
I'm totally in agreement with @Constantin Stanca that in SQL 'IN' is for static lists and Exists is for dynamic data sets. His point is solid that they are different use cases.
Exists will not shortcut in hive. (with default map/reduce.) Map/Reduce jobs don't have a shortcut method. (That said, if you use a different engine under the hood for HIVE like SPARK or Tez.. lots of those engines do use optimizations/partitions/strategies to only pull back the required data.) I did test a small query on a small data set and 'IN' and 'EXISTS' ran in exactly the same time. (On Map/reduce)
When I looked at the Execution plan for both queries (using a sub query for both EXISTS and IN) they were functionally equivalent. Meaning it doesn't matter for map/reduce what you use from a speed perspective. (If you are using a subquery.) To me this lends more strength to @Constantin Stanca statement. Follow the SQL convention.