Support Questions

Find answers, ask questions, and share your expertise

NOT IN operator does not work as expected in hive

avatar
Expert Contributor

Though the IN operator in hive works just fine in this case as well but NOT IN doesn't.

select DISTINCT SF.customer_email
FROM 
Magento.sales_flat_order SF
WHERE
YEAR(TO_DATE(SF.created_at)) = '2016'
AND 
MONTH(TO_DATE(SF.created_at)) = '6'
AND  
SF.customer_email 
NOT IN (
select SFO.customer_email FROM Magento.Sales_flat_order SFO 
WHERE
TO_DATE(SFO.created_at) >= '2016-05-31'
)

If I replace NOT IN with IN operator, that works. Inact using NOT IN with a list of strings specified works too but it somehow does not work with select statement.

How do I get this working ?

3 REPLIES 3

avatar

Hi @Simran Kaur. I notice this is a similar question as the last one you posted here: (https://community.hortonworks.com/questions/43885/not-in-operator-in-hive-with-a-subquery.html#answer-43944). I am assuming that back then you checked the results of the inner query and the outer query and that you are getting the results you expect.

Your query above should return zero rows - you are looking for email addresses created in June of 2016 that are NOT in the list of email addresses created greater than May of 2016. This would return zero rows.

So rather than an error - an empty set is the correct result.

avatar
Expert Contributor
@bpreachuk

Oh no no, That was a typo. I changed it to TO_DATE(SFO.created_at) <= '2016-05-31' and it still returns empty resultset

avatar

OK. That's good. I have not seen any specific issue with NOT IN.

Without seeing the data it stands to be asked again - You have confirmed that (a) count(*) on the inner query returns results, (b) that it is not failing silently, and (c) that there are 1+ email addresses in June do not exist prior to June...

Here are 2 things to check to see if it is indeed a Hive issue...

(1) Try running the query once with MapReduce as the execution engine and then with Tez as the execution engine and see if you get differing results.

   set hive.execution.engine=mr; 
   set hive.execution.engine=tez;

(2) Change the query to use a temporary table and evaluate your results that way (double-check my syntax but it should be close to correct...

create temporary table temp_table 
as select customer_email FROM Table1  
WHERE TO_DATE(created_at) < '2016-05-31'; 

select DISTINCT SF.customer_email
FROM  Magento.sales_flat_order SF
WHERE YEAR(TO_DATE(SF.created_at)) = '2016'
AND  MONTH(TO_DATE(SF.created_at)) = '6'
AND  SF.customer_email NOT IN (select customer_email from temp_table); 

select DISTINCT SF.customer_email
FROM  Magento.sales_flat_order SF
   left join temp_table 
       on SF.customer_email = temp_table.customer_email 
WHERE YEAR(TO_DATE(SF.created_at)) = '2016'
AND  MONTH(TO_DATE(SF.created_at)) = '6'
AND  temp_table.customer_email is NULL