Created 07-12-2016 04:59 AM
Though the IN operator in hive works just fine in this case as well but NOT IN doesn't.
select DISTINCT SF.customer_email FROM Magento.sales_flat_order SF WHERE YEAR(TO_DATE(SF.created_at)) = '2016' AND MONTH(TO_DATE(SF.created_at)) = '6' AND SF.customer_email NOT IN ( select SFO.customer_email FROM Magento.Sales_flat_order SFO WHERE TO_DATE(SFO.created_at) >= '2016-05-31' )
If I replace NOT IN with IN operator, that works. Inact using NOT IN with a list of strings specified works too but it somehow does not work with select statement.
How do I get this working ?
Created 07-12-2016 01:36 PM
Hi @Simran Kaur. I notice this is a similar question as the last one you posted here: (https://community.hortonworks.com/questions/43885/not-in-operator-in-hive-with-a-subquery.html#answer-43944). I am assuming that back then you checked the results of the inner query and the outer query and that you are getting the results you expect.
Your query above should return zero rows - you are looking for email addresses created in June of 2016 that are NOT in the list of email addresses created greater than May of 2016. This would return zero rows.
So rather than an error - an empty set is the correct result.
Created 07-12-2016 01:44 PM
Oh no no, That was a typo. I changed it to TO_DATE(SFO.created_at) <= '2016-05-31' and it still returns empty resultset
Created 07-12-2016 02:08 PM
OK. That's good. I have not seen any specific issue with NOT IN.
Without seeing the data it stands to be asked again - You have confirmed that (a) count(*) on the inner query returns results, (b) that it is not failing silently, and (c) that there are 1+ email addresses in June do not exist prior to June...
Here are 2 things to check to see if it is indeed a Hive issue...
(1) Try running the query once with MapReduce as the execution engine and then with Tez as the execution engine and see if you get differing results.
set hive.execution.engine=mr;
set hive.execution.engine=tez;
(2) Change the query to use a temporary table and evaluate your results that way (double-check my syntax but it should be close to correct...
create temporary table temp_table as select customer_email FROM Table1 WHERE TO_DATE(created_at) < '2016-05-31'; select DISTINCT SF.customer_email FROM Magento.sales_flat_order SF WHERE YEAR(TO_DATE(SF.created_at)) = '2016' AND MONTH(TO_DATE(SF.created_at)) = '6' AND SF.customer_email NOT IN (select customer_email from temp_table); select DISTINCT SF.customer_email FROM Magento.sales_flat_order SF left join temp_table on SF.customer_email = temp_table.customer_email WHERE YEAR(TO_DATE(SF.created_at)) = '2016' AND MONTH(TO_DATE(SF.created_at)) = '6' AND temp_table.customer_email is NULL