Support Questions
Find answers, ask questions, and share your expertise

Hive: where + in does not use partition?

Hive: where + in does not use partition?

New Contributor

I am querying a large table that is partitioned on a field called day. If I run a query:

select * from my_table where day in ('2016-04-01', '2016-03-01')

I get many mappers and reducers and the query takes a long time to run.

If, however, I write a query:

select * from my_table where day = '2016-04-01' or day = '2016-03-01'

I get far less mappers and reducers and the query runs quickly. To me this suggests that indoes not take advantage of partitions in a table. Can anyone confirm this and explain why?

Hive Version: 1.2.1 Hadoop Version: 2.3.4.7-4

Details: I believe the relevant part of the execution plans are... Using Where or No filter operator at all Using Where in Filter Operator predicate: (day) IN ('2016-04-01', '2016-03-01') (type: boolean) Statistics: Num rows: 100000000 Data size: 9999999999

The hive docs just say: 'What partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns.'

But don't elaborate.

Thanks!

1 REPLY 1

Re: Hive: where + in does not use partition?

Cloudera Employee

Can you share your setting for the following parameter:

set hive.optimize.reducededuplication.min.reducer=1;

If not set as 1, how many reducers do you see when we are set at 1?