Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

ORC file and Sequence file reading same data for a query with where clause?

ORC file and Sequence file reading same data for a query with where clause?

New Contributor
I have a table A with a column age (string) in it. (table size is 74GB, Hive 0.13)

Now I created a table B with same data as of A, but in ORC file format and also included sort by age while creating B.

 

Now when I run the query:

select count(id) from X where age=25;

 

The data read from B table is same as size of B (I expected predicate pushdown) and also the time taken in A is almost equal to B.

Theoretically, because of predicate pushdown in ORC, a lot of data must have been skipped thus saving a lot of time. I suspect the indexes created in orc file are not getting read. I have tried almost everything. But nothing works. Please help with this.

5 REPLIES 5

Re: ORC file and Sequence file reading same data for a query with where clause?

Champion

is table stats being collected in the table b ? 

what was the file fomart being used in table a  ? 

do you have any compression ? 

what are you looking for throughput or lantenacy ? 

Re: ORC file and Sequence file reading same data for a query with where clause?

New Contributor

@csguna

Yes I have tried

SET hive.compute.query.using.stats=true;

SET hive.stats.fetch.column.stats=true;

Can you tell me which stats are these?

 

Table A was in sequence file format.

Yes by default ZLIB in enabled in B and snappy in A.

I am looking for latency.

Re: ORC file and Sequence file reading same data for a query with where clause?

Champion

Impala COMPUTE STATS statement  gathers both table and column statistics. 

that will help us in query optimization. 

if it is a partioned table consider using COMPUTE INCREMENTAL STATS.

Would you consider switiching to  parquet format it will work nice with large sets of data as it is also a coloumnar storage layout. 

 

is your table is partioned or bucketed ? 

 

hive.cbo.enable property - is it enabled  ? 

 

 

 

 

 

Re: ORC file and Sequence file reading same data for a query with where clause?

New Contributor

Actually not Impala but I run Hive 0.13.

My table is partitioned on date basis.

In Hive 0.13, I didn't found any CBO property.

Re: ORC file and Sequence file reading same data for a query with where clause?

Champion

I would suggest you to tweak the  stripe.size , index.stride based on the size of your ORC file . 

Let  me know if that helps