Reply
New Contributor
Posts: 4
Registered: ‎02-01-2018

ORC file and Sequence file reading same data for a query with where clause?

I have a table A with a column age (string) in it. (table size is 74GB, Hive 0.13)

Now I created a table B with same data as of A, but in ORC file format and also included sort by age while creating B.

 

Now when I run the query:

select count(id) from X where age=25;

 

The data read from B table is same as size of B (I expected predicate pushdown) and also the time taken in A is almost equal to B.

Theoretically, because of predicate pushdown in ORC, a lot of data must have been skipped thus saving a lot of time. I suspect the indexes created in orc file are not getting read. I have tried almost everything. But nothing works. Please help with this.

Champion
Posts: 731
Registered: ‎05-16-2016

Re: ORC file and Sequence file reading same data for a query with where clause?

is table stats being collected in the table b ? 

what was the file fomart being used in table a  ? 

do you have any compression ? 

what are you looking for throughput or lantenacy ? 

New Contributor
Posts: 4
Registered: ‎02-01-2018

Re: ORC file and Sequence file reading same data for a query with where clause?

@csguna

Yes I have tried

SET hive.compute.query.using.stats=true;

SET hive.stats.fetch.column.stats=true;

Can you tell me which stats are these?

 

Table A was in sequence file format.

Yes by default ZLIB in enabled in B and snappy in A.

I am looking for latency.

Champion
Posts: 731
Registered: ‎05-16-2016

Re: ORC file and Sequence file reading same data for a query with where clause?

[ Edited ]

Impala COMPUTE STATS statement  gathers both table and column statistics. 

that will help us in query optimization. 

if it is a partioned table consider using COMPUTE INCREMENTAL STATS.

Would you consider switiching to  parquet format it will work nice with large sets of data as it is also a coloumnar storage layout. 

 

is your table is partioned or bucketed ? 

 

hive.cbo.enable property - is it enabled  ? 

 

 

 

 

 

New Contributor
Posts: 4
Registered: ‎02-01-2018

Re: ORC file and Sequence file reading same data for a query with where clause?

Actually not Impala but I run Hive 0.13.

My table is partitioned on date basis.

In Hive 0.13, I didn't found any CBO property.

Champion
Posts: 731
Registered: ‎05-16-2016

Re: ORC file and Sequence file reading same data for a query with where clause?

I would suggest you to tweak the  stripe.size , index.stride based on the size of your ORC file . 

Let  me know if that helps 

Announcements