Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

No mention of ORC in reducer logs for select query.

No mention of ORC in reducer logs for select query.

New Contributor

I created a ORC table A:

create table A like B;
alter table A set fileformat orc;
insert into table A select * from B sort by age; 

But, When I run query,

select id from A where age=60; 

The data read from HDFS is almost equal to the size of A. I expected lesser data due to predicate pushdown. Also, in reducer logs there is no mention of ORC.

4 REPLIES 4
Highlighted

Re: No mention of ORC in reducer logs for select query.

Contributor

@Akshat Mathur

Even if hive.optimize.ppd is set to true, skipping chunks of data depend upon on how your data is organized.

Can you share following details :

1. total records in the table

2. Min and max value of the column

3. orc.stripe.size and orc.row.index.stride

Re: No mention of ORC in reducer logs for select query.

New Contributor

1) 43,89,73,994 rows

2) min 0 and max 99 (age column also contains NULL, empty values)

3) stripe size is 256MB and stride is 10,000

I sorted the data by age while inserting.

Re: No mention of ORC in reducer logs for select query.

Contributor
@Akshat Mathur

Please use Distribute by and Sort by or Cluster by for ordering because in only sort by reducers can have overlapping ranges, it sorts only at the reducer level.

For this particular scenario, you can even increase the stride to 15000 as you have non-unique values for that column [43,89,73,994 records for a small range]

Re: No mention of ORC in reducer logs for select query.

New Contributor

@rtrivedi Thank for the reply. Is this what you are saying?

create table A like B;
alter table A set fileformat orc;
alter table A set tblproperties("orc.row.index.stride"="15000");
insert into table A select * from B cluster by age;