I created a ORC table A:
create table A like B; alter table A set fileformat orc; insert into table A select * from B sort by age;
But, When I run query,
select id from A where age=60;
The data read from HDFS is almost equal to the size of A. I expected lesser data due to predicate pushdown. Also, in reducer logs there is no mention of ORC.
Even if hive.optimize.ppd is set to true, skipping chunks of data depend upon on how your data is organized.
Can you share following details :
1. total records in the table
2. Min and max value of the column
3. orc.stripe.size and orc.row.index.stride
1) 43,89,73,994 rows
2) min 0 and max 99 (age column also contains NULL, empty values)
3) stripe size is 256MB and stride is 10,000
I sorted the data by age while inserting.
Please use Distribute by and Sort by or Cluster by for ordering because in only sort by reducers can have overlapping ranges, it sorts only at the reducer level.
For this particular scenario, you can even increase the stride to 15000 as you have non-unique values for that column [43,89,73,994 records for a small range]
@rtrivedi Thank for the reply. Is this what you are saying?
create table A like B; alter table A set fileformat orc; alter table A set tblproperties("orc.row.index.stride"="15000"); insert into table A select * from B cluster by age;