ORC is a columnar storage format for Hive.
This document explains how properly created ORC data files
improve read/scan performance when querying the data. The TEZ execution engine provides
different ways to optimize a query, but it does its best work with correctly
created ORC files.
ORC Creation Strategy.
CREATE [EXTERNAL] TABLE OrcExampleTable
(clientid int, name string, address string, age int)
STORED AS ORC;
Ingesting data into Hive tables depends heavily on usage
patterns. To make queries run efficiently, ORC files should be
created to support those patterns:
- Identify the important/frequent queries that will run against your data set (based on
filter or JOIN conditions)
- Configure an optimal data file size
- Configure stripe and stride sizes
- Distribute and sort the data
- Run "ANALYZE TABLE" to keep statistics updated (example below)
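For the example table, keeping both table-level and column-level statistics current can look like this (standard Hive syntax; run after significant data changes):

ANALYZE TABLE orcexampletable COMPUTE STATISTICS;
ANALYZE TABLE orcexampletable COMPUTE STATISTICS FOR COLUMNS;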
Filters are used mainly in the "WHERE" clause and in "JOIN … ON" conditions.
Information about the fields used in those filters should also drive the
choice of strategy for ORC file creation.
select * from orcexampletable
where clientid=100 and age between 25 and 45;
Does size matter?
As is well known, small files are a pain in HDFS, and ORC files are
no different; if anything, they are worse.
First of all, small files impact NameNode memory and
performance. But more important is query response time: if ingestion
jobs generate small files, there will be a large number of files in
each table or partition directory.
When a query is submitted, TEZ needs information about
the files in order to build an execution plan and allocate resources from YARN.
So, before the TEZ engine starts a job:
- TEZ gets information
from HCat about the table location and partition keys. Based on this information,
TEZ has the exact list of directories (and subdirectories) where data files
can be found.
- TEZ reads ORC footers and
stripe-level indexes in each file in order to determine how many blocks
of data it will need to process. This is where a large number of
files will impact job submission time.
- TEZ requests containers
based on the number of input splits. Again, small files leave less flexibility
in configuring the input split size, and as a result, a larger number of containers
will need to be allocated.
Note: if the query submission stage times out, check the number of
ORC files (also, see below how the ORC split strategy (ETL vs BI) can affect query submission time).
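The split strategy can be switched per session: with the BI strategy, TEZ generates splits without reading ORC footers, while ETL reads footers and stripe indexes up front:

set hive.exec.orc.split.strategy=BI;
-- or: set hive.exec.orc.split.strategy=ETL;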
There is always a trade-off between ingestion and query
performance. Keep the number of ORC files being created to a minimum, while still
satisfying an acceptable level of ingestion performance and data latency.
For transactional data being ingested continuously during
the day, set up a daily table/partition re-build process to optimize the number of
files and the data distribution.
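A minimal sketch of such a re-build, assuming a hypothetical partitioned copy of the example table (partition key load_date) and a staging table holding the day's continuously ingested small files:

-- Hypothetical names: orcexampletable is assumed partitioned by load_date here;
-- orcexampletable_staging holds the day's raw, fragmented ingest output.
INSERT OVERWRITE TABLE orcexampletable PARTITION (load_date='2016-12-29')
SELECT clientid, name, address, age
FROM orcexampletable_staging
SORT BY clientid;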
Stripes and Strides.
ORC files are splittable at the stripe level. Stripe
size is configurable and should depend on the average length (size) of records and
on how many unique values the sorted fields can have. If the search-by
field is unique (or almost unique), decrease the stripe size; if it is heavily repeated,
increase it. While the default is 64 MB, keep the stripe size between ¼ of the block size
and 4 block sizes (the default ORC block size is 256 MB). Along with that, you can
play with the input split size per job to decrease the number of containers required
for a job. Sometimes it is even worth reconsidering the HDFS block size (the default HDFS
block size is 128 MB).
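A hedged example of both knobs (the values are illustrative, not recommendations): stripe size is set per table at creation time via TBLPROPERTIES, and split grouping per session via the TEZ grouping settings:

CREATE TABLE OrcExampleTable
(clientid int, name string, address string, age int)
STORED AS ORC
TBLPROPERTIES ('orc.stripe.size'='67108864');  -- 64 MB stripes

set tez.grouping.min-size=268435456;   -- 256 MB lower bound per input split
set tez.grouping.max-size=1073741824;  -- 1 GB upper bound per input split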
A stride is a set of records for which a range index
(min/max and some additional stats) is created. Stride size is a number of
records (default 10K): for unique (or close to unique) combinations of the values
of the fields in the bloom filter, go with 3-7K records; for non-unique ones,
7-15K records or even more. If the bloom filter contains unsorted fields, that is
also a reason to go with a smaller number of records per stride.
A bloom filter can be used on the sorted field in
combination with additional fields that can participate in search-by clauses.
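Putting the stride and bloom filter guidance together (standard ORC table properties; the stride of 5,000 and the fpp value are illustrative):

CREATE TABLE OrcExampleTable
(clientid int, name string, address string, age int)
STORED AS ORC
TBLPROPERTIES (
  'orc.row.index.stride'='5000',             -- records per stride
  'orc.bloom.filter.columns'='clientid,age', -- sorted field plus a frequent filter field
  'orc.bloom.filter.fpp'='0.05');            -- bloom filter false-positive probability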
Sorting and Distribution.
The most important factor for efficient search within a data set is
how that set is stored.
Since TEZ utilizes ORC file-level information (min/max range
index per field, bloom filter, etc.), it is important that those ranges
give the best possible reference to the exact block of data containing the desired values.
Here is an example: with unsorted data, TEZ requests
4 containers and performs up to a full table scan, while with sorted data it needs
only a single container reading a single stripe.
The best result can be achieved by globally sorting the data
in a table (or partition).
Global sorting in Hive ("ORDER BY") enforces a single reducer
to sort the final data set, which can be inefficient. That is where "DISTRIBUTE BY"
comes to help.
For example, let's say we have a daily partition with 200 GB
and a field "clientid" that we would like to sort by. Assuming we have enough
power (cores) to run 20 parallel reducers, we can:
1. Limit the number of reducers to 20.
2. Distribute all the records
to the 20 reducers equally (a concrete sketch follows after this list):
insert into …
select … from …
distribute by floor(clientid/((<max(clientid)> - <min(clientid)> + 1)/20))
sort by clientid;
Note: this will work well if client ID values are distributed evenly on the scale
between the min and max values. If that's not the case, find a better distribution
function, but ensure that the ranges of values going to different reducers do not
overlap.
3. Alternatively, use PIG to order by clientid (with PARALLEL 20).
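As a concrete sketch of steps 1 and 2, assuming clientid values are spread roughly evenly between 1 and 1,000,000 (illustrative numbers; the target table name is hypothetical):

set mapreduce.job.reduces=20;                  -- step 1: cap reducers at 20
INSERT OVERWRITE TABLE orcexampletable_sorted  -- hypothetical sorted copy of the table
SELECT clientid, name, address, age
FROM orcexampletable
DISTRIBUTE BY floor(clientid/((1000000 - 1 + 1)/20))
SORT BY clientid;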