Created on 12-01-201603:21 AM - edited 08-17-201907:44 AM
Row and Columnar Storage For Hive
Customers often ask about columnar storage formats for Hive tables and
when to use them. It depends on your uses cases. If your data access
patterns mostly involve selecting a few columns to perform aggregations,
then using columnar storage will save disk space, reduce I/O when
fetching data, and improve query execution time. In this article I
tested the columnar ORC file format for Hive to quantify how it
outperforms row based Text files when used for queries that aggregate and group on a few columns from a table with many columns.
ORC is a columnar storage format used in Hadoop for Hive tables. It is an efficient file format for storing data in which records contain many columns. An example is Clickstream (web) data to analyze website activity and performance. Queries typically retrieve a subset of columns for each row.
SUMMARY OF TEST RESULTS:
Comparing ORC vs Text files for storing 22 million clickstream page view records in Hive tables
Each row as 40 columns
Data files are uncompressed (i.e. no CODEC such as Snappy, ZLIB, LZO, etc…)
52% Reduction in disk space when using ORC. High columnar storage compression:
97% Reduction in disk I/O when using ORC:
21% Improvement in HiveQL query execution time when using ORC:
HiveQL is counting the number of page views by Operating System
and then ranking each Operating System by order of most page views in
descending order. Sample output showing a few rows:
Using ORC file format may not always equate to significantly less Memory
and CPU usage for analytical queries than if using row based Text files
for high volumes of data. In fact, memory usage could be greater when
using ORC format. You can optimize Memory usage by optionally
compressing the data using a CODEC such as ZLIB or Snappy; however, CPU
time will then increase because of compression and decompression.
Great post Binu! What storage format would you suggest if you plan on storing the hive table into a dataframe and running an iterative process (machine learning algorithm x) against the data? I’m hard pressed to find any kind of discussions on this concept.