Row and Columnar Storage For Hive

Customers often ask about columnar storage formats for Hive tables and when to use them. It depends on your use cases. If your data access patterns mostly involve selecting a few columns to perform aggregations, then columnar storage will save disk space, reduce I/O when fetching data, and improve query execution time. In this article I tested the columnar ORC file format for Hive to quantify how it outperforms row-based text files for queries that aggregate and group on a few columns from a table with many columns.

ORC is a columnar storage format used in Hadoop for Hive tables. It is an efficient file format for storing data in which records contain many columns. An example is clickstream (web) data used to analyze website activity and performance, where queries typically retrieve only a subset of columns for each row.
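To make the comparison concrete, here is a minimal sketch of the DDL for the two table types. The table and column names are hypothetical; the difference that matters is the STORED AS clause.

    -- Hypothetical page-view table stored as row-based delimited text
    CREATE TABLE page_views_txt (
      view_time  TIMESTAMP,
      user_id    BIGINT,
      page_url   STRING,
      os         STRING
      -- ... remaining columns omitted for brevity
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Same data copied into a columnar ORC table
    CREATE TABLE page_views_orc
    STORED AS ORC
    AS SELECT * FROM page_views_txt;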

SUMMARY OF TEST RESULTS:

  • Comparing ORC vs Text files for storing 22 million clickstream page view records in Hive tables
  • Each row has 40 columns
  • Data files are uncompressed (i.e. no CODEC such as Snappy, ZLIB, LZO, etc…)
  • 52% Reduction in disk space when using ORC. High columnar storage compression:

    (Screenshot: disk space used by the Text table vs. the ORC table)

  • 97% Reduction in disk I/O when using ORC:

    (Screenshot: disk I/O for the Text table vs. the ORC table)

  • 21% Improvement in HiveQL query execution time when using ORC:

    (Screenshot: HiveQL query execution time, Text vs. ORC)

    The HiveQL query counts the number of page views by operating system and then ranks each operating system by page views in descending order (a sketch of this query appears after this list). Sample output showing a few rows:

    (Screenshot: sample query output)

  • Using the ORC file format does not always mean significantly lower memory and CPU usage for analytical queries than row-based text files at high data volumes. In fact, memory usage can be greater with ORC. You can reduce memory usage by optionally compressing the data with a codec such as ZLIB or Snappy (see the sketch below); however, CPU time will then increase because of compression and decompression.
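
For reference, a minimal sketch of the kind of aggregation query described above, using the hypothetical table and column names from the DDL example:

    -- Count page views per operating system, most viewed first
    SELECT os,
           COUNT(*) AS page_views
    FROM page_views_orc
    GROUP BY os
    ORDER BY page_views DESC;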

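If you do choose to trade CPU time for lower storage and I/O, compression is set per table through ORC table properties. A sketch, again with hypothetical names:

    -- ORC table compressed with ZLIB; use "SNAPPY" for a lighter-weight codec
    CREATE TABLE page_views_orc_zlib
    STORED AS ORC
    TBLPROPERTIES ("orc.compress"="ZLIB")
    AS SELECT * FROM page_views_txt;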
Comments

Great post Binu! What storage format would you suggest if you plan on loading the Hive table into a DataFrame and running an iterative process (machine learning algorithm x) against the data? I'm hard-pressed to find any kind of discussion on this concept.