Best compression technique?
Labels: Apache Hive
Created on 10-20-2017 06:13 PM - edited 09-16-2022 05:25 AM
Hi, I have tables in Hive. Some are in text format and some are ORC. What are the best compression methods for the text-format tables so that I can still query them?
Created 10-20-2017 06:39 PM
We are using ORC with Snappy on our production cluster. Here are a few useful links on compression:
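For readers who want to try this setup, here is a minimal sketch of an ORC table compressed with Snappy; the table and column names (events_orc, events_text, and so on) are made up for illustration, not taken from the thread:

```sql
-- Minimal sketch: an ORC table compressed with Snappy.
-- Table and column names are hypothetical.
CREATE TABLE events_orc (
  event_id   BIGINT,
  event_time TIMESTAMP,
  payload    STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Rewriting an existing text-format table into the ORC table
-- stores the data block-compressed with Snappy.
INSERT OVERWRITE TABLE events_orc
SELECT event_id, event_time, payload
FROM events_text;
```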
Created 10-20-2017 06:25 PM
Text files: since text-format tables are included, choose a compression algorithm that supports splits (gzip is a big no).
ORC: ORC does block-level compression internally, so ORC files are always splittable.
Overall: use ZLIB, Snappy, or LZO, keeping splittability in mind; see the sketch below.
orc.compress: the high-level compression setting for ORC (one of NONE, ZLIB, SNAPPY).
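To make the text-file advice concrete, here is a hedged sketch of producing splittable compressed text with Hive. BZip2 is used because it is a splittable codec for plain text (LZO is splittable only when indexed); the table names are hypothetical and not from the original thread:

```sql
-- Sketch: write query output as BZip2-compressed text.
-- BZip2 is splittable, so the resulting files can still be
-- processed by multiple mappers; gzip output would not be.
SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.BZip2Codec;

-- Hypothetical target table, plain delimited text.
CREATE TABLE logs_text_bz2 (
  log_line STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Hypothetical source table; the rewritten data lands as .bz2 files
-- that Hive can still query transparently.
INSERT OVERWRITE TABLE logs_text_bz2
SELECT log_line FROM logs_text_raw;
```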
Created 10-20-2017 06:53 PM
How about Parquet?
Created 10-21-2017 12:06 AM
ORC and Parquet are both columnar storage formats, and they are competitors in terms of support and development.
Created 10-21-2017 05:48 AM
ORC is the best option within Hive, and Parquet is the best option across the wider Hadoop ecosystem.
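For comparison with the ORC example above, a Parquet-backed Hive table with compression can be declared along these lines; the parquet.compression table property and the example names are a sketch based on common usage, not a verified recommendation for any particular Hive version:

```sql
-- Sketch: a Parquet table in Hive with Snappy compression.
-- Table and column names are hypothetical; parquet.compression
-- is assumed to be honored by the Hive/Parquet version in use.
CREATE TABLE events_parquet (
  event_id   BIGINT,
  event_time TIMESTAMP,
  payload    STRING
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "SNAPPY");
```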
