12-04-2017 04:46 PM - last edited on 12-05-2017 05:42 AM by cjervis
I'm looking for some guidance on the size and compression of Parquet files for use in Impala. We have written a Spark program that creates our Parquet files, and we can control both the file size and the compression codec (Snappy, Gzip, etc.). Now we just need to decide on the size and the codec.
I've read that Snappy is splittable and that files should be bigger than the HDFS block size. So if our block size is 128 MB, would 1 GB Snappy-compressed Parquet files be a good choice? Is it possible to increase or decrease the amount of compression with Snappy?
Gzip gives us better compression. But since it's not splittable, what should we set the maximum file size to if we use Gzip? Should it be no more than our HDFS block size?
We will have lots of partitions, and some of them will be large: hundreds of GB or bigger. The total amount of data will be hundreds of terabytes or more.
12-05-2017 09:46 AM - edited 12-05-2017 10:11 AM
Good question. I suppose I'm looking for an optimal balance of both, maybe with a lean toward performance rather than disk space usage. We don't care about compression speed, but we do care about compression ratio and decompression speed.
We're currently using LZO-compressed sequence files and Hive to query the data. We've decided to convert the data to Parquet for use with Impala because we see huge query performance gains in our smaller dev environment. We'd also like to pick a future-proof option that will work well with other tools that can query and analyze a large set of data in Parquet format.
The data is constantly flowing in from Flume, but once it has gone through our ingestion pipeline/ETL it won't change.
I haven't been able to find much info on the web about whether to compress Parquet with splittable Snappy or non-splittable Gzip, or discussion about optimal file sizes on HDFS for each.
12-05-2017 08:30 PM
Based on your response, I would go for Parquet with splittable Snappy. Moreover, note the default codecs:
Hive - default compression is DeflateCodec
Impala - default compression is Snappy
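For what it's worth, here is a minimal sketch of how the file count and codec could be set from the Spark job. It assumes `df` is a PySpark DataFrame. The helper names `files_for_partition` and `write_snappy_parquet` and the 1 GB target are made up for illustration, and the partition size is whatever estimate you have of the compressed output, so treat the resulting file sizes as approximate.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def files_for_partition(partition_bytes: int, target_file_bytes: int) -> int:
    """Return how many files are needed so each holds at most target_file_bytes.

    partition_bytes is an estimate of the compressed size of one partition.
    """
    if partition_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be non-negative, target must be positive")
    # Integer ceiling division; always write at least one file.
    return max(1, (partition_bytes + target_file_bytes - 1) // target_file_bytes)


def write_snappy_parquet(df, path: str, num_files: int) -> None:
    """Write df (assumed to be a pyspark.sql.DataFrame) as Snappy Parquet.

    repartition(num_files) controls how many output files Spark produces;
    the "compression" writer option selects the Parquet codec.
    """
    (
        df.repartition(num_files)
        .write
        .option("compression", "snappy")
        .parquet(path)
    )


# Example: a partition estimated at 300 GB, aiming for roughly 1 GB files.
num_files = files_for_partition(300 * GB, 1 * GB)
```

Changing the `"compression"` option to `"gzip"` would switch the codec without touching the sizing logic, which makes it easy to compare both on a sample partition before committing to one.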
I would put a link to my response to a similar post in the community, which will give you a little more info.
Please let me know if that helps.
12-12-2017 08:37 AM