Looking for some guidance on the size and compression of Parquet files for use in Impala. We have written a spark program that creates our Parquet files and we can control the size and compression of the files (Snappy, Gzip, etc). Now we just need to make a decision on their size and compression.
I've read Snappy is splittable, and you should make your files bigger than your HDFS block size. So if our block size is 128mb, would 1GB snappy compressed parquet files be a good choice? Is it possible to increase or decrease the amount of compression with Snappy?
Gzip gives us better compression. But since it's not splittable, what should we set the max file size to if using gzip? Should it be no more than our HDFS block size?
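The worry here boils down to how many HDFS blocks one file spans: assuming a non-splittable stream (as the question does), a single reader has to pull every block of the file, most of them likely remote. A trivial back-of-envelope in pure Python, using the sizes from this thread:

```python
import math

BLOCK = 128 * 1024**2   # 128mb HDFS block size
FILE = 1024**3          # 1gb file

# A non-splittable stream means one reader must fetch all of these blocks.
blocks_per_file = math.ceil(FILE / BLOCK)
print(blocks_per_file)  # -> 8
```

Keeping the file at or below one block keeps that count at 1, which is why the block-size question matters for the gzip case.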
We will have lots of partitions and some of them will be large, hundreds of GB or bigger. The total amount of data will be hundreds of Terabytes or more.
Good question, I suppose I'm looking for an optimal balance of both, maybe with a lean toward performance rather than disk space usage. We don't care about compression speed, but do care about compression rate and decompression speed.
We're currently using sequence files lzo compressed and Hive to query the data. We've decided to convert the data over to Parquet to use with Impala because we see huge query performance gains in our smaller dev environment. We'd also like to pick a future-proof option that will work well with other tools that can query/analyze a large set of data in Parquet format.
The data is constantly flowing in from flume, but once it goes through our ingestion pipeline/ETL it won't change.
I haven't been able to find much info on the web about whether to compress Parquet with splittable Snappy or non-splittable Gzip, or discussion about optimal file sizes on HDFS for each.
Based on the responses, I would go for Parquet with splittable Snappy. Moreover, note the defaults:
Hive - default compression is DeflateCodec
Impala - default compression is Snappy
I'll put a link to my response to a similar post in the community that will give you a little more info.
Please let me know if that helps.
After some experimenting, here's what we found. I'd be interested if anyone enthusiastically agrees or passionately disagrees with this.
Tested the following scenarios:
1) 128mb block, 128mb file, gzip
2) 128mb block, 1gb file, gzip
3) 1gb block, 1gb file, gzip
4) 128mb block, 128mb file, snappy
5) 128mb block, 1gb file, snappy
6) 1gb block, 1gb file, snappy
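The file sizes in these scenarios were controlled on the Spark side. A rough way to hit a target file size is to repartition by estimated output size before writing; a sketch of the arithmetic in pure Python (the input size and compression ratio below are illustrative assumptions, not our measurements):

```python
import math

def num_output_files(input_bytes, compression_ratio, target_file_bytes):
    """Estimate how many partitions to write so each file lands near the target size."""
    est_output = input_bytes * compression_ratio
    return max(1, math.ceil(est_output / target_file_bytes))

# Illustrative: 100gb of raw input compressing to ~30% of its size, 1gb target files.
parts = num_output_files(100 * 1024**3, 0.3, 1024**3)
print(parts)  # -> 30
```

In Spark this would feed something like `df.repartition(parts).write.parquet(...)`; newer Spark versions also offer `spark.sql.files.maxRecordsPerFile` as an alternative knob.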
The worst in storage and performance seemed to be the 2 cases where the block size was much smaller than the file size in both compression formats, so strike out #2 and #5.
The performance for 1, 3, 4, and 6 all seemed to be very similar in the queries we tested. But gzip used only about 60% as much storage. So, probably going with gzip.
Finally we're thinking the smaller block and file size is probably the way to go to get a little more parallelism.
Gzip decompression will definitely use more CPU than Snappy decompression, so I'd usually expect gzip to give you worse performance, unless your queries are limited by disk I/O (in which case the smaller gzip files help), or unless your queries aren't limited by scan performance at all.
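That trade-off can be eyeballed with a toy scan model: scan time is roughly the slower of reading the compressed bytes off disk and decompressing them back to raw bytes. Every throughput number below is an illustrative assumption, not a measurement:

```python
def scan_seconds(uncompressed_gb, ratio, disk_gb_s, decomp_gb_s):
    """Scan time ~ max(time to read compressed bytes, time to decompress them)."""
    read = uncompressed_gb * ratio / disk_gb_s
    decompress = uncompressed_gb / decomp_gb_s
    return max(read, decompress)

# Fast disk: decompression dominates, so snappy's cheap CPU cost wins.
gzip_fast = scan_seconds(100, 1/3, disk_gb_s=0.5, decomp_gb_s=0.3)
snappy_fast = scan_seconds(100, 1/2, disk_gb_s=0.5, decomp_gb_s=1.5)

# Slow disk: reading dominates, so gzip's smaller files catch up.
gzip_slow = scan_seconds(100, 1/3, disk_gb_s=0.1, decomp_gb_s=0.3)
snappy_slow = scan_seconds(100, 1/2, disk_gb_s=0.1, decomp_gb_s=1.5)

print(gzip_fast > snappy_fast)   # CPU-bound case favors snappy
print(gzip_slow < snappy_slow)   # I/O-bound case favors gzip
```

Which regime you're in depends on your cluster, which is why the measured queries above came out so close.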