Recommended file size for Impala Parquet files?

Contributor

Looking for some guidance on the size and compression of Parquet files for use in Impala. We have written a Spark program that creates our Parquet files, and we can control the size and compression of the files (Snappy, Gzip, etc.). Now we just need to make a decision on their size and compression.

 

I've read that Snappy is splittable and that you should make your files bigger than your HDFS block size. So if our block size is 128 MB, would 1 GB Snappy-compressed Parquet files be a good choice? Is it possible to increase or decrease the amount of compression with Snappy?

 

Gzip gives us better compression, but since it's not splittable, what should we set the max file size to if we use Gzip? Should it be no more than our HDFS block size?

 

We will have lots of partitions, and some of them will be large, hundreds of GB or bigger. The total amount of data will be hundreds of terabytes or more.
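For reference, here's a minimal sketch of the kind of Spark (Scala) write we're doing, just to show where the codec, HDFS block size, and Parquet row-group size can be set. The paths, partition count, and 128 MB target below are placeholders, not our actual job.

```scala
import org.apache.spark.sql.SparkSession

object ParquetWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-write-sketch").getOrCreate()

    // Hypothetical input produced by the ETL pipeline; substitute your own DataFrame.
    val df = spark.read.parquet("/data/staging/events")

    // Aim for ~128 MB HDFS blocks and ~128 MB Parquet row groups so that a
    // row group fits inside a single block (placeholder target size).
    val targetBytes = 128L * 1024 * 1024
    spark.sparkContext.hadoopConfiguration.setLong("dfs.blocksize", targetBytes)
    spark.sparkContext.hadoopConfiguration.setLong("parquet.block.size", targetBytes)

    df.repartition(200)                  // tune so each output file lands near the target size
      .write
      .option("compression", "snappy")   // or "gzip"
      .mode("overwrite")
      .parquet("/data/warehouse/events_parquet")

    spark.stop()
  }
}
```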


Champion

Are you looking for fast compression or fast decompression?

Or are you mainly looking to minimize disk space consumption?

 

Contributor

Good question. I suppose I'm looking for an optimal balance of both, maybe with a lean toward performance rather than disk space usage. We don't care about compression speed, but we do care about compression ratio and decompression speed.

We're currently using LZO-compressed sequence files and Hive to query the data. We've decided to convert the data to Parquet for use with Impala because we see huge query performance gains in our smaller dev environment. We'd also like to pick a future-proof option that will work well with other tools that can query/analyze a large set of data in Parquet format.

The data is constantly flowing in from Flume, but once it goes through our ingestion pipeline/ETL it won't change.

I haven't been able to find much info on the web about whether to compress Parquet with splittable Snappy or non-splittable Gzip, or discussion about optimal file sizes on HDFS for each.

Champion

Based on your response, I would go for Parquet with splittable Snappy. Moreover:

Hive - default compression is DeflateCodec
Impala - default compression is Snappy

 

Here's a link to my response on a similar post in the community that will give you a little more info.

 

http://community.cloudera.com/t5/Batch-SQL-Apache-Hive/Parquet-table-snappy-compressed-by-default/m-...

 

Please let me know if that helps.

Contributor
Thanks for the response, good info. We've decided to do some testing and profiling ourselves to have a little more confidence before we start the migration. We're going to take a much smaller data set, try some variations of file size, block size, and compression, and see which performs best in Impala.
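In case it's useful, here's a rough sketch of the kind of test harness we have in mind. It's only an illustration with made-up paths and a hypothetical sample DataFrame: it writes the same sample once per codec/block-size combination (file-size variation via repartitioning is left out for brevity), so each copy can then be queried from Impala and compared.

```scala
import org.apache.spark.sql.SparkSession

object ParquetSizingBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-sizing-benchmark").getOrCreate()

    // Hypothetical sample carved out of the production data.
    val sample = spark.read.parquet("/benchmarks/events_sample")

    val mb = 1024L * 1024
    val combos = Seq(          // (codec, HDFS block / Parquet row-group size)
      ("gzip",    128 * mb),
      ("gzip",   1024 * mb),
      ("snappy",  128 * mb),
      ("snappy", 1024 * mb)
    )

    for ((codec, blockBytes) <- combos) {
      // Keep the HDFS block size and the Parquet row-group size in step for each run.
      spark.sparkContext.hadoopConfiguration.setLong("dfs.blocksize", blockBytes)
      spark.sparkContext.hadoopConfiguration.setLong("parquet.block.size", blockBytes)

      sample.write
        .option("compression", codec)
        .mode("overwrite")
        .parquet(s"/benchmarks/parquet_${codec}_${blockBytes / mb}mb")
    }

    spark.stop()
  }
}
```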

Champion

@medloh Any time, mate.

Contributor

After some experimenting, here's what we found. I'd be interested to hear if anyone enthusiastically agrees or passionately disagrees with this.

 

Tested the following scenarios:

 

1) 128 MB block, 128 MB file, Gzip

2) 128 MB block, 1 GB file, Gzip

3) 1 GB block, 1 GB file, Gzip

4) 128 MB block, 128 MB file, Snappy

5) 128 MB block, 1 GB file, Snappy

6) 1 GB block, 1 GB file, Snappy

 

The worst in both storage and performance seemed to be the two cases where the block size was much smaller than the file size, for both compression formats, so strike out #2 and #5.

 

The performance for #1, #3, #4, and #6 all seemed very similar in the queries we tested, but Gzip used only about 60% as much storage. So we're probably going with Gzip.

 

Finally, we're thinking the smaller block and file size is probably the way to go, to get a little more parallelism.

Gzip decompression will definitely use more CPU than Snappy decompression, so I'd usually expect Gzip to give you worse performance, unless your query is limited by disk I/O (in which case the smaller Gzip files are better) or your query isn't limited by scan performance at all.