Explorer
Posts: 6
Registered: 12-01-2015

Recommended file size for Impala Parquet files?


Looking for some guidance on the size and compression of Parquet files for use in Impala. We have written a Spark program that creates our Parquet files, and we can control the size and compression of the files (Snappy, Gzip, etc.). Now we just need to make a decision on their size and compression.
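
For reference, this is roughly the shape of the write path in our Spark job (a simplified sketch, assuming Spark 2.x; the paths, partition column, and size estimates are placeholders, not our real values):

```scala
// Simplified sketch of how we set the codec and steer output file size.
// Paths, the partition column, and the size estimates are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-writer")
  .getOrCreate()

// Parquet codec for this session: "snappy", "gzip", or "uncompressed".
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

val df = spark.read.json("/data/staging/events")      // placeholder input

// Rough file-size control: pick a partition count so each task writes
// roughly one file near the target size. With partitionBy, the real
// per-file size also depends on how the partition keys spread across tasks.
val targetFileBytes    = 1024L * 1024 * 1024          // e.g. ~1 GB per file
val inputBytesEstimate = 200L * 1024 * 1024 * 1024    // placeholder estimate
val numFiles = math.max(1, (inputBytesEstimate / targetFileBytes).toInt)

df.repartition(numFiles)
  .write
  .mode("overwrite")
  .partitionBy("event_date")
  .parquet("/data/warehouse/events_parquet")
```

The open question is what to plug in for the target file size, and which codec to set.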


I've read that Snappy is splittable and that you should make your files bigger than your HDFS block size. So if our block size is 128 MB, would 1 GB Snappy-compressed Parquet files be a good choice? Is it possible to increase or decrease the amount of compression with Snappy?


Gzip gives us better compression, but since it's not splittable, what should we set the max file size to if we use Gzip? Should it be no more than our HDFS block size?


We will have lots of partitions, and some of them will be large: hundreds of GB or bigger. The total amount of data will be hundreds of terabytes or more.

Champion
Posts: 601
Registered: 05-16-2016

Re: Recommended file size for Impala Parquet files?

Are you looking for fast compression or decompression?

Or are you mainly looking to minimize disk space consumption?


Explorer
Posts: 6
Registered: 12-01-2015

Re: Recommended file size for Impala Parquet files?


Good question. I suppose I'm looking for an optimal balance of both, maybe with a lean toward performance rather than disk space usage. We don't care about compression speed, but we do care about compression ratio and decompression speed.

We're currently using LZO-compressed sequence files and Hive to query the data. We've decided to convert the data over to Parquet for use with Impala because we see huge query-performance gains in our smaller dev environment. We'd also like to pick a future-proof option that will work well with other tools that can query/analyze a large data set in Parquet format.
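
For what it's worth, the conversion step itself looks simple on our side; roughly the sketch below (assuming Spark 2.x with Hive support, so the existing table's LZO sequence-file input format is handled through the Hive metastore; table, column, and path names are placeholders):

```scala
// Sketch of the planned migration: read the existing LZO sequence-file table
// through Hive, rewrite it as Snappy-compressed Parquet for Impala.
// Table, path, and column names are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("seqfile-to-parquet")
  .enableHiveSupport()
  .getOrCreate()

spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

spark.table("raw_db.events_seq")            // existing Hive table over sequence files
  .write
  .mode("overwrite")
  .partitionBy("event_date")
  .parquet("/warehouse/events_parquet")     // directory an Impala table will point at
```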

The data is constantly flowing in from Flume, but once it goes through our ingestion/ETL pipeline it won't change.

I haven't been able to find much info on the web about whether to compress Parquet with splittable Snappy or non-splittable Gzip, or discussion about optimal file sizes on HDFS for each.

Champion
Posts: 601
Registered: 05-16-2016

Re: Recommended file size for Impala Parquet files?

Based on your response, I would go for Parquet with splittable Snappy. Moreover:

Hive - default compression is DeflateCodec
Impala - default compression is Snappy


I'll put a link to my response on a similar post in the community below; it will give you a little more info.


http://community.cloudera.com/t5/Batch-SQL-Apache-Hive/Parquet-table-snappy-compressed-by-default/m-...


Please let me know if that helps.

Explorer
Posts: 6
Registered: 12-01-2015

Re: Recommended file size for Impala Parquet files?

Thanks for the response, good info. We've decided to do some testing and profiling ourselves to have a little more confidence before we start the migration. We'll run a much smaller data set through variations of file size, block size, and compression and see which performs best in Impala.

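The test matrix is basically the sketch below (same Spark write path as before; the codecs, file counts, and paths are just the variations we plan to try, not settled values):

```scala
// Sketch of the profiling run: write one sample data set out in several
// codec / file-count combinations; the file count varies the average file size.
import org.apache.spark.sql.SparkSession

val spark  = SparkSession.builder().appName("parquet-size-test").getOrCreate()
val sample = spark.read.parquet("/data/test/sample_input")   // small representative slice

for {
  codec    <- Seq("snappy", "gzip")
  numFiles <- Seq(8, 32, 128)
} {
  spark.conf.set("spark.sql.parquet.compression.codec", codec)
  sample.repartition(numFiles)
    .write
    .mode("overwrite")
    .parquet(s"/data/test/parquet_${codec}_${numFiles}files")
}
```

Each output directory then gets its own CREATE EXTERNAL TABLE ... STORED AS PARQUET in Impala, and we time the same queries against every variant.
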
Champion
Posts: 601
Registered: 05-16-2016

Re: Recommended file size for Impala Parquet files?

@medloh Any time, mate.
