
Generation strategy for Parquet files

Rising Star

I have described an issue with time-consuming Parquet file generation in the Hive forum. See this post for a description of the environment. The question is half Impala-related, so I would appreciate it if any Impala experts here could read that post as well.

 

https://community.cloudera.com/t5/Batch-SQL-Apache-Hive/How-to-improve-performance-when-creating-Par...

 

I have some additional questions that are Impala-specific. The environment currently has three Impala nodes with 5-10 GB of data in each partition. The question is how I should generate the Parquet files to get the best performance out of Impala.

 

Currently I target a Parquet file size of 1 GB. The HDFS block size is set to 256 MB for these files, and I have instructed Hive to create row groups of the same size. Surprisingly, I get many more row groups: I just picked a random file and it contained 91 row groups.

 

Given our environment, what should we aim for in terms of file size, number of row groups per file, and HDFS block size for the files? Also, if it would be more beneficial to have fewer row groups in each file, how can we instruct Hive to generate fewer row groups, since Hive does not seem to respect the parquet.block.size option?
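
For reference, the write side currently looks roughly like this in the Hive session (a sketch only; table and column names are placeholders, and the values mirror the targets described above):

SET dfs.blocksize=268435456;        -- 256 MB HDFS block size for the generated files
SET parquet.block.size=268435456;   -- request 256 MB row groups (the setting that does not appear to be respected)
INSERT OVERWRITE TABLE sales_parquet PARTITION (day)
SELECT col1, col2, day FROM sales_staging;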

 

We are using the Impala version bundled with CDH 5.7.1.

 

Thanks in advance,

Petter

 

1 ACCEPTED SOLUTION

Contributor

Hi,

 

Looking at your Hive question, parquet.block.size and dfs.blocksize should be honored, so I'm not sure what's going wrong. The Hive folks should be able to help you with that.

 

I can help you with the Impala side, however. 1 GB Parquet files with 4 row groups (256 MB each) should work just fine with respect to performance. The key is that row group boundaries should preferably fall on block boundaries, i.e. the beginning or end of a row group shouldn't cross a block boundary. It will still work if a row group does cross block boundaries, but this causes some remote reads, which slow down scan time.

 

However, having 4 Impala nodes for this layout would be ideal, so that there is a higher chance that each row group is scanned by a different impalad. Alternatively, the easier thing to do would be to generate the data as described below.

 

The fastest scans for Parquet files in Impala come from having one row group per file, where the file fits entirely within a block (so 256 MB or less is preferable). This, however, wouldn't give you a tremendous boost in performance compared to multiple row groups per file (probably ~5% in the average case).
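
If generating the files through Impala itself is an option, a minimal sketch of that approach would be the following (run in an Impala session; table and column names are placeholders, and the exact behavior of the query option depends on your Impala version):

SET PARQUET_FILE_SIZE=256m;   -- aim for ~256 MB data files; Impala writes one row group per Parquet file
INSERT OVERWRITE TABLE sales_parquet PARTITION (day)
SELECT col1, col2, day FROM sales_staging;   -- partition column goes last in the select list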

 

In any case, if you're able to fix the Hive Parquet file generation issue, you should start seeing faster scans through Impala.

Rising Star

Hi,

 

Thank you very much for your reply! Just a follow-up question.

 

Suppose we target roughly 10 GB of data stored as gzipped Parquet in each partition. We currently have three nodes, but that number will increase soon. From an Impala performance perspective, which of the approaches below is best?

 

- Store the data in 40 Parquet files with file size = row group size = HDFS block size = 256 MB

- Store the data in 10 Parquet files with file size = row group size = HDFS block size = 1 GB

- Store the data in 10 Parquet files with file size 1 GB and row group size = HDFS block size = 256 MB

 

Thanks,

Petter


Contributor

Hi Pettax,

 

I would say options 1 and 3 would be very similar in performance and allow for the best distribution of data across the cluster. I wouldn't opt for option 2.
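
If you go for option 1 (or want the 256 MB row groups of option 3 spread over more, smaller files), one common Hive-side trick for steering the number of output files per partition is to spread the rows over more reducers, for example with DISTRIBUTE BY. A rough sketch, with placeholder table and column names and a hypothetical bucket count (the actual file count also depends on Hive's reducer settings):

SET dfs.blocksize=268435456;        -- 256 MB HDFS blocks
SET parquet.block.size=268435456;   -- 256 MB row groups, aligned with the block size
INSERT OVERWRITE TABLE sales_parquet PARTITION (day)
SELECT col1, col2, day
FROM sales_staging
DISTRIBUTE BY day, FLOOR(RAND() * 40);   -- roughly 40 reducer buckets per day, i.e. roughly 40 files per partition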