Generation strategy for Parquet files

Rising Star

I have described an issue with time-consuming Parquet file generation in the Hive forum. See the post below for a description of the environment. The question is partly Impala related, so I would appreciate it if any Impala experts here could read that post as well.

 

https://community.cloudera.com/t5/Batch-SQL-Apache-Hive/How-to-improve-performance-when-creating-Par...

 

I have some additional questions that are Impala specific. The environment currently has three Impala nodes with 5-10 GB of data in each partition. The question is how I should generate the Parquet files to get the most performance out of Impala.

 

Currently I target a Parquet file size of 1 GB. The HDFS block size is set to 256 MB for these files, and I have instructed Hive to create row groups of the same size. Surprisingly, I get many more row groups than that: I just picked a random file and it contained 91 row groups, which works out to roughly 11 MB per row group on average instead of the intended 256 MB.
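For reference, this is roughly how I check the per-partition file counts and sizes from impala-shell; the table name and partition spec below are placeholders for our actual schema. A row-group count like the 91 above has to come from a Parquet-level tool such as parquet-tools meta, since Impala only reports the files and their sizes.

-- Placeholder table/partition names; lists each data file in the partition and its size.
SHOW FILES IN my_parquet_table PARTITION (day='2016-07-01');

-- Per-partition #Files, Size and Format (#Rows shows -1 until COMPUTE STATS is run).
SHOW TABLE STATS my_parquet_table;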

 

Given our environment, what should we aim for in terms of file size, number of row groups per file, and HDFS block size for the files? Also, if it would be more beneficial to have fewer row groups in each file, how can we instruct Hive to generate fewer row groups, since Hive does not seem to respect the parquet.block.size option?
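For context, this is roughly what we set in the Hive session before writing the partitions; the table and query below are simplified placeholders for our actual job, and even with these settings the row groups come out far smaller than 256 MB.

-- Intended row group size and HDFS block size, in bytes (256 MB).
SET parquet.block.size=268435456;
SET dfs.block.size=268435456;
-- We store the data gzip-compressed.
SET parquet.compression=GZIP;

-- Simplified placeholder for the actual insert into the Parquet table.
INSERT OVERWRITE TABLE target_parquet PARTITION (day='2016-07-01')
SELECT col1, col2, col3
FROM source_text
WHERE day='2016-07-01';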

 

We are using the Impala version bundled with CDH 5.7.1.

 

Thanks in advance,

Petter

 


Rising Star

Hi,

 

Thank you very much for your reply! Just a follow-up question.

 

Say we target roughly 10 GB of gzip-compressed Parquet data in each partition. We currently have three nodes, but that number will increase soon. From an Impala performance perspective, which of the approaches below is better?

 

1. Store the data in 40 Parquet files with file size = row group size = HDFS block size = 256 MB.
2. Store the data in 10 Parquet files with file size = row group size = HDFS block size = 1 GB.
3. Store the data in 10 Parquet files with file size = 1 GB and row group size = HDFS block size = 256 MB.

 

Thanks,

Petter

 

 

 

Contributor

Hi Pettax,

 

I would say option 1 and option 3 would be very similar in performance and allow for the best distribution of data across the cluster. I wouldn't opt for option 2.
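For what it's worth, if you load the data from Impala rather than Hive, something along these lines would target option 1 directly, since Impala writes a single row group per output file and sizes the files by the PARQUET_FILE_SIZE query option; the table names below are just placeholders.

-- In impala-shell: aim for ~256 MB gzip-compressed Parquet files, one row group each.
SET COMPRESSION_CODEC=gzip;
SET PARQUET_FILE_SIZE=268435456;

-- Placeholder table names; dynamic partitioning on the "day" column.
INSERT OVERWRITE TABLE target_parquet PARTITION (day)
SELECT col1, col2, col3, day
FROM source_table;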