Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Best practices for partition tables in Impala 3.2.0

avatar
New Contributor

Hello,

We need to partition 5 hive tables with more than 100 million records each. Tables have multiple columns to partition with integer and string data allowing less than 1000 partitions per column. Could we combine several of these fields to partition the tables? would there be a restriction on the maximum number of combined partitions? Should they have a specific order according to the number of partitions in each table? for example, first partition by year (10 partitions) , then by country (50 partitions) and then by city name (500 partitions) ? Must the combination of partitions be less than a particular number?

Best Regards,

H.M

 

1 ACCEPTED SOLUTION

avatar

Hi @wcg_hdp_manager ,

Please review the Impala partitioning best practices guide:

https://docs.cloudera.com/best-practices/latest/impala-partitioning/topics/bp-impala-partitioning-co...

And CDP 7.1.8 Impala partitioning guide:

https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/impala-reference/topics/impala-partition.html 

Do not partition your table if you do not have a good reason to do so. The number of records (100m) itself is not necessarily is a reason. You need to know what kind of queries you will have on your table (do you always know in WHERE clause one or more columns so Impala can take advantage of partition pruning? if not, then anyway the whole dataset might be scanned), how you ingest those (do you load new partitions each day? or some other factors?).

Creating too many partitions will likely implicitly create too many small files instead of less but bigger files. Processing data which are in more datafiles is less efficient and you can put a stress on the HDFS NameNode (which needs to keep track of the many datafiles) if that is going to be a general trend.

Hope this helps,

 Miklos

View solution in original post

3 REPLIES 3

avatar
Community Manager

@wcg_hdp_manager Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our Hive expert @mszurap who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

avatar

Hi @wcg_hdp_manager ,

Please review the Impala partitioning best practices guide:

https://docs.cloudera.com/best-practices/latest/impala-partitioning/topics/bp-impala-partitioning-co...

And CDP 7.1.8 Impala partitioning guide:

https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/impala-reference/topics/impala-partition.html 

Do not partition your table if you do not have a good reason to do so. The number of records (100m) itself is not necessarily is a reason. You need to know what kind of queries you will have on your table (do you always know in WHERE clause one or more columns so Impala can take advantage of partition pruning? if not, then anyway the whole dataset might be scanned), how you ingest those (do you load new partitions each day? or some other factors?).

Creating too many partitions will likely implicitly create too many small files instead of less but bigger files. Processing data which are in more datafiles is less efficient and you can put a stress on the HDFS NameNode (which needs to keep track of the many datafiles) if that is going to be a general trend.

Hope this helps,

 Miklos

avatar
Community Manager

@wcg_hdp_manager Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks.


Regards,

Diana Torres,
Community Moderator


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community: