Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Please see the Cloudera blog for information on the Cloudera Response to CVE-2021-4428

Hive partition scenario and how it impacts performance

New Contributor

Hello, 

I want to ask regarding the hive partitions numbers and how they will impact performance. 

let me reflect this on a real example; 

I have am external table that is expecting to have around 500M rows per day from multiple sources, and it shall have 5 partition columns as following:

  1. start_date
  2. type_a
  3. type_b
  4. tybe_c
  5. datasource

for one day, that resulted in 250 partitions and expecting to have 1 year retention that will get around 75K.. which i suppose it is a huge number as when i checked, hive can go to 10K but after that the performance is going to be bad.. (and some one told me that partitions should not exceed 1K per table).

 

Mainly the queries that will select from this table

  • 50% of them shall use the exact order of partitions..
  • 25% shall use only 1-3 partitions and not using the other 2.
  • 25% only using 1st partition

So do you think even with 1 month retention this may work well? or only start date can be enough.. assuming normal distribution the other 4 columns ( let's say 500M/250 partitions, for which we shall have 2M row for each partition).

 

0 REPLIES 0