Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Partition Strategy for daily feeds

Partition Strategy for daily feeds

New Contributor

Is there a max limit o partitions in hive

If I create a table partitioned as below

with the partitions as (importdate,classifier, grouping,date)

where I can have 365 import dates, 1 classifier, 10 groupings, and again 365 businessdates

the numbe of partitiosn for a year is going to be huge? 365*1*10*365

HDFS structure

/importdate=2017-01-01/classifier=object/grouping=artifact/date=2017-01-01

/importdate=2017-01-02/classifier=object/grouping=entity/date=2017-01-02

...

/importdate=2017-02-01/classifier=object/grouping=artifact/date=2017-01-01

/importdate=2017-03-01/classifier=object/grouping=entiry/date=2017-01-03

4 REPLIES 4

Re: Partition Strategy for daily feeds

Guru

Two main comments: 1) that will be too many partitions and 2) consider flattened partitions over nested. (See links below).

Thus, you should concatenate your partition strings and break this up into more than one table (each table partition designed around a query pattern around where clause).

https://community.hortonworks.com/questions/29031/best-pratices-for-hive-partitioning-especially-by....

https://community.hortonworks.com/questions/61877/the-best-approach-to-the-thousands-of-small-partit...

Re: Partition Strategy for daily feeds

New Contributor

@ Greg Keys - the dataset is not small

Each day is 3-4 million rows partitioned by classifier=object, grouping=artifact and date=2017-01-01

I could merge everything into 1 folder which would remove the importdate partition so the total partition would be reduced

to 1 (/classifier=X/) * 10 (grouping=Y) * 365 (date=Z)

However I then lose the ability if an error occurs during merge to recover easily

With the existing partitioning If an import fails I can simply blow away that folder and redo ( then re-adding the partitions using add partition)

The date is grouped to allow queries such as

select * from table where classifier = 'object' and grouping='artificact' and businesdate ='2017-01-01'

or

select * from table where businesdate ='2017-01-01'

or

select * from table where businesdate ='2017-01-01' and table_column = 'XYZ'

and of course the select count(*) from table

Re: Partition Strategy for daily feeds

@Harshad Patel

Does the import dates and the business dates correlate? Or 1 import date can have data for 365 past business dates?

Re: Partition Strategy for daily feeds

New Contributor

There's no real correlation between the import date and the businesses, which contributes to the problem