Support Questions

Find answers, ask questions, and share your expertise

Best Pratices for Hive Partitioning especially by Date

avatar

My question is how can we specify partitioning by month on a date field ? If our case, I don’t think partitioning on dates makes sense as we will have approximately 100,000 partitions and related files for this one table alone. Also, is there a severe performance penalty if there is a high number of partitions ? Does Hortonworks have a manual that goes over Hive table design tradeoffs, implications and limitations ?

1 ACCEPTED SOLUTION

avatar
Rising Star

Single String vs Nested Partitions

When creating partitions by date it is almost always more effective to partition by a single string of ‘YYYY-MM-DD’ rather than use a multi-depth partition with The Year, Months, and Days all as their own. The advantage to using the single string approach allows for more SQL operators to be utilized such as LIKE, IN, and BETWEEN which cannot be as easily used if one utilizes nested partitions.

Example:

Dates to Select: 2015-01-01, 2015-02-03,2016-01-01

Tables

Table A, Partitioned by DateStamp String as YYYY-MM-DD

Table B, Partitioned by YEAR INT, Month INT, Day INT

Queries on Table A

All Dates

SELECT * FROM TableA WHERE DateStamp IN (‘2015-01-01’, ‘2015-02-03’, ‘2016-01-01’)

Only 2015

SELECT * FROM TableA WHERE DateStamp LIKE ‘2015-%’

Only 2015 and February

SELECT * FROM TableA WHERE DateStamp LIKE ‘2015-02-%’

All Days that start/end with a 5

SELECT * FROM TableA WHERE DateStamp LIKE ‘%-%-%5’

All Days Between 2015-01-01 and 2015-03-01

SELECT * FROM TableA WHERE DateStamp BETWEEN ‘2015-01-01’ AND ‘2015-03-01’

Queries on Table B

All Dates

SELECT * FROM TableB WHERE (YEAR=2015 AND MONTH=01 AND DAY=01) OR (YEAR=2015 AND MONTH=02 AND DAY=03) OR (YEAR=2016 AND MONTH=01 AND DAY=01)

Only 2015

SELECT * FROM TableB WHERE YEAR=2015

Only 2015 and February

SELECT * FROM TableB WHERE YEAR=2015 and MONTH=02

All Days that start/end with a 5

SELECT * FROM TableB WHERE DAY LIKE ‘%5’

All Days Between 2015-01-01 and 2015-03-01

SELECT * FROM TableA WHERE YEAR=2015 and ((MONTH=01 OR MONTH=02) OR (MONTH =03 and DAY =01))

View solution in original post

5 REPLIES 5

avatar
Contributor

One way is to generate hash using the date and 1 or 2 primary keys and use it as partition column which should reduce the number of partitions.

avatar
Master Guru

100000 partitions would be too much. However what do you mean with "related files". A good best practice is to keep partitions under a couple thousand. ( till we have the HBase backed metastore )

However I would normally think date partition should be at most a couple thousand. Even ten years daily partitions would be only 3650. Just choose the granularity level accordingly.

Its hard to give a general Design however since it depends on your queries and data. Lots of partitions can be ok if you always only query ONE partition ( hive metastore will do a key lookup of the partition ). However if you would have tens of thousands of partitions and do a full table scan? Then you run into severe problems because it will take a couple seconds in the hiveserver and can fail in Tez because the Execution graph gets too big.

As usual these are things that need to be looked at in a ( and I hate this word ) holistic manner. Hard to give more than simple standard rules without understanding your data model and the queries you want to run.

The hash approach amalay mentioned can be good as well if you always only need to query one partition. I have seen a setup with a two level partitioning by date and customerid % xxx which gave hive the ability to only select a small subset of the data. However hopefully we will have bucket pruning soon which could be a good alternative.

I know I am shameless about it but for table setup I have made a pretty good reference here. It includes hints for optimal predicate pushdown as well

http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data

avatar

Partitioning by Month is very acceptable, especially if the data comes in on a monthly basis.

@Joseph Niemiec has written a great writeup on why you should use single Hive partitions like YYYYMMDD, YYYY-MM-DD, YYYYMM, YYYY-MM. Do this instead of nested partitions like YYYY/MM/DD or YYYY/MM.

The reason is fairly simple - it makes for simpler querying with LIKE, IN and BETWEEN... and the the Hive optimizer can do partition pruning on those queries.

avatar
Rising Star

Single String vs Nested Partitions

When creating partitions by date it is almost always more effective to partition by a single string of ‘YYYY-MM-DD’ rather than use a multi-depth partition with The Year, Months, and Days all as their own. The advantage to using the single string approach allows for more SQL operators to be utilized such as LIKE, IN, and BETWEEN which cannot be as easily used if one utilizes nested partitions.

Example:

Dates to Select: 2015-01-01, 2015-02-03,2016-01-01

Tables

Table A, Partitioned by DateStamp String as YYYY-MM-DD

Table B, Partitioned by YEAR INT, Month INT, Day INT

Queries on Table A

All Dates

SELECT * FROM TableA WHERE DateStamp IN (‘2015-01-01’, ‘2015-02-03’, ‘2016-01-01’)

Only 2015

SELECT * FROM TableA WHERE DateStamp LIKE ‘2015-%’

Only 2015 and February

SELECT * FROM TableA WHERE DateStamp LIKE ‘2015-02-%’

All Days that start/end with a 5

SELECT * FROM TableA WHERE DateStamp LIKE ‘%-%-%5’

All Days Between 2015-01-01 and 2015-03-01

SELECT * FROM TableA WHERE DateStamp BETWEEN ‘2015-01-01’ AND ‘2015-03-01’

Queries on Table B

All Dates

SELECT * FROM TableB WHERE (YEAR=2015 AND MONTH=01 AND DAY=01) OR (YEAR=2015 AND MONTH=02 AND DAY=03) OR (YEAR=2016 AND MONTH=01 AND DAY=01)

Only 2015

SELECT * FROM TableB WHERE YEAR=2015

Only 2015 and February

SELECT * FROM TableB WHERE YEAR=2015 and MONTH=02

All Days that start/end with a 5

SELECT * FROM TableB WHERE DAY LIKE ‘%5’

All Days Between 2015-01-01 and 2015-03-01

SELECT * FROM TableA WHERE YEAR=2015 and ((MONTH=01 OR MONTH=02) OR (MONTH =03 and DAY =01))

avatar
New Contributor

@Joseph Niemiec How can I do this command " select * from table where date <= '2017-12-08' " in nest partitions form?
In case the table is partitioned by day,month,year