Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

what is basic difference between Partitioning and Bucketing in Hive? What should be the initial design approach if we are applying it on any table?

avatar
Expert Contributor
 
1 ACCEPTED SOLUTION

avatar
Master Guru

They are actually quite different. Partitioning divides a table into subfolders that are skipped by the Optimizer based on the WHERE conditions of the table. They have a direct impact on how much data is being read.

The influence of Bucketing is more nuanced it essentially describes how many files are in each folder and has influence on a variety of Hive actions. I tried to describe it here.

https://community.hortonworks.com/questions/23103/hive-deciding-the-number-of-buckets.html

There is also a concept called Predicate pushdown which allows Hive ORC readers to skip parts of a ORC file based on an Index in the file, it sometimes plays together with bucketing. A good overview of this is here:

http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data

Finally Hive has a jira to implement bucket pruning. This means Bucket files could be ignored by the Split generation without actually having to open the files ( more like partitioning ) but this is in the future. At the moment bucketing have pretty specialized usecases.

View solution in original post

7 REPLIES 7

avatar
@Mahesh Deshmukh

I hope this link will help link

avatar
Expert Contributor

avatar
Master Mentor
@Mahesh Deshmukh

See this

https://cwiki.apache.org/confluence/display/Hive/Tutorial

It is also a good idea to bucket the tables on certain columns so that efficient sampling queries can be executed against the data set.

If bucketing is absent, random sampling can still be done on the table but it is not efficient as the query has to scan all the data. The following example illustrates the case of the page_view table that is bucketed on the userid column:

The following example illustrates the case of the page_view table that is bucketed on the userid column:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '1'
COLLECTION ITEMS TERMINATED BY '2'
MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;

In the example above, the table is clustered by a hash function of userid into 32 buckets. Within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries with greater efficiency.

avatar
Expert Contributor

Thanks @Neeraj Sabharwal,But still i have one question,As per above example the table page_view is partitioned by dt and country columns,means the table has now partitions based on these 2 columns.so for bucketing case,where does these 32 buckets gets created in each partition? please clear this doubt.

Thanks!!!

avatar
Master Guru

They are actually quite different. Partitioning divides a table into subfolders that are skipped by the Optimizer based on the WHERE conditions of the table. They have a direct impact on how much data is being read.

The influence of Bucketing is more nuanced it essentially describes how many files are in each folder and has influence on a variety of Hive actions. I tried to describe it here.

https://community.hortonworks.com/questions/23103/hive-deciding-the-number-of-buckets.html

There is also a concept called Predicate pushdown which allows Hive ORC readers to skip parts of a ORC file based on an Index in the file, it sometimes plays together with bucketing. A good overview of this is here:

http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data

Finally Hive has a jira to implement bucket pruning. This means Bucket files could be ignored by the Split generation without actually having to open the files ( more like partitioning ) but this is in the future. At the moment bucketing have pretty specialized usecases.

avatar
Expert Contributor

avatar
New Contributor

partition is two types,

1.static, 2.dynamic, splitting the data concept,

see this blog for more good example to learn

hive partition and bucketing