Created 03-18-2016 09:01 PM
Created 03-19-2016 10:20 AM
They are actually quite different. Partitioning divides a table into subfolders that are skipped by the Optimizer based on the WHERE conditions of the table. They have a direct impact on how much data is being read.
The influence of Bucketing is more nuanced it essentially describes how many files are in each folder and has influence on a variety of Hive actions. I tried to describe it here.
https://community.hortonworks.com/questions/23103/hive-deciding-the-number-of-buckets.html
There is also a concept called Predicate pushdown which allows Hive ORC readers to skip parts of a ORC file based on an Index in the file, it sometimes plays together with bucketing. A good overview of this is here:
http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
Finally Hive has a jira to implement bucket pruning. This means Bucket files could be ignored by the Split generation without actually having to open the files ( more like partitioning ) but this is in the future. At the moment bucketing have pretty specialized usecases.
Created 03-18-2016 09:09 PM
I hope this link will help link
Created 03-20-2016 07:36 AM
thanks @Divakar Annapureddy
Created 03-19-2016 02:12 AM
See this
https://cwiki.apache.org/confluence/display/Hive/Tutorial
It is also a good idea to bucket the tables on certain columns so that efficient sampling queries can be executed against the data set.
If bucketing is absent, random sampling can still be done on the table but it is not efficient as the query has to scan all the data. The following example illustrates the case of the page_view table that is bucketed on the userid column:
The following example illustrates the case of the page_view table that is bucketed on the userid column:
CREATE TABLE page_view(viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'IP Address of the User') COMMENT 'This is the page view table' PARTITIONED BY(dt STRING, country STRING) CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '1' COLLECTION ITEMS TERMINATED BY '2' MAP KEYS TERMINATED BY '3' STORED AS SEQUENCEFILE; |
In the example above, the table is clustered by a hash function of userid into 32 buckets. Within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries with greater efficiency.
Created 03-20-2016 07:35 AM
Thanks @Neeraj Sabharwal,But still i have one question,As per above example the table page_view is partitioned by dt and country columns,means the table has now partitions based on these 2 columns.so for bucketing case,where does these 32 buckets gets created in each partition? please clear this doubt.
Thanks!!!
Created 03-19-2016 10:20 AM
They are actually quite different. Partitioning divides a table into subfolders that are skipped by the Optimizer based on the WHERE conditions of the table. They have a direct impact on how much data is being read.
The influence of Bucketing is more nuanced it essentially describes how many files are in each folder and has influence on a variety of Hive actions. I tried to describe it here.
https://community.hortonworks.com/questions/23103/hive-deciding-the-number-of-buckets.html
There is also a concept called Predicate pushdown which allows Hive ORC readers to skip parts of a ORC file based on an Index in the file, it sometimes plays together with bucketing. A good overview of this is here:
http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
Finally Hive has a jira to implement bucket pruning. This means Bucket files could be ignored by the Split generation without actually having to open the files ( more like partitioning ) but this is in the future. At the moment bucketing have pretty specialized usecases.
Created 03-20-2016 07:36 AM
thanks @Benjamin Leonhardi
Created 08-26-2016 07:17 AM
partition is two types,
1.static, 2.dynamic, splitting the data concept,
see this blog for more good example to learn