Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Hive Bucket clarification

avatar
Expert Contributor

CREATETABLEbucketed_user( firstnameVARCHAR(64), lastnameVARCHAR(64), addressSTRING, cityVARCHAR(64), stateVARCHAR(64), postSTRING, phone1VARCHAR(64), phone2STRING, emailSTRING, webSTRING ) COMMENT'A bucketed sorted user table' PARTITIONEDBY(countryVARCHAR(64)) CLUSTEREDBY(state)SORTEDBY(city)INTO 32BUCKETS STORED ASSEQUENCEFILE;

could anybody tell what is the purpouse of CLUSTEREDBY(state)SORTEDBY(city) in bucket table creation?

1 ACCEPTED SOLUTION

avatar
hide-solution

This problem has been solved!

Want to get a detailed solution you have to login/registered on the community

Register/Login
7 REPLIES 7

avatar
hide-solution

This problem has been solved!

Want to get a detailed solution you have to login/registered on the community

Register/Login

avatar
Expert Contributor

Let us say i have 4 countries.each country has 10 states.Totally 32 buckets will be created and 32 files will be created on HDFS.But i have confusion with partition how many folders will be created where 32 files will be created .is it within each partition?

avatar

In Hive, each partition is physically a separate subdirectory under the table directory. Buckets would then be physically represented as separate files within those subdirectories. Using your example above where you have 4 countries and 32 buckets, this would result in 4 subdirectories under the table directory, each containing 32 files.

avatar
Expert Contributor

I tried example in the link:http://hadooptutorial.info/bucketing-in-hive/

I created four folders

country=AUdir2016-04-28 00:03rwxr-xr-xnareshsupergroup
country=CAdir2016-04-28 00:03rwxr-xr-xnareshsupergroup
country=UKdir2016-04-28 00:03rwxr-xr-xnareshsupergroup
country=USdir2016-04-28 00:03rwxr-xr-xnareshsupergroup
country=countrydir2016-04-28 00:03rwxr-xr-xnareshsupergroup

each folder contains 32 files.

clarifications:

1)How to select Bucket1 files in folder country=AU?

2)How to select Bucket1 files in folder country=country? and also why this folder is created it is partitioned by country so four folders should be created and what is this fifth folder?

avatar

hi @Tom McCuch and @vamsi valiveti. Just wanted to clarify - it is legal to have two bucketed tables where the number of buckets in one table is a multiple of the number of buckets in the other table, but for pragmatic performance reasons it is best to have the number of buckets be the same. IMHO If you are going to bucket your data, you are doing it because you need a more efficient join - and having a non-matching number of buckets removes that ability to do a sort-merge bucket join.

See this post on bucket join versus sort-merge bucket join. it's very good. http://stackoverflow.com/questions/20199077/hive-efficient-join-of-two-tables

avatar
Expert Contributor

1)How to select Bucket1 files in folder country=AU?could anybody provide sql query for that?

avatar
Contributor

You can do a table sample.

Select * from bucketed_user tablesample(bucket 1 out of 2 on state) where country = AU;