Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How bucketing helps in case of more than two tables ?

avatar
New Contributor

Hello,

 

I have a question regarding hive bucketed tables (bucketed only no partitions) optimization.

 

Now, I know that if we have table A and table B and we want to join them using A.COL1 and B.COL1 ( A.COL1 = B.COL1), we should bucket both table A and B on col1 into same no of buckets or its multiple.

 

But what if we have more than 2 tables ?

 

for example, I have table A and I want to join it on table B and table C.

 

table A is joined with table B using COL1 (A.COL1 = B.COL1) and table A is joined with table C using COL2 (A.COL2 = C.COL2 )

 

what columns should I cluster by for table A ??? is it bucket by COL1 and COL2 (clustered by col1,col2)?

 

 

in summary, how can I optimize one table if it is joined with more than one table using buckets only.

 

Thanks in advance.

 

 

 

1 ACCEPTED SOLUTION

avatar
Master Collaborator

Hello @Soa 

Hive partition divides the table into a number of partitions and these partitions can be further subdivided into more manageable parts known as Buckets or Clusters. The Bucketing concept is based on Hash function, which depends on the type of the bucketing column. Records which are bucketed by the same column will always be saved in the same bucket.

 

The Bucketing concept is based on Hash function, which depends on the type of the bucketing column. Records which are bucketed by the same column will always be saved in the same bucket. Here, CLUSTERED BY clause is used to divide the table into buckets. each partition will be created as a directory. But in Hive Buckets, each bucket will be created as a file. Bucketing can also be done even without partitioning on Hive tables.

 

Bucketed tables allow much more efficient sampling than the non-bucketed tables. Allowing queries on a section of data for testing and debugging purpose when the original data sets are very huge. Here, the user can fix the size of buckets according to the need. This concept also provides the flexibility to keep the records in each bucket to be sorted by one or more columns. Since the data files are equal sized parts, map-side joins will be faster on the bucketed tables.

 

Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.

View solution in original post

1 REPLY 1

avatar
Master Collaborator

Hello @Soa 

Hive partition divides the table into a number of partitions and these partitions can be further subdivided into more manageable parts known as Buckets or Clusters. The Bucketing concept is based on Hash function, which depends on the type of the bucketing column. Records which are bucketed by the same column will always be saved in the same bucket.

 

The Bucketing concept is based on Hash function, which depends on the type of the bucketing column. Records which are bucketed by the same column will always be saved in the same bucket. Here, CLUSTERED BY clause is used to divide the table into buckets. each partition will be created as a directory. But in Hive Buckets, each bucket will be created as a file. Bucketing can also be done even without partitioning on Hive tables.

 

Bucketed tables allow much more efficient sampling than the non-bucketed tables. Allowing queries on a section of data for testing and debugging purpose when the original data sets are very huge. Here, the user can fix the size of buckets according to the need. This concept also provides the flexibility to keep the records in each bucket to be sorted by one or more columns. Since the data files are equal sized parts, map-side joins will be faster on the bucketed tables.

 

Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.