Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Hive partition and bucketing

avatar
Rising Star

Hi geeks,

I am creatting hive table using below commands

CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING) PARTITIONED BY(timestamp STRING) CLUSTERED BY(user_id) INTO 25 BUCKETS;

on daily basis I am collecting records from mysql to pasting it to HDFS and creating partiton ( using add partition command ).

1. after creating patition dynamically , will 25 files create for bucketing ?

2. what will happen my unique user_id more than 25 (say 40 ) . How this will distribute to 25 buckets ?

1 ACCEPTED SOLUTION

avatar
Super Guru

@Gobi Subramani

the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets.say for example if user_id (unique value 40)were an int, and there were 25 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc.user_id 26 will go in bucket 1 and so on..

View solution in original post

2 REPLIES 2

avatar
Super Guru

@Gobi Subramani

the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets.say for example if user_id (unique value 40)were an int, and there were 25 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc.user_id 26 will go in bucket 1 and so on..

avatar
Guru

This should be easy enough for you to test:

1. Insert values 1 to 40 for column user_id into table user_info_bucketed

2. Insert around 440 rows from 41 to 440

3. Ideally, each bucket should have about 19 rows, or around that

4. You can then check something like:

SELECT user_id,INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 5;
SELECT user_id,INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 50;
SELECT user_id,INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 101;
SELECT user_id,INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 160;

OR you can check the physical location of the file on HDFS to determine the line count.