Support Questions

Find answers, ask questions, and share your expertise
Announcements
Check out our newest addition to the community, the Cloudera Data Analytics (CDA) group hub.

Hive partition and bucketing

Hi geeks,

I am creatting hive table using below commands

CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING) PARTITIONED BY(timestamp STRING) CLUSTERED BY(user_id) INTO 25 BUCKETS;

on daily basis I am collecting records from mysql to pasting it to HDFS and creating partiton ( using add partition command ).

1. after creating patition dynamically , will 25 files create for bucketing ?

2. what will happen my unique user_id more than 25 (say 40 ) . How this will distribute to 25 buckets ?

1 ACCEPTED SOLUTION

@Gobi Subramani

the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets.say for example if user_id (unique value 40)were an int, and there were 25 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc.user_id 26 will go in bucket 1 and so on..

View solution in original post

2 REPLIES 2

@Gobi Subramani

the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets.say for example if user_id (unique value 40)were an int, and there were 25 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc.user_id 26 will go in bucket 1 and so on..

Guru

This should be easy enough for you to test:

1. Insert values 1 to 40 for column user_id into table user_info_bucketed

2. Insert around 440 rows from 41 to 440

3. Ideally, each bucket should have about 19 rows, or around that

4. You can then check something like:

SELECT user_id,INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 5;
SELECT user_id,INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 50;
SELECT user_id,INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 101;
SELECT user_id,INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 160;

OR you can check the physical location of the file on HDFS to determine the line count.

Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.