Support Questions
Find answers, ask questions, and share your expertise

Hive partition and bucketing

Solved

Hi geeks,

I am creating a Hive table using the command below:

CREATE TABLE user_info_bucketed (user_id BIGINT, firstname STRING, lastname STRING)
PARTITIONED BY (`timestamp` STRING)
CLUSTERED BY (user_id) INTO 25 BUCKETS;

On a daily basis I collect records from MySQL, push them to HDFS, and create a partition (using the ADD PARTITION command).

1. After adding a partition this way, will 25 bucket files be created automatically?

2. What happens when there are more than 25 unique user_id values (say 40)? How will they be distributed across the 25 buckets?
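A sketch of the daily flow described above, with placeholder dates, paths, and a hypothetical staging table. One caveat worth noting: ALTER TABLE ... ADD PARTITION only registers files that already exist on HDFS; Hive writes the 25 bucket files only when the data is loaded through an INSERT with bucketing enforced:

```sql
-- Enforce bucketing on writes (needed before Hive 2.x; always on later)
SET hive.enforce.bucketing = true;

-- Registering pre-existing files does NOT re-bucket them:
ALTER TABLE user_info_bucketed
  ADD PARTITION (`timestamp` = '2016-05-01')
  LOCATION '/data/user_info/2016-05-01';   -- hypothetical path

-- To actually get 25 bucket files, load the day's data through Hive:
INSERT OVERWRITE TABLE user_info_bucketed
  PARTITION (`timestamp` = '2016-05-01')
SELECT user_id, firstname, lastname
FROM staging_user_info;                    -- hypothetical staging table
```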

1 ACCEPTED SOLUTION


Re: Hive partition and bucketing

@Gobi Subramani

The bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. For an integer column the hash is simply the value itself, so with 25 buckets the bucket index is user_id mod 25: user_id 26 gives 26 mod 25 = 1 and lands in the second bucket, user_id 40 gives 40 mod 25 = 15, and so on. Having more than 25 distinct user_id values is not a problem; ids whose values differ by a multiple of 25 simply share a bucket.
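The formula can be checked directly in Hive using the built-in hash() and pmod() functions (assuming user_id is an integer, for which hash() returns the value itself):

```sql
-- Bucket index for each user_id under 25 buckets:
SELECT user_id,
       pmod(hash(user_id), 25) AS bucket_index
FROM user_info_bucketed;
-- e.g. user_id 26: pmod(26, 25) = 1, i.e. the second bucket
```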




Re: Hive partition and bucketing


This should be easy enough for you to test:

1. Insert rows with user_id values 1 to 40 into table user_info_bucketed

2. Insert more rows with user_id values from 41 to 440

3. Ideally, each bucket should end up with roughly 440 / 25 ≈ 18 rows

4. You can then check bucket placement with queries like:

SELECT user_id, INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 5;
SELECT user_id, INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 50;
SELECT user_id, INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 101;
SELECT user_id, INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 160;

Alternatively, you can inspect the physical bucket files on HDFS and count the lines in each one.
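For example, the bucket files for one partition can be listed from inside the Hive CLI (the warehouse path below is an assumption; adjust it to your cluster):

```sql
-- List the bucket files for one partition:
dfs -ls /apps/hive/warehouse/user_info_bucketed/timestamp=2016-05-01;
-- With 25 buckets, expect files named roughly 000000_0 through 000024_0.
```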
