Created 11-16-2016 07:07 AM
Hi geeks,
I am creating a Hive table using the command below:
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING) PARTITIONED BY(timestamp STRING) CLUSTERED BY(user_id) INTO 25 BUCKETS;
On a daily basis I collect records from MySQL, push them to HDFS, and create a partition (using the ADD PARTITION command).
1. After creating the partition dynamically, will 25 files be created per partition for bucketing?
2. What happens when there are more than 25 unique user_id values (say 40)? How will they be distributed across the 25 buckets?
Created 11-16-2016 07:21 AM
The bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. For an integer column the hash is the value itself, so with 25 buckets the bucket is simply user_id mod 25. That means user_id 26 lands in the same bucket as user_id 1 (26 mod 25 = 1), user_id 27 with user_id 2, and so on; 40 unique user_ids will simply wrap around the 25 buckets, with some buckets holding two values.
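For illustration, here is a minimal Python sketch of that modulo logic. It assumes (as Hive does for integer types) that the hash of an int is the value itself; the 25-bucket count comes from the DDL in the question.

```python
NUM_BUCKETS = 25  # matches INTO 25 BUCKETS in the CREATE TABLE above

def bucket_for(user_id: int) -> int:
    # For an integer column, Hive's hash is the value itself,
    # so the bucket index is simply user_id mod num_buckets.
    return user_id % NUM_BUCKETS

# 40 unique user_ids wrap around the 25 buckets:
print(bucket_for(1))   # bucket index 1
print(bucket_for(26))  # 26 mod 25 = 1 -> same bucket as user_id 1
print(bucket_for(40))  # 40 mod 25 = 15
```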
Created 11-16-2016 07:23 AM
This should be easy enough for you to test:
1. Insert values 1 to 40 for column user_id into table user_info_bucketed.
2. Then insert rows with user_id 41 to 440, for 440 rows in total.
3. Ideally, each bucket should end up with about 17-18 rows (440 / 25 = 17.6).
4. You can then check something like:
SELECT user_id, INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 5;
SELECT user_id, INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 50;
SELECT user_id, INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 101;
SELECT user_id, INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 160;
Or you can check the physical bucket files on HDFS and count the lines in each one.
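Under the same assumption (bucket = user_id mod 25 for an integer column), you can predict the distribution of the 440 test rows before running anything in Hive. This is a sketch of the expected spread, not output from an actual Hive run:

```python
from collections import Counter

NUM_BUCKETS = 25

# Simulate bucket assignment for user_id values 1..440,
# assuming hash(int) == int as Hive does for integer columns.
counts = Counter(user_id % NUM_BUCKETS for user_id in range(1, 441))

# 440 = 17 * 25 + 15, so 15 buckets get 18 rows and 10 buckets get 17.
print(min(counts.values()), max(counts.values()))  # -> 17 18
```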