Created 11-16-2016 07:07 AM
Hi geeks,
I am creating a Hive table using the command below:
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING) PARTITIONED BY(timestamp STRING) CLUSTERED BY(user_id) INTO 25 BUCKETS;
On a daily basis I collect records from MySQL, push them to HDFS, and create a partition (using the ADD PARTITION command).
1. After creating the partition dynamically, will 25 files be created per partition for bucketing?
2. What happens when there are more than 25 unique user_id values (say 40)? How will they be distributed across the 25 buckets?
Created 11-16-2016 07:21 AM
The bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. For an integer column the hash is the value itself, so with 25 buckets the bucket is simply user_id mod 25. That means user_id 26 lands in the same bucket as user_id 1 (26 mod 25 = 1), user_id 27 with user_id 2, and so on; 40 unique user_ids will simply wrap around the 25 buckets, with some buckets holding two values.
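For illustration, here is a minimal Python sketch of that modulo logic. It assumes (as Hive does for integer types) that the hash of an int is the value itself; the 25-bucket count comes from the DDL in the question.

```python
NUM_BUCKETS = 25  # matches INTO 25 BUCKETS in the CREATE TABLE above

def bucket_for(user_id: int) -> int:
    # For an integer column, Hive's hash is the value itself,
    # so the bucket index is simply user_id mod num_buckets.
    return user_id % NUM_BUCKETS

# 40 unique user_ids wrap around the 25 buckets:
print(bucket_for(1))   # bucket index 1
print(bucket_for(26))  # 26 mod 25 = 1 -> same bucket as user_id 1
print(bucket_for(40))  # 40 mod 25 = 15
```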
Created 11-16-2016 07:23 AM
This should be easy enough for you to test:
1. Insert values 1 to 40 for column user_id into table user_info_bucketed.
2. Then insert rows with user_id 41 to 440, for 440 rows in total.
3. Ideally, each bucket should end up with about 17-18 rows (440 / 25 = 17.6).
4. You can then check something like:
SELECT user_id, INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 5;
SELECT user_id, INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 50;
SELECT user_id, INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 101;
SELECT user_id, INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 160;
Or you can check the physical bucket files on HDFS and count the lines in each one.
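Under the same assumption (bucket = user_id mod 25 for an integer column), you can predict the distribution of the 440 test rows before running anything in Hive. This is a sketch of the expected spread, not output from an actual Hive run:

```python
from collections import Counter

NUM_BUCKETS = 25

# Simulate bucket assignment for user_id values 1..440,
# assuming hash(int) == int as Hive does for integer columns.
counts = Counter(user_id % NUM_BUCKETS for user_id in range(1, 441))

# 440 = 17 * 25 + 15, so 15 buckets get 18 rows and 10 buckets get 17.
print(min(counts.values()), max(counts.values()))  # -> 17 18
```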