Created 11-16-2016 07:07 AM
Hi geeks,
I am creatting hive table using below commands
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING) PARTITIONED BY(timestamp STRING) CLUSTERED BY(user_id) INTO 25 BUCKETS;
on daily basis I am collecting records from mysql to pasting it to HDFS and creating partiton ( using add partition command ).
1. after creating patition dynamically , will 25 files create for bucketing ?
2. what will happen my unique user_id more than 25 (say 40 ) . How this will distribute to 25 buckets ?
Created 11-16-2016 07:21 AM
Want to get a detailed solution you have to login/registered on the community
Register/LoginCreated 11-16-2016 07:21 AM
Want to get a detailed solution you have to login/registered on the community
Register/LoginCreated 11-16-2016 07:23 AM
This should be easy enough for you to test:
1. Insert values 1 to 40 for column user_id into table user_info_bucketed
2. Insert around 440 rows from 41 to 440
3. Ideally, each bucket should have about 19 rows, or around that
4. You can then check something like:
SELECT user_id,INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 5; SELECT user_id,INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 50; SELECT user_id,INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 101; SELECT user_id,INPUT__FILE__NAME FROM user_info_bucketed WHERE user_id = 160;
OR you can check the physical location of the file on HDFS to determine the line count.