
what is the difference between partition(Static and Dynamic) and bucketing in Hive??

New Contributor

I want to understand the main difference between partitioning and bucketing in Hive.

I read that there are two kinds of partitioning, i.e. static and dynamic.

With static partitioning the data is partitioned manually: for the years 2000 to 2014, for example, we load 2000.csv, 2001.csv, etc. into their partitions ourselves.

With dynamic partitioning, two SET commands let Hive create and load the partitions from the query itself (SET hive.exec.dynamic.partition = true; SET hive.exec.dynamic.partition.mode = nonstrict;).

In that case, how is bucketing different from dynamic partitioning?

 


Re: what is the difference between partition(Static and Dynamic) and bucketing in Hive??

Contributor

Both partitioning and bucketing split a table's data into subsets stored under the table's directory. Here are the differences:

 

1.  You can specify multiple partition columns, for example if country and state are partition columns:

 

<table>/country=USA/state=CA/data

<table>/country=USA/state=TX/data
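
As a rough sketch of the layout above, a table partitioned on two columns could be declared like this. The table name customers and its data columns are made up for illustration; only country and state come from the example paths.

```sql
-- Hypothetical table: partition columns are declared separately
-- from the data columns and become directory levels.
CREATE TABLE customers (
  customer_id BIGINT,
  name        STRING
)
PARTITIONED BY (country STRING, state STRING);

-- Writing one row into a named partition creates the directory
-- .../customers/country=USA/state=CA/ under the table location.
INSERT INTO TABLE customers PARTITION (country = 'USA', state = 'CA')
VALUES (1, 'Alice');

-- Lists the partitions that currently exist for the table.
SHOW PARTITIONS customers;
```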

 

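Bucketing can also use more than one column (CLUSTERED BY accepts a column list), but buckets are numbered rather than named after column values. Each bucket is written as a file under the table (or partition) directory, shown schematically as:

<table>/<file for bucket 0>

<table>/<file for bucket 1>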

 

2.  The number of partition sub-directories is the number of distinct value combinations of the partition columns in the data.

In contrast, the number of buckets is fixed by the 'CLUSTERED BY (bucket_col) INTO n BUCKETS' clause. A row is assigned to a bucket by hash(bucket_col_val) % n.
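
To make the hash rule concrete, here is a minimal sketch with a made-up table users_bucketed and 4 buckets. The query uses Hive's built-in hash() and pmod() functions to estimate which bucket a row would fall into. Treat it as an approximation: the hash Hive actually uses when writing buckets depends on the Hive version and the table's bucketing version.

```sql
-- Hypothetical table: the bucket count is fixed at creation time.
CREATE TABLE users_bucketed (
  user_id INT,
  name    STRING
)
CLUSTERED BY (user_id) INTO 4 BUCKETS;

-- Estimate the bucket for each row: hash(bucket_col) mod n.
-- pmod keeps the result non-negative even if hash() is negative.
SELECT user_id,
       pmod(hash(user_id), 4) AS bucket_estimate
FROM users_bucketed;
```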

 

Thanks,

Szehon

Re: what is the difference between partition(Static and Dynamic) and bucketing in Hive??

Explorer

Hi.

I need some advice on the data below:

name     id (random num)   gender   dob
ram      101               M        19940215
rahim    103               ...      ...
shyam    161               ...      ...
...

Question:

On which columns should we define partitions and buckets?

 

 

 

 

Thanks

HadoopHelp

Re: what is the difference between partition(Static and Dynamic) and bucketing in Hive??

New Contributor

To choose columns for partitioning and bucketing, look at the cardinality of each field.

Cardinality is the number of distinct values in a field.

E.g. suppose a table has the fields country and employeeID.

Country will have at most 300 values (i.e. one country code per country),

and EmployeeID will have millions of distinct keys.

If you partition by EmployeeID, millions of folders will be created in HDFS, which is not a feasible layout for lookups. It also adds overhead on the Hive Metastore, which has to track that large number of partitions.

So in this case we use bucketing. It gives a fixed number of buckets across which the data is distributed by hash, roughly evenly when the keys are well spread.

For Country we can use partitioning, since it creates at most 300 folders in HDFS, which the Metastore can maintain without much overhead.

Summary:

If the cardinality (number of distinct values) of a field is low (Country), go for partitioning; if it is high (EmployeeID), go for bucketing.
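
The two techniques can be combined. A minimal sketch, assuming a hypothetical employees table: partition on the low-cardinality column and bucket on the high-cardinality one. The bucket count of 32 and the ORC format are arbitrary choices for the example.

```sql
-- Hypothetical table: one directory per country,
-- and within each country a fixed set of 32 buckets keyed by employee_id.
CREATE TABLE employees (
  employee_id BIGINT,
  name        STRING
)
PARTITIONED BY (country STRING)
CLUSTERED BY (employee_id) INTO 32 BUCKETS
STORED AS ORC;
```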

 

 

Re: what is the difference between partition(Static and Dynamic) and bucketing in Hive??

New Contributor

Static Partition in Hive

Loading input data files individually into a specific partition of a table is static partitioning.

Static partitions are usually preferred when loading (big) files into Hive tables.

Static partitioning loads data faster than dynamic partitioning, because the target partition is named in the statement.

You “statically” add a partition to the table and move the file into that partition.

Partitions added this way can be altered afterwards.
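
A minimal sketch of a static load, reusing the yearly CSV files from the question. The table name sales, its columns, and the HDFS paths are assumptions for illustration.

```sql
-- Hypothetical table partitioned by year, reading comma-separated text.
CREATE TABLE sales (
  id     INT,
  amount DOUBLE
)
PARTITIONED BY (sale_year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Static partition: the partition value is written in the statement,
-- and the file is moved into that partition's directory as-is.
LOAD DATA INPATH '/data/sales/2000.csv'
INTO TABLE sales PARTITION (sale_year = 2000);

-- A partition can also be registered explicitly over an existing directory.
ALTER TABLE sales ADD IF NOT EXISTS PARTITION (sale_year = 2001)
LOCATION '/data/sales/2001';
```

Note that LOAD DATA does not inspect the file contents, so it is up to you to make sure 2000.csv really holds only rows for the year 2000.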

Dynamic Partition in Hive

A single insert that populates several partitions of a partitioned table is dynamic partitioning.

Dynamic partitioning usually loads the data from a non-partitioned table.

Dynamic partitioning takes more time to load data than static partitioning.

Dynamic partitioning is suitable when a large amount of data is already stored in a table.

It is also suitable when you do not know in advance how many partitions (distinct values of the partition columns) the data will produce.
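
A minimal sketch of a dynamic partition insert, using the two SET commands from the question. The tables sales_staging and sales_by_year and their columns are hypothetical.

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Hypothetical non-partitioned source table.
CREATE TABLE sales_staging (
  id        INT,
  amount    DOUBLE,
  sale_year INT
);

-- Hypothetical partitioned target table.
CREATE TABLE sales_by_year (
  id     INT,
  amount DOUBLE
)
PARTITIONED BY (sale_year INT);

-- Dynamic partition: no value is given for sale_year in the PARTITION
-- clause. Hive takes it from the last column of the SELECT and creates
-- one partition per distinct value it finds.
INSERT OVERWRITE TABLE sales_by_year PARTITION (sale_year)
SELECT id, amount, sale_year
FROM sales_staging;
```

This is where it differs from bucketing: the number of partitions created here follows the data, while the number of buckets is fixed in the table definition.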