Support Questions

Find answers, ask questions, and share your expertise

Hive partition - partition column as part of the data?

avatar
Rising Star

i have a table with two string columns and one datetime column (which is also defined as string datatype).

I want to partition the table on monthly basis ie month(the datetime column).

So I did the below :

create table test_part_tbl (id string, cd string, dttm string) partitioned by (mth string);

insert into test_part_tbl partition(mth) select id, cd, create_dt, month(create_dt) from real_table;

hive> select * from test_part_tbl ;

OK

test_part_tbl.id test_part_tbl.cd test_part_tbl.dttm test_part_tbl.mth

id1 cd1 2018-10-24 10

id2 cd1 2018-10-24 10

Time taken: 0.13 seconds, Fetched: 2 row(s)

So the month ie "10" is actually appearing as part of the table data. Is that correct?

Is it possible to partition the table as above and not have the partition column/value as part of the table data? ie when querying can't I use "month(dttm)" and search based on month?

Appreciate the insights.

1 ACCEPTED SOLUTION

avatar
Master Guru
@n c

So the month ie "10" is actually appearing as part of the table data. Is that correct?

Yes this is correct, when we create partition table we are going to have all partition columns at the end of the column list.

Partitions are going to boost the query performance when we are using partition column in out where clause.

Example:

if you want to count number of records are in mth=10 then

select count(*) from test_par_tbl where mth=10;

Now the above query won't do full table scan as predicate only scan the mth=10 partition and shows up the result. when dealing with 100's of million datasets partitions will be optimization techniques to boost up the query performances by avoiding full table scans.

2.Even with out partition field in where clause you can still able to run the below query but this will do full table scan

select count(*) from test_par_tbl where month(create_dt)=10;

Both these queries will give you same results but taking performance as consideration on big data sets first query will run more efficiently.

Is it possible to partition the table as above and not have the partition column/value as part of the table data?

This is not possible because if you won't have partition column as part of table data then hive will do full table scan on the entire dataset.

If you still want to take off the partition column from the dataset, then create a view on top of the partition_table it by excluding the column.

View solution in original post

8 REPLIES 8

avatar
Master Guru
@n c

So the month ie "10" is actually appearing as part of the table data. Is that correct?

Yes this is correct, when we create partition table we are going to have all partition columns at the end of the column list.

Partitions are going to boost the query performance when we are using partition column in out where clause.

Example:

if you want to count number of records are in mth=10 then

select count(*) from test_par_tbl where mth=10;

Now the above query won't do full table scan as predicate only scan the mth=10 partition and shows up the result. when dealing with 100's of million datasets partitions will be optimization techniques to boost up the query performances by avoiding full table scans.

2.Even with out partition field in where clause you can still able to run the below query but this will do full table scan

select count(*) from test_par_tbl where month(create_dt)=10;

Both these queries will give you same results but taking performance as consideration on big data sets first query will run more efficiently.

Is it possible to partition the table as above and not have the partition column/value as part of the table data?

This is not possible because if you won't have partition column as part of table data then hive will do full table scan on the entire dataset.

If you still want to take off the partition column from the dataset, then create a view on top of the partition_table it by excluding the column.

avatar
Rising Star

The below hive performance parameter - is it usually set within a map reduce program to be set at the time of execution :

SET hive.exec.parallel=true

Or can it be set at the global level in Ambari?

Appreciate the feedback.

avatar
Explorer

login to hive server and run the above command

avatar
Master Guru

@n c

Once you white list the param in ambari then you are able to set the parameter in hive cli.

avatar
Rising Star

actually I am working with cloudera now and i dont see hive.exec.parallel as a configurable option in cloudera manager.

avatar
Rising Star

if i partition a table by year - can i further bucket it by month?

so the idea is the year will be the top level and months will be at a level beneath it.

so the directory structure would be :

2018 -> 1, 2, 3 ... 12

2019 -> 1, 2, 3 ... 12

Is this what bucketing is about? Or should i be doing this someway with partitions itself?

Appreciate the insights.

avatar
Rising Star

I tried this but wouldn't work :

create table test_part_bkt_tbl (id string, cd string, dttm string) partitioned by (yr string) clustered by (month(dttm)) into 12 buckets;

avatar
Rising Star

ok, this can be done simply as : partitioned by (yr string, mth string).

tks.