Created 11-14-2018 07:13 PM
i have a table with two string columns and one datetime column (which is also defined as string datatype).
I want to partition the table on monthly basis ie month(the datetime column).
So I did the below :
create table test_part_tbl (id string, cd string, dttm string) partitioned by (mth string);
insert into test_part_tbl partition(mth) select id, cd, create_dt, month(create_dt) from real_table;
hive> select * from test_part_tbl ;
OK
test_part_tbl.id test_part_tbl.cd test_part_tbl.dttm test_part_tbl.mth
id1 cd1 2018-10-24 10
id2 cd1 2018-10-24 10
Time taken: 0.13 seconds, Fetched: 2 row(s)
So the month ie "10" is actually appearing as part of the table data. Is that correct?
Is it possible to partition the table as above and not have the partition column/value as part of the table data? ie when querying can't I use "month(dttm)" and search based on month?
Appreciate the insights.
Created 11-15-2018 02:03 AM
So the month ie "10" is actually appearing as part of the table data. Is that correct?
Yes this is correct, when we create partition table we are going to have all partition columns at the end of the column list.
Partitions are going to boost the query performance when we are using partition column in out where clause.
Example:
if you want to count number of records are in mth=10 then
select count(*) from test_par_tbl where mth=10;
Now the above query won't do full table scan as predicate only scan the mth=10 partition and shows up the result. when dealing with 100's of million datasets partitions will be optimization techniques to boost up the query performances by avoiding full table scans.
2.Even with out partition field in where clause you can still able to run the below query but this will do full table scan
select count(*) from test_par_tbl where month(create_dt)=10;
Both these queries will give you same results but taking performance as consideration on big data sets first query will run more efficiently.
Is it possible to partition the table as above and not have the partition column/value as part of the table data?
This is not possible because if you won't have partition column as part of table data then hive will do full table scan on the entire dataset.
If you still want to take off the partition column from the dataset, then create a view on top of the partition_table it by excluding the column.
Created 11-15-2018 02:03 AM
So the month ie "10" is actually appearing as part of the table data. Is that correct?
Yes this is correct, when we create partition table we are going to have all partition columns at the end of the column list.
Partitions are going to boost the query performance when we are using partition column in out where clause.
Example:
if you want to count number of records are in mth=10 then
select count(*) from test_par_tbl where mth=10;
Now the above query won't do full table scan as predicate only scan the mth=10 partition and shows up the result. when dealing with 100's of million datasets partitions will be optimization techniques to boost up the query performances by avoiding full table scans.
2.Even with out partition field in where clause you can still able to run the below query but this will do full table scan
select count(*) from test_par_tbl where month(create_dt)=10;
Both these queries will give you same results but taking performance as consideration on big data sets first query will run more efficiently.
Is it possible to partition the table as above and not have the partition column/value as part of the table data?
This is not possible because if you won't have partition column as part of table data then hive will do full table scan on the entire dataset.
If you still want to take off the partition column from the dataset, then create a view on top of the partition_table it by excluding the column.
Created 11-15-2018 04:42 PM
The below hive performance parameter - is it usually set within a map reduce program to be set at the time of execution :
SET hive.exec.parallel=true
Or can it be set at the global level in Ambari?
Appreciate the feedback.
Created 11-15-2018 04:49 PM
login to hive server and run the above command
Created 11-15-2018 05:17 PM
Once you white list the param in ambari then you are able to set the parameter in hive cli.
Created 11-15-2018 05:53 PM
actually I am working with cloudera now and i dont see hive.exec.parallel as a configurable option in cloudera manager.
Created 11-16-2018 05:54 PM
if i partition a table by year - can i further bucket it by month?
so the idea is the year will be the top level and months will be at a level beneath it.
so the directory structure would be :
2018 -> 1, 2, 3 ... 12
2019 -> 1, 2, 3 ... 12
Is this what bucketing is about? Or should i be doing this someway with partitions itself?
Appreciate the insights.
Created 11-16-2018 06:11 PM
I tried this but wouldn't work :
create table test_part_bkt_tbl (id string, cd string, dttm string) partitioned by (yr string) clustered by (month(dttm)) into 12 buckets;
Created 11-16-2018 06:18 PM
ok, this can be done simply as : partitioned by (yr string, mth string).
tks.