- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Hive partition - partition column as part of the data?
- Labels:
-
Apache Hive
Created ‎11-14-2018 07:13 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
i have a table with two string columns and one datetime column (which is also defined as string datatype).
I want to partition the table on monthly basis ie month(the datetime column).
So I did the below :
create table test_part_tbl (id string, cd string, dttm string) partitioned by (mth string);
insert into test_part_tbl partition(mth) select id, cd, create_dt, month(create_dt) from real_table;
hive> select * from test_part_tbl ;
OK
test_part_tbl.id test_part_tbl.cd test_part_tbl.dttm test_part_tbl.mth
id1 cd1 2018-10-24 10
id2 cd1 2018-10-24 10
Time taken: 0.13 seconds, Fetched: 2 row(s)
So the month ie "10" is actually appearing as part of the table data. Is that correct?
Is it possible to partition the table as above and not have the partition column/value as part of the table data? ie when querying can't I use "month(dttm)" and search based on month?
Appreciate the insights.
Created ‎11-15-2018 02:03 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
So the month ie "10" is actually appearing as part of the table data. Is that correct?
Yes this is correct, when we create partition table we are going to have all partition columns at the end of the column list.
Partitions are going to boost the query performance when we are using partition column in out where clause.
Example:
if you want to count number of records are in mth=10 then
select count(*) from test_par_tbl where mth=10;
Now the above query won't do full table scan as predicate only scan the mth=10 partition and shows up the result. when dealing with 100's of million datasets partitions will be optimization techniques to boost up the query performances by avoiding full table scans.
2.Even with out partition field in where clause you can still able to run the below query but this will do full table scan
select count(*) from test_par_tbl where month(create_dt)=10;
Both these queries will give you same results but taking performance as consideration on big data sets first query will run more efficiently.
Is it possible to partition the table as above and not have the partition column/value as part of the table data?
This is not possible because if you won't have partition column as part of table data then hive will do full table scan on the entire dataset.
If you still want to take off the partition column from the dataset, then create a view on top of the partition_table it by excluding the column.
Created ‎11-15-2018 02:03 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
So the month ie "10" is actually appearing as part of the table data. Is that correct?
Yes this is correct, when we create partition table we are going to have all partition columns at the end of the column list.
Partitions are going to boost the query performance when we are using partition column in out where clause.
Example:
if you want to count number of records are in mth=10 then
select count(*) from test_par_tbl where mth=10;
Now the above query won't do full table scan as predicate only scan the mth=10 partition and shows up the result. when dealing with 100's of million datasets partitions will be optimization techniques to boost up the query performances by avoiding full table scans.
2.Even with out partition field in where clause you can still able to run the below query but this will do full table scan
select count(*) from test_par_tbl where month(create_dt)=10;
Both these queries will give you same results but taking performance as consideration on big data sets first query will run more efficiently.
Is it possible to partition the table as above and not have the partition column/value as part of the table data?
This is not possible because if you won't have partition column as part of table data then hive will do full table scan on the entire dataset.
If you still want to take off the partition column from the dataset, then create a view on top of the partition_table it by excluding the column.
Created ‎11-15-2018 04:42 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The below hive performance parameter - is it usually set within a map reduce program to be set at the time of execution :
SET hive.exec.parallel=true
Or can it be set at the global level in Ambari?
Appreciate the feedback.
Created ‎11-15-2018 04:49 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
login to hive server and run the above command
Created ‎11-15-2018 05:17 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Once you white list the param in ambari then you are able to set the parameter in hive cli.
Created ‎11-15-2018 05:53 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
actually I am working with cloudera now and i dont see hive.exec.parallel as a configurable option in cloudera manager.
Created ‎11-16-2018 05:54 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
if i partition a table by year - can i further bucket it by month?
so the idea is the year will be the top level and months will be at a level beneath it.
so the directory structure would be :
2018 -> 1, 2, 3 ... 12
2019 -> 1, 2, 3 ... 12
Is this what bucketing is about? Or should i be doing this someway with partitions itself?
Appreciate the insights.
Created ‎11-16-2018 06:11 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I tried this but wouldn't work :
create table test_part_bkt_tbl (id string, cd string, dttm string) partitioned by (yr string) clustered by (month(dttm)) into 12 buckets;
Created ‎11-16-2018 06:18 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
ok, this can be done simply as : partitioned by (yr string, mth string).
tks.
