Created on 11-01-2017 01:51 AM - last edited on 09-16-2022 05:28 AM by kh-asen
I have a raw data where i have a column for timestamp.
Now i want to move the file to hive and create partions based on date. Daily partitions.
From timestamp data how we will create date partions?
Created on 11-03-2017 06:53 AM - edited 11-03-2017 07:02 AM
The timestamp column is not "suitable" for a partition (unless you want thousands and thousand of partitions).
What is suitable :
- is to create an Hive table on top of the current not partitionned data,
- create a second Hive table for hosting the partitionned data (the same columns + the partition column),
- eventualy load the data from the first table to the second one using a query that will "parse" the timestamp column and extract what should be a suitable value for the partition column (for example the year or the year-and-the-month, ...).
Example :
INSERT INTO TABLE my_partitioned_table PARTITION (part_col_name) SELECT *, year(to_date(my_timestamp_column)) FROM my_not_partitioned_table;
You don't have to put the partition value in the insert statement if you enable dynamic partition in Hive.
set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict;
And on your sample it's not working properly because you didn't parse the timestamp column, you use it as is.
Each unique value will create a partition. For a timestamps, it's almost each value that is unique.
Created 11-01-2017 08:09 PM
Created 11-02-2017 08:36 PM
Hi,
We have a timestamp in a table and we want o take date out of it then we can write a select statement to_date(timestamp column) from table name.
The question is while creating partition i dont have seperate date column in the table i have only timestamp column.
For static Partitions:
create external table hju(starttime string,h string,m string,mv double,country string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE location '/srf/ji/cana/'
create table hjuk(h string,m string,mv double,country string)partitioned by (starttime string) location '/hji/lo/p/'
INSERT into table tblename PARTITION(starttime='2017-08-09') SELECT h,m,mv,country from tblname where to_date(starttime)='2017-08-09'
For dynamic partitions:
SET hive.exec.dynamic.partition=true; SET hive.exec.dynamic.partition.mode=nonstrict;
create table tblename parg(h string,m string,mv double,country string)partitioned by (starttime string) location '/hiloi/kil'
INSERT overwrite table tblename PARTITION(starttime) SELECT h,m,mv,country ,starttime from tblename
INFO : Starting task [Stage-2:STATS] in serial mode
INFO : Partition naparg1{starttime=2017-08-01 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=26, rawDataSize=25]
INFO : Partition arg1{starttime=2017-08-02 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=25, rawDataSize=24]
INFO : Partition aparg1{starttime=2017-08-03 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=24, rawDataSize=23]
INFO : Partition arg1{starttime=2017-08-04 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=24, rawDataSize=23]
INFO : Partition arg1{starttime=2017-08-05 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=24, rawDataSize=23]
INFO : Partition arg1{starttime=2017-08-06 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=25, rawDataSize=24]
INFO : Partition arg1{starttime=2017-08-09 18:40:03.0} sta ts: [numFiles=1, numRows=7, totalSize=172, rawDataSize=165]
Created on 11-03-2017 06:53 AM - edited 11-03-2017 07:02 AM
The timestamp column is not "suitable" for a partition (unless you want thousands and thousand of partitions).
What is suitable :
- is to create an Hive table on top of the current not partitionned data,
- create a second Hive table for hosting the partitionned data (the same columns + the partition column),
- eventualy load the data from the first table to the second one using a query that will "parse" the timestamp column and extract what should be a suitable value for the partition column (for example the year or the year-and-the-month, ...).
Example :
INSERT INTO TABLE my_partitioned_table PARTITION (part_col_name) SELECT *, year(to_date(my_timestamp_column)) FROM my_not_partitioned_table;
You don't have to put the partition value in the insert statement if you enable dynamic partition in Hive.
set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict;
And on your sample it's not working properly because you didn't parse the timestamp column, you use it as is.
Each unique value will create a partition. For a timestamps, it's almost each value that is unique.