Support Questions

Find answers, ask questions, and share your expertise

Hive partitions based on date from timestamp

avatar

I have a raw data where i have a column for timestamp.

 

Now i want to move the file to hive and create partions based on date. Daily partitions.

 

 

From timestamp data how we will create date partions?

 

 

 

 

1 ACCEPTED SOLUTION

avatar
Super Collaborator

The timestamp column is not "suitable" for a partition (unless you want thousands and thousand of partitions).

 

What is suitable :

- is to create an Hive table on top of the current not partitionned data,

- create a second Hive table for hosting the partitionned data (the same columns + the partition column),

- eventualy load the data from the first table to the second one using a query that will "parse" the timestamp column and extract what should be a suitable value for the partition column (for example the year or the year-and-the-month, ...).

 

Example :

 

INSERT INTO TABLE my_partitioned_table PARTITION (part_col_name) SELECT *, year(to_date(my_timestamp_column)) FROM my_not_partitioned_table;

 

You don't have to put the partition value in the insert statement if you enable dynamic partition in Hive.

set hive.exec.dynamic.partition=true;  
set hive.exec.dynamic.partition.mode=nonstrict;  

 

And on your sample it's not working properly because you didn't parse the timestamp column, you use it as is.

Each unique value will create a partition. For a timestamps, it's almost each value that is unique.

 

View solution in original post

3 REPLIES 3

avatar
Super Guru
You can create hive external table to link to the data in HDFS, and then write data into another table which will be partitioned by date.

Use date functions in Hive to convert timestamp to the value you want:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions

INSERT OVERWRITE TABLE partitioned_table PARTITION (date_column) SELECT ...., to_date(timestamp_column) as date_column FROM source_table;

Something like this.

avatar

Hi,

 

We have a timestamp in a table and we want o take date out of it then we can write a select statement to_date(timestamp column) from table name.

 

 

The question is while creating partition i dont have seperate date column in the table i have only timestamp column.

 

 

For static Partitions:

 

create external table hju(starttime string,h string,m string,mv double,country string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE location '/srf/ji/cana/'

 

create table hjuk(h string,m string,mv double,country string)partitioned by (starttime string) location '/hji/lo/p/'

 

INSERT into table tblename PARTITION(starttime='2017-08-09') SELECT h,m,mv,country from tblname where to_date(starttime)='2017-08-09'

 

 

For dynamic partitions:

 

 

SET hive.exec.dynamic.partition=true; SET hive.exec.dynamic.partition.mode=nonstrict;

 

create table tblename parg(h string,m string,mv double,country string)partitioned by (starttime string) location '/hiloi/kil'

INSERT overwrite table tblename PARTITION(starttime) SELECT h,m,mv,country ,starttime from tblename

 

 

INFO : Starting task [Stage-2:STATS] in serial mode
INFO : Partition naparg1{starttime=2017-08-01 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=26, rawDataSize=25]
INFO : Partition arg1{starttime=2017-08-02 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=25, rawDataSize=24]
INFO : Partition aparg1{starttime=2017-08-03 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=24, rawDataSize=23]
INFO : Partition arg1{starttime=2017-08-04 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=24, rawDataSize=23]
INFO : Partition arg1{starttime=2017-08-05 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=24, rawDataSize=23]
INFO : Partition arg1{starttime=2017-08-06 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=25, rawDataSize=24]
INFO : Partition arg1{starttime=2017-08-09 18:40:03.0} sta ts: [numFiles=1, numRows=7, totalSize=172, rawDataSize=165]

 

 

avatar
Super Collaborator

The timestamp column is not "suitable" for a partition (unless you want thousands and thousand of partitions).

 

What is suitable :

- is to create an Hive table on top of the current not partitionned data,

- create a second Hive table for hosting the partitionned data (the same columns + the partition column),

- eventualy load the data from the first table to the second one using a query that will "parse" the timestamp column and extract what should be a suitable value for the partition column (for example the year or the year-and-the-month, ...).

 

Example :

 

INSERT INTO TABLE my_partitioned_table PARTITION (part_col_name) SELECT *, year(to_date(my_timestamp_column)) FROM my_not_partitioned_table;

 

You don't have to put the partition value in the insert statement if you enable dynamic partition in Hive.

set hive.exec.dynamic.partition=true;  
set hive.exec.dynamic.partition.mode=nonstrict;  

 

And on your sample it's not working properly because you didn't parse the timestamp column, you use it as is.

Each unique value will create a partition. For a timestamps, it's almost each value that is unique.