Reply
Highlighted
Explorer
Posts: 28
Registered: ‎08-29-2017
Accepted Solution

Hive partitions based on date from timestamp

I have a raw data where i have a column for timestamp.

 

Now i want to move the file to hive and create partions based on date. Daily partitions.

 

 

From timestamp data how we will create date partions?

 

 

 

 

Cloudera Employee
Posts: 211
Registered: ‎03-23-2015

Re: Hive partitions based on date from timestamp

You can create hive external table to link to the data in HDFS, and then write data into another table which will be partitioned by date.

Use date functions in Hive to convert timestamp to the value you want:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions

INSERT OVERWRITE TABLE partitioned_table PARTITION (date_column) SELECT ...., to_date(timestamp_column) as date_column FROM source_table;

Something like this.
Explorer
Posts: 28
Registered: ‎08-29-2017

Re: Hive partitions based on date from timestamp

Hi,

 

We have a timestamp in a table and we want o take date out of it then we can write a select statement to_date(timestamp column) from table name.

 

 

The question is while creating partition i dont have seperate date column in the table i have only timestamp column.

 

 

For static Partitions:

 

create external table hju(starttime string,h string,m string,mv double,country string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE location '/srf/ji/cana/'

 

create table hjuk(h string,m string,mv double,country string)partitioned by (starttime string) location '/hji/lo/p/'

 

INSERT into table tblename PARTITION(starttime='2017-08-09') SELECT h,m,mv,country from tblname where to_date(starttime)='2017-08-09'

 

 

For dynamic partitions:

 

 

SET hive.exec.dynamic.partition=true; SET hive.exec.dynamic.partition.mode=nonstrict;

 

create table tblename parg(h string,m string,mv double,country string)partitioned by (starttime string) location '/hiloi/kil'

INSERT overwrite table tblename PARTITION(starttime) SELECT h,m,mv,country ,starttime from tblename

 

 

INFO : Starting task [Stage-2:STATS] in serial mode
INFO : Partition naparg1{starttime=2017-08-01 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=26, rawDataSize=25]
INFO : Partition arg1{starttime=2017-08-02 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=25, rawDataSize=24]
INFO : Partition aparg1{starttime=2017-08-03 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=24, rawDataSize=23]
INFO : Partition arg1{starttime=2017-08-04 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=24, rawDataSize=23]
INFO : Partition arg1{starttime=2017-08-05 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=24, rawDataSize=23]
INFO : Partition arg1{starttime=2017-08-06 18:40:03.0} sta ts: [numFiles=1, numRows=1, totalSize=25, rawDataSize=24]
INFO : Partition arg1{starttime=2017-08-09 18:40:03.0} sta ts: [numFiles=1, numRows=7, totalSize=172, rawDataSize=165]

 

 

Posts: 173
Topics: 8
Kudos: 19
Solutions: 19
Registered: ‎07-16-2015

Re: Hive partitions based on date from timestamp

[ Edited ]

The timestamp column is not "suitable" for a partition (unless you want thousands and thousand of partitions).

 

What is suitable :

- is to create an Hive table on top of the current not partitionned data,

- create a second Hive table for hosting the partitionned data (the same columns + the partition column),

- eventualy load the data from the first table to the second one using a query that will "parse" the timestamp column and extract what should be a suitable value for the partition column (for example the year or the year-and-the-month, ...).

 

Example :

 

INSERT INTO TABLE my_partitioned_table PARTITION (part_col_name) SELECT *, year(to_date(my_timestamp_column)) FROM my_not_partitioned_table;

 

You don't have to put the partition value in the insert statement if you enable dynamic partition in Hive.

set hive.exec.dynamic.partition=true;  
set hive.exec.dynamic.partition.mode=nonstrict;  

 

And on your sample it's not working properly because you didn't parse the timestamp column, you use it as is.

Each unique value will create a partition. For a timestamps, it's almost each value that is unique.

 

Announcements