Support Questions

Find answers, ask questions, and share your expertise

Hive partitions based on date from timestamp

avatar
Rising Star
 
1 ACCEPTED SOLUTION

avatar
Master Guru

@Gayathri Devi

First you need to create a hive non partition table on raw data.

Then you need to create partition table in hive then insert from non partition table to partition table.

For testing i have tried an example as below:-

Right now my hive normal table(i.e not partition table) having these list of records.

Normal table(without partition column):-

hive# create table text_table(id int, dt string,name string) stored as textfile location '/user/yashu/text_table';
hive# select * from text_table;
+----------------+----------------------+------------------+--+
| text_table.id  |    text_table.dt     | text_table.name  |
+----------------+----------------------+------------------+--+
| 1              | 2017-10-31 10:12:09  | foo              |
| 1              | 2017-10-31 12:12:09  | bar              |
| 1              | 2017-10-30 12:12:09  | foobar           |
| 1              | 2017-10-30 10:12:09  | barbar           |
+----------------+----------------------+------------------+--+

Then i want to do daily partition table for this case i need to create a new table having dt as partition column in it

Partition table:-

There are 2 kinds of partitions in hive

1.Static partitions //adding partition statically and loading data into it,takes less time than dynamic partitions as it won't need to look into data while creating partitions.
2.Dynamic partitions //creating partitions dynamically based on the column value, take more time than static partitions if data is huge because it needs to look into data while creating partitions.
hive# create table partition_table(
id int,
name string)
partitioned by (dt string);

1.Dynamic Partition:-

once you create partition table then select from non partition table,

hive# insert into partition_table partition(dt) select id,name, substring(dt,0,10) from text_table; //we need to have daily partition so i'm doing sub string from 0-10 i.e 2017-10-31 so this will create date partitions
INFO  :          Time taken to load dynamic partitions: 0.066 seconds
INFO  :         Loading partition {dt=2017-10-30} //creating 2017-10-30 partition
INFO  :         Loading partition {dt=2017-10-31} //creating 2017-10-30 partition
INFO  :          Time taken for adding to write entity : 0
INFO  : Partition default.partition_table{dt=2017-10-30} stats: [numFiles=1, numRows=2, totalSize=18, rawDataSize=16]
INFO  : Partition default.partition_table{dt=2017-10-31} stats: [numFiles=1, numRows=2, totalSize=12, rawDataSize=10]
No rows affected (10.055 seconds)

We are doing dynamic partitions in our above statement i.e we are creating partition based on our data.

if you want to view the partitions then give

hive# show partitions partition_table; //we can view all partitions that has create in the table.
+----------------+--+
|   partition    |
+----------------+--+
| dt=2017-10-30  |
| dt=2017-10-31  |
+----------------+--+
2 rows selected (0.064 seconds)

Drop partitions:-

hive# alter table partition_table drop partition(dt>'0') purge; //it will drop all the partitions (or) you can drop specific partition by mentioning as dt='2017-10-30'(it will drop only 2017-10-30 partition)
INFO  : Dropped the partition dt=2017-10-30
INFO  : Dropped the partition dt=2017-10-31
No rows affected (0.132 seconds)

To view all partition directories information

 hadoop fs -ls -R /apps/hive/warehouse/partition_table/
drwxrwxrwx   - hdfs          0 2017-11-01 21:45 /apps/hive/warehouse/partition_table/dt=2017-10-30 //partition directory
-rwxrwxrwx   3 hdfs         18 2017-11-01 21:45 /apps/hive/warehouse/partition_table/dt=2017-10-30/000000_0 //file in the partition
drwxrwxrwx   - hdfs          0 2017-11-01 21:45 /apps/hive/warehouse/partition_table/dt=2017-10-31
-rwxrwxrwx   3 hdfs         12 2017-11-01 21:45 /apps/hive/warehouse/partition_table/dt=2017-10-31/000000_0

To view data from one partition

select * from partition_table where dt='2017-10-30';
+---------------------+-----------------------+---------------------+--+
| partition_table.id  | partition_table.name  | partition_table.dt  |
+---------------------+-----------------------+---------------------+--+
| 1                   | foobar                | 2017-10-30          |
| 1                   | barbar                | 2017-10-30          |
+---------------------+-----------------------+---------------------+--+

As you can see out dt column in non partitioned table having 2017-10-30 12:12:09 but in partition table having 2017-10-30 because as we are loading the data to partition table we did sub string on dt column.

--> if you don't want to change the source data i.e dt column from non partition table to partition table then create partition table with

hive# create table partition_table(
id int,
name string,
dt string)
partitioned by (daily string); //new partition column
hive# insert into partition_table partition(daily) select id,name,dt, substring(dt,0,10) from text_table; //we are having daily as partition column and in select statement we have used dt column twice one is to load actual dt column data and second one is to create partition column.
show partitions partition_table;
+-------------------+--+
|     partition     |
+-------------------+--+
| daily=2017-10-30  |
| daily=2017-10-31  |
+-------------------+--+
2 rows selected (0.066 seconds)
0: jdbc:hive2://usor7dhc01w01.use.ucdp.net:21> select * from partition_table; //as you can see we haven't changed dt column data as we have new daily column as partition column
+---------------------+-----------------------+----------------------+------------------------+--+
| partition_table.id  | partition_table.name  |  partition_table.dt  | partition_table.daily  |
+---------------------+-----------------------+----------------------+------------------------+--+
| 1                   | foobar                | 2017-10-30 12:12:09  | 2017-10-30             |
| 1                   | barbar                | 2017-10-30 10:12:09  | 2017-10-30             |
| 1                   | foo                   | 2017-10-31 10:12:09  | 2017-10-31             |
| 1                   | bar                   | 2017-10-31 12:12:09  | 2017-10-31             |
+---------------------+-----------------------+----------------------+------------------------+--+

**keep in mind partition column needs to be last column in your select statement, if not hive creates partitions based on what ever the last column is in your select statement.

2.Static partition:-

We are statically creating partition and loading all the data into that partition,

hive# insert into partition_table partition(dt='2017-10-30') select id,name from text_table; //we are mentioned partition name here as dt='2017-10-30' so all data will be loaded into 2017-10-30 partition

if you are doing static partition that means all the dt data should be 2017-10-30 and you can view we haven't mentioned dt column in select statement.

hive# show partitions partition_table;
+----------------+--+
|   partition    |
+----------------+--+
| dt=2017-10-30  |
+----------------+--+

hive# select * from partition_table; //all dt will be 2017-10-30 because we are doing static partition column
+---------------------+-----------------------+---------------------+--+
| partition_table.id  | partition_table.name  | partition_table.dt  |
+---------------------+-----------------------+---------------------+--+
| 1                   | foo                   | 2017-10-30          |
| 1                   | bar                   | 2017-10-30          |
| 1                   | foobar                | 2017-10-30          |
| 1                   | barbar                | 2017-10-30          |
+---------------------+-----------------------+---------------------+--+

As you need to decide which kind of partitions are best fit for your case.

Hope this will help you to understand about partitions..!!

View solution in original post

1 REPLY 1

avatar
Master Guru

@Gayathri Devi

First you need to create a hive non partition table on raw data.

Then you need to create partition table in hive then insert from non partition table to partition table.

For testing i have tried an example as below:-

Right now my hive normal table(i.e not partition table) having these list of records.

Normal table(without partition column):-

hive# create table text_table(id int, dt string,name string) stored as textfile location '/user/yashu/text_table';
hive# select * from text_table;
+----------------+----------------------+------------------+--+
| text_table.id  |    text_table.dt     | text_table.name  |
+----------------+----------------------+------------------+--+
| 1              | 2017-10-31 10:12:09  | foo              |
| 1              | 2017-10-31 12:12:09  | bar              |
| 1              | 2017-10-30 12:12:09  | foobar           |
| 1              | 2017-10-30 10:12:09  | barbar           |
+----------------+----------------------+------------------+--+

Then i want to do daily partition table for this case i need to create a new table having dt as partition column in it

Partition table:-

There are 2 kinds of partitions in hive

1.Static partitions //adding partition statically and loading data into it,takes less time than dynamic partitions as it won't need to look into data while creating partitions.
2.Dynamic partitions //creating partitions dynamically based on the column value, take more time than static partitions if data is huge because it needs to look into data while creating partitions.
hive# create table partition_table(
id int,
name string)
partitioned by (dt string);

1.Dynamic Partition:-

once you create partition table then select from non partition table,

hive# insert into partition_table partition(dt) select id,name, substring(dt,0,10) from text_table; //we need to have daily partition so i'm doing sub string from 0-10 i.e 2017-10-31 so this will create date partitions
INFO  :          Time taken to load dynamic partitions: 0.066 seconds
INFO  :         Loading partition {dt=2017-10-30} //creating 2017-10-30 partition
INFO  :         Loading partition {dt=2017-10-31} //creating 2017-10-30 partition
INFO  :          Time taken for adding to write entity : 0
INFO  : Partition default.partition_table{dt=2017-10-30} stats: [numFiles=1, numRows=2, totalSize=18, rawDataSize=16]
INFO  : Partition default.partition_table{dt=2017-10-31} stats: [numFiles=1, numRows=2, totalSize=12, rawDataSize=10]
No rows affected (10.055 seconds)

We are doing dynamic partitions in our above statement i.e we are creating partition based on our data.

if you want to view the partitions then give

hive# show partitions partition_table; //we can view all partitions that has create in the table.
+----------------+--+
|   partition    |
+----------------+--+
| dt=2017-10-30  |
| dt=2017-10-31  |
+----------------+--+
2 rows selected (0.064 seconds)

Drop partitions:-

hive# alter table partition_table drop partition(dt>'0') purge; //it will drop all the partitions (or) you can drop specific partition by mentioning as dt='2017-10-30'(it will drop only 2017-10-30 partition)
INFO  : Dropped the partition dt=2017-10-30
INFO  : Dropped the partition dt=2017-10-31
No rows affected (0.132 seconds)

To view all partition directories information

 hadoop fs -ls -R /apps/hive/warehouse/partition_table/
drwxrwxrwx   - hdfs          0 2017-11-01 21:45 /apps/hive/warehouse/partition_table/dt=2017-10-30 //partition directory
-rwxrwxrwx   3 hdfs         18 2017-11-01 21:45 /apps/hive/warehouse/partition_table/dt=2017-10-30/000000_0 //file in the partition
drwxrwxrwx   - hdfs          0 2017-11-01 21:45 /apps/hive/warehouse/partition_table/dt=2017-10-31
-rwxrwxrwx   3 hdfs         12 2017-11-01 21:45 /apps/hive/warehouse/partition_table/dt=2017-10-31/000000_0

To view data from one partition

select * from partition_table where dt='2017-10-30';
+---------------------+-----------------------+---------------------+--+
| partition_table.id  | partition_table.name  | partition_table.dt  |
+---------------------+-----------------------+---------------------+--+
| 1                   | foobar                | 2017-10-30          |
| 1                   | barbar                | 2017-10-30          |
+---------------------+-----------------------+---------------------+--+

As you can see out dt column in non partitioned table having 2017-10-30 12:12:09 but in partition table having 2017-10-30 because as we are loading the data to partition table we did sub string on dt column.

--> if you don't want to change the source data i.e dt column from non partition table to partition table then create partition table with

hive# create table partition_table(
id int,
name string,
dt string)
partitioned by (daily string); //new partition column
hive# insert into partition_table partition(daily) select id,name,dt, substring(dt,0,10) from text_table; //we are having daily as partition column and in select statement we have used dt column twice one is to load actual dt column data and second one is to create partition column.
show partitions partition_table;
+-------------------+--+
|     partition     |
+-------------------+--+
| daily=2017-10-30  |
| daily=2017-10-31  |
+-------------------+--+
2 rows selected (0.066 seconds)
0: jdbc:hive2://usor7dhc01w01.use.ucdp.net:21> select * from partition_table; //as you can see we haven't changed dt column data as we have new daily column as partition column
+---------------------+-----------------------+----------------------+------------------------+--+
| partition_table.id  | partition_table.name  |  partition_table.dt  | partition_table.daily  |
+---------------------+-----------------------+----------------------+------------------------+--+
| 1                   | foobar                | 2017-10-30 12:12:09  | 2017-10-30             |
| 1                   | barbar                | 2017-10-30 10:12:09  | 2017-10-30             |
| 1                   | foo                   | 2017-10-31 10:12:09  | 2017-10-31             |
| 1                   | bar                   | 2017-10-31 12:12:09  | 2017-10-31             |
+---------------------+-----------------------+----------------------+------------------------+--+

**keep in mind partition column needs to be last column in your select statement, if not hive creates partitions based on what ever the last column is in your select statement.

2.Static partition:-

We are statically creating partition and loading all the data into that partition,

hive# insert into partition_table partition(dt='2017-10-30') select id,name from text_table; //we are mentioned partition name here as dt='2017-10-30' so all data will be loaded into 2017-10-30 partition

if you are doing static partition that means all the dt data should be 2017-10-30 and you can view we haven't mentioned dt column in select statement.

hive# show partitions partition_table;
+----------------+--+
|   partition    |
+----------------+--+
| dt=2017-10-30  |
+----------------+--+

hive# select * from partition_table; //all dt will be 2017-10-30 because we are doing static partition column
+---------------------+-----------------------+---------------------+--+
| partition_table.id  | partition_table.name  | partition_table.dt  |
+---------------------+-----------------------+---------------------+--+
| 1                   | foo                   | 2017-10-30          |
| 1                   | bar                   | 2017-10-30          |
| 1                   | foobar                | 2017-10-30          |
| 1                   | barbar                | 2017-10-30          |
+---------------------+-----------------------+---------------------+--+

As you need to decide which kind of partitions are best fit for your case.

Hope this will help you to understand about partitions..!!