Created on 04-21-2016 02:35 PM - edited 09-16-2022 03:15 AM
Hadoop experts!
Say I have a table with:
daily_date | STRING |
fact1 | STRING |
fact2 | STRING |
fact_n | STRING |
where daily_date is of format dd/MM/yyyy (oh horror!).
I want to partition the data by date (either yyyyMMdd, or just yyyyMM), but the current date format is no good. How do I create a table (or multiple staging tables) to deal with this?
At present my ingest script will load local data into an hdfs location, and into an un-partitioned hive staging table containing the daily_date column and the fact columns. From there I do:
INSERT OVERWRITE TABLE {final table} PARTITION (daily_date=${data_date}) SELECT fact1, fact_n FROM {staging table} WHERE daily_date={$data_date} ;
and where ${data_date} is a variable defined as the daily_date in the incoming data (which only comes in one day at a time).
But I can't get the partitioning bit to work because of the dodgy daily_date format.
I wrote this to convert dd/MM/yyyy to yyyyMM, but don't know where to use it.
regexp_replace(substring(from_unixtime(unix_timestamp(daily_date, 'dd/MM/yyyy')),0,7),"-","")
I'm a bit stuck with this, any help would be very gratefully received.
(next thing I'll try is to load data into raw un-partitioned table, select into a new table with a new column nice_date, then select into a third table which is partitioned on nice_date).
Created 04-21-2016 03:17 PM
So essentially you have three options:
a) you know that you only have data from that day in the temp table. So you can just drop the column completely and do a static partition load specifiyng the date in the format you want. Essentially just delete that one column during the insert.
INSERT OVERWRITE TABLE final partition ( daily_date=20160412 ) select id, name, <all columns but daily_date> from staging;
b) you have one or two days in there ( mid day ) so you just filter on the daily_date column but do the same thing otherwise ( in this case you dont want overwrite since the edge days will be loaded twice
INSERT APPEND TABLE final partition ( daily_date=20160412 ) select id, name, <all columns but daily_date> from staging where daily_date="12/04/2016";
INSERT APPEND TABLE final partition ( daily_date=20160411 ) select id, name, <all columns but daily_date> from staging where daily_date="11/04/2016";
c) you use dynamic partitioning and let hive figure it out
INSERT APPEND TABLE final partition ( daily_date ) select id, name, , all columns but daily_date, regex... as daily_date
from staging;
The only thing is that your partition column needs to be last ( also I like ints instead of string i.e. 20160412 as an integer makes for easier computations like day + 7 or something )
You also need to enable some parameters for dynamic partitioning:
http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
Created 04-21-2016 03:17 PM
So essentially you have three options:
a) you know that you only have data from that day in the temp table. So you can just drop the column completely and do a static partition load specifiyng the date in the format you want. Essentially just delete that one column during the insert.
INSERT OVERWRITE TABLE final partition ( daily_date=20160412 ) select id, name, <all columns but daily_date> from staging;
b) you have one or two days in there ( mid day ) so you just filter on the daily_date column but do the same thing otherwise ( in this case you dont want overwrite since the edge days will be loaded twice
INSERT APPEND TABLE final partition ( daily_date=20160412 ) select id, name, <all columns but daily_date> from staging where daily_date="12/04/2016";
INSERT APPEND TABLE final partition ( daily_date=20160411 ) select id, name, <all columns but daily_date> from staging where daily_date="11/04/2016";
c) you use dynamic partitioning and let hive figure it out
INSERT APPEND TABLE final partition ( daily_date ) select id, name, , all columns but daily_date, regex... as daily_date
from staging;
The only thing is that your partition column needs to be last ( also I like ints instead of string i.e. 20160412 as an integer makes for easier computations like day + 7 or something )
You also need to enable some parameters for dynamic partitioning:
http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
Created 04-22-2016 04:24 PM
Thanks Benjamin, great information.
About option a) - in this case a new (and query-able) meta-column called daily_date in the nice format would be created in the final table, wouldn't it? [Edit:just done it, yes it is]
To make this work as an automated process where hive -e is called in a shell script, I would just need to set the new daily_date as a variable somewhere before the hive call (I think).
Then:
INSERT OVERWRITE TABLE final PARTITION (daily_date=${nice_date}) SELECT facts, otherFact<exclude daily_date> FROM staging ;
Should work!
Thanks again.