Support Questions

Find answers, ask questions, and share your expertise

Partitioning - no existing column suitable

avatar
Contributor

Hadoop experts!

Say I have a table with:

daily_dateSTRING
fact1STRING
fact2STRING
fact_nSTRING

where daily_date is of format dd/MM/yyyy (oh horror!).

I want to partition the data by date (either yyyyMMdd, or just yyyyMM), but the current date format is no good. How do I create a table (or multiple staging tables) to deal with this?

At present my ingest script will load local data into an hdfs location, and into an un-partitioned hive staging table containing the daily_date column and the fact columns. From there I do:

INSERT OVERWRITE TABLE {final table} 
PARTITION (daily_date=${data_date})
SELECT fact1, fact_n
FROM {staging table}
WHERE daily_date={$data_date}
; 

and where ${data_date} is a variable defined as the daily_date in the incoming data (which only comes in one day at a time).

But I can't get the partitioning bit to work because of the dodgy daily_date format.

I wrote this to convert dd/MM/yyyy to yyyyMM, but don't know where to use it.

regexp_replace(substring(from_unixtime(unix_timestamp(daily_date, 'dd/MM/yyyy')),0,7),"-","")

I'm a bit stuck with this, any help would be very gratefully received.

(next thing I'll try is to load data into raw un-partitioned table, select into a new table with a new column nice_date, then select into a third table which is partitioned on nice_date).

1 ACCEPTED SOLUTION

avatar
Master Guru

So essentially you have three options:

a) you know that you only have data from that day in the temp table. So you can just drop the column completely and do a static partition load specifiyng the date in the format you want. Essentially just delete that one column during the insert.

INSERT OVERWRITE TABLE final partition ( daily_date=20160412 ) select id, name, <all columns but daily_date> from staging;

b) you have one or two days in there ( mid day ) so you just filter on the daily_date column but do the same thing otherwise ( in this case you dont want overwrite since the edge days will be loaded twice

INSERT APPEND TABLE final partition ( daily_date=20160412 ) select id, name, <all columns but daily_date> from staging where daily_date="12/04/2016";

INSERT APPEND TABLE final partition ( daily_date=20160411 ) select id, name, <all columns but daily_date> from staging where daily_date="11/04/2016";

c) you use dynamic partitioning and let hive figure it out

INSERT APPEND TABLE final partition ( daily_date ) select id, name, , all columns but daily_date, regex... as daily_date

from staging;

The only thing is that your partition column needs to be last ( also I like ints instead of string i.e. 20160412 as an integer makes for easier computations like day + 7 or something )

You also need to enable some parameters for dynamic partitioning:

http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data

View solution in original post

2 REPLIES 2

avatar
Master Guru

So essentially you have three options:

a) you know that you only have data from that day in the temp table. So you can just drop the column completely and do a static partition load specifiyng the date in the format you want. Essentially just delete that one column during the insert.

INSERT OVERWRITE TABLE final partition ( daily_date=20160412 ) select id, name, <all columns but daily_date> from staging;

b) you have one or two days in there ( mid day ) so you just filter on the daily_date column but do the same thing otherwise ( in this case you dont want overwrite since the edge days will be loaded twice

INSERT APPEND TABLE final partition ( daily_date=20160412 ) select id, name, <all columns but daily_date> from staging where daily_date="12/04/2016";

INSERT APPEND TABLE final partition ( daily_date=20160411 ) select id, name, <all columns but daily_date> from staging where daily_date="11/04/2016";

c) you use dynamic partitioning and let hive figure it out

INSERT APPEND TABLE final partition ( daily_date ) select id, name, , all columns but daily_date, regex... as daily_date

from staging;

The only thing is that your partition column needs to be last ( also I like ints instead of string i.e. 20160412 as an integer makes for easier computations like day + 7 or something )

You also need to enable some parameters for dynamic partitioning:

http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data

avatar
Contributor

Thanks Benjamin, great information.

About option a) - in this case a new (and query-able) meta-column called daily_date in the nice format would be created in the final table, wouldn't it? [Edit:just done it, yes it is]

To make this work as an automated process where hive -e is called in a shell script, I would just need to set the new daily_date as a variable somewhere before the hive call (I think).

Then:

INSERT OVERWRITE TABLE final
PARTITION (daily_date=${nice_date})
SELECT facts, otherFact<exclude daily_date>
FROM staging
;

Should work!

Thanks again.