Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Partitioning - no existing column suitable

avatar

Hadoop experts!

Say I have a table with:

daily_dateSTRING
fact1STRING
fact2STRING
fact_nSTRING

where daily_date is of format dd/MM/yyyy (oh horror!).

I want to partition the data by date (either yyyyMMdd, or just yyyyMM), but the current date format is no good. How do I create a table (or multiple staging tables) to deal with this?

At present my ingest script will load local data into an hdfs location, and into an un-partitioned hive staging table containing the daily_date column and the fact columns. From there I do:

INSERT OVERWRITE TABLE {final table} 
PARTITION (daily_date=${data_date})
SELECT fact1, fact_n
FROM {staging table}
WHERE daily_date={$data_date}
; 

and where ${data_date} is a variable defined as the daily_date in the incoming data (which only comes in one day at a time).

But I can't get the partitioning bit to work because of the dodgy daily_date format.

I wrote this to convert dd/MM/yyyy to yyyyMM, but don't know where to use it.

regexp_replace(substring(from_unixtime(unix_timestamp(daily_date, 'dd/MM/yyyy')),0,7),"-","")

I'm a bit stuck with this, any help would be very gratefully received.

(next thing I'll try is to load data into raw un-partitioned table, select into a new table with a new column nice_date, then select into a third table which is partitioned on nice_date).

1 ACCEPTED SOLUTION

avatar
Master Guru

So essentially you have three options:

a) you know that you only have data from that day in the temp table. So you can just drop the column completely and do a static partition load specifiyng the date in the format you want. Essentially just delete that one column during the insert.

INSERT OVERWRITE TABLE final partition ( daily_date=20160412 ) select id, name, <all columns but daily_date> from staging;

b) you have one or two days in there ( mid day ) so you just filter on the daily_date column but do the same thing otherwise ( in this case you dont want overwrite since the edge days will be loaded twice

INSERT APPEND TABLE final partition ( daily_date=20160412 ) select id, name, <all columns but daily_date> from staging where daily_date="12/04/2016";

INSERT APPEND TABLE final partition ( daily_date=20160411 ) select id, name, <all columns but daily_date> from staging where daily_date="11/04/2016";

c) you use dynamic partitioning and let hive figure it out

INSERT APPEND TABLE final partition ( daily_date ) select id, name, , all columns but daily_date, regex... as daily_date

from staging;

The only thing is that your partition column needs to be last ( also I like ints instead of string i.e. 20160412 as an integer makes for easier computations like day + 7 or something )

You also need to enable some parameters for dynamic partitioning:

http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data

View solution in original post

2 REPLIES 2

avatar
Master Guru

So essentially you have three options:

a) you know that you only have data from that day in the temp table. So you can just drop the column completely and do a static partition load specifiyng the date in the format you want. Essentially just delete that one column during the insert.

INSERT OVERWRITE TABLE final partition ( daily_date=20160412 ) select id, name, <all columns but daily_date> from staging;

b) you have one or two days in there ( mid day ) so you just filter on the daily_date column but do the same thing otherwise ( in this case you dont want overwrite since the edge days will be loaded twice

INSERT APPEND TABLE final partition ( daily_date=20160412 ) select id, name, <all columns but daily_date> from staging where daily_date="12/04/2016";

INSERT APPEND TABLE final partition ( daily_date=20160411 ) select id, name, <all columns but daily_date> from staging where daily_date="11/04/2016";

c) you use dynamic partitioning and let hive figure it out

INSERT APPEND TABLE final partition ( daily_date ) select id, name, , all columns but daily_date, regex... as daily_date

from staging;

The only thing is that your partition column needs to be last ( also I like ints instead of string i.e. 20160412 as an integer makes for easier computations like day + 7 or something )

You also need to enable some parameters for dynamic partitioning:

http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data

avatar

Thanks Benjamin, great information.

About option a) - in this case a new (and query-able) meta-column called daily_date in the nice format would be created in the final table, wouldn't it? [Edit:just done it, yes it is]

To make this work as an automated process where hive -e is called in a shell script, I would just need to set the new daily_date as a variable somewhere before the hive call (I think).

Then:

INSERT OVERWRITE TABLE final
PARTITION (daily_date=${nice_date})
SELECT facts, otherFact<exclude daily_date>
FROM staging
;

Should work!

Thanks again.