Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Partitioning - no existing column suitable

avatar
Contributor

Hadoop experts!

Say I have a table with:

daily_dateSTRING
fact1STRING
fact2STRING
fact_nSTRING

where daily_date is of format dd/MM/yyyy (oh horror!).

I want to partition the data by date (either yyyyMMdd, or just yyyyMM), but the current date format is no good. How do I create a table (or multiple staging tables) to deal with this?

At present my ingest script will load local data into an hdfs location, and into an un-partitioned hive staging table containing the daily_date column and the fact columns. From there I do:

INSERT OVERWRITE TABLE {final table} 
PARTITION (daily_date=${data_date})
SELECT fact1, fact_n
FROM {staging table}
WHERE daily_date={$data_date}
; 

and where ${data_date} is a variable defined as the daily_date in the incoming data (which only comes in one day at a time).

But I can't get the partitioning bit to work because of the dodgy daily_date format.

I wrote this to convert dd/MM/yyyy to yyyyMM, but don't know where to use it.

regexp_replace(substring(from_unixtime(unix_timestamp(daily_date, 'dd/MM/yyyy')),0,7),"-","")

I'm a bit stuck with this, any help would be very gratefully received.

(next thing I'll try is to load data into raw un-partitioned table, select into a new table with a new column nice_date, then select into a third table which is partitioned on nice_date).

1 ACCEPTED SOLUTION

avatar
Master Guru
hide-solution

This problem has been solved!

Want to get a detailed solution you have to login/registered on the community

Register/Login
2 REPLIES 2

avatar
Master Guru
hide-solution

This problem has been solved!

Want to get a detailed solution you have to login/registered on the community

Register/Login

avatar
Contributor

Thanks Benjamin, great information.

About option a) - in this case a new (and query-able) meta-column called daily_date in the nice format would be created in the final table, wouldn't it? [Edit:just done it, yes it is]

To make this work as an automated process where hive -e is called in a shell script, I would just need to set the new daily_date as a variable somewhere before the hive call (I think).

Then:

INSERT OVERWRITE TABLE final
PARTITION (daily_date=${nice_date})
SELECT facts, otherFact<exclude daily_date>
FROM staging
;

Should work!

Thanks again.