<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Partitioning - no existing column suitable in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Partitioning-no-existing-column-suitable/m-p/109897#M25823</link>
    <description>&lt;P&gt;Hadoop experts! &lt;/P&gt;&lt;P&gt;Say I have a table with: &lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;daily_date&lt;/TD&gt;&lt;TD&gt;STRING&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;fact1&lt;/TD&gt;&lt;TD&gt;STRING&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;fact2&lt;/TD&gt;&lt;TD&gt;STRING&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;fact_n&lt;/TD&gt;&lt;TD&gt;STRING&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;where daily_date is of format dd/MM/yyyy (oh horror!).&lt;/P&gt;&lt;P&gt;I want to partition the data by date (either yyyyMMdd, or just yyyyMM), but the current date format is no good. How do I create a table (or multiple staging tables) to deal with this? &lt;/P&gt;&lt;P&gt;At present my ingest script will load local data into an hdfs location, and into an un-partitioned hive staging table containing the daily_date column and the fact columns. From there I do: &lt;/P&gt;&lt;PRE&gt;INSERT OVERWRITE TABLE {final table} 
PARTITION (daily_date=${data_date})
SELECT fact1, fact_n
FROM {staging table}
WHERE daily_date={$data_date}
; &lt;/PRE&gt;&lt;P&gt;and where ${data_date} is a variable defined as the daily_date in the incoming data (which only comes in one day at a time). &lt;/P&gt;&lt;P&gt;But I can't get the partitioning  bit to work because of the dodgy daily_date format. &lt;/P&gt;&lt;P&gt;I wrote this to convert dd/MM/yyyy to yyyyMM, but don't know where to use it. &lt;/P&gt;&lt;PRE&gt;regexp_replace(substring(from_unixtime(unix_timestamp(daily_date, 'dd/MM/yyyy')),0,7),"-","")&lt;/PRE&gt;&lt;P&gt;I'm a bit stuck with this, any help would be very gratefully received.&lt;/P&gt;&lt;P&gt;(next thing I'll try is to load data into raw un-partitioned table, select into a new table with a new column nice_date, then select into a third table which is partitioned on nice_date).&lt;/P&gt;</description>
    <pubDate>Fri, 16 Sep 2022 10:15:10 GMT</pubDate>
    <dc:creator>rachel_wijsmull</dc:creator>
    <dc:date>2022-09-16T10:15:10Z</dc:date>
    <item>
      <title>Partitioning - no existing column suitable</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Partitioning-no-existing-column-suitable/m-p/109897#M25823</link>
      <description>&lt;P&gt;Hadoop experts! &lt;/P&gt;&lt;P&gt;Say I have a table with: &lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;daily_date&lt;/TD&gt;&lt;TD&gt;STRING&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;fact1&lt;/TD&gt;&lt;TD&gt;STRING&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;fact2&lt;/TD&gt;&lt;TD&gt;STRING&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;fact_n&lt;/TD&gt;&lt;TD&gt;STRING&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;where daily_date is of format dd/MM/yyyy (oh horror!).&lt;/P&gt;&lt;P&gt;I want to partition the data by date (either yyyyMMdd, or just yyyyMM), but the current date format is no good. How do I create a table (or multiple staging tables) to deal with this? &lt;/P&gt;&lt;P&gt;At present my ingest script will load local data into an hdfs location, and into an un-partitioned hive staging table containing the daily_date column and the fact columns. From there I do: &lt;/P&gt;&lt;PRE&gt;INSERT OVERWRITE TABLE {final table} 
PARTITION (daily_date=${data_date})
SELECT fact1, fact_n
FROM {staging table}
WHERE daily_date={$data_date}
; &lt;/PRE&gt;&lt;P&gt;and where ${data_date} is a variable defined as the daily_date in the incoming data (which only comes in one day at a time). &lt;/P&gt;&lt;P&gt;But I can't get the partitioning  bit to work because of the dodgy daily_date format. &lt;/P&gt;&lt;P&gt;I wrote this to convert dd/MM/yyyy to yyyyMM, but don't know where to use it. &lt;/P&gt;&lt;PRE&gt;regexp_replace(substring(from_unixtime(unix_timestamp(daily_date, 'dd/MM/yyyy')),0,7),"-","")&lt;/PRE&gt;&lt;P&gt;I'm a bit stuck with this, any help would be very gratefully received.&lt;/P&gt;&lt;P&gt;(next thing I'll try is to load data into raw un-partitioned table, select into a new table with a new column nice_date, then select into a third table which is partitioned on nice_date).&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 10:15:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Partitioning-no-existing-column-suitable/m-p/109897#M25823</guid>
      <dc:creator>rachel_wijsmull</dc:creator>
      <dc:date>2022-09-16T10:15:10Z</dc:date>
    </item>
    <item>
      <title>Re: Partitioning - no existing column suitable</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Partitioning-no-existing-column-suitable/m-p/109898#M25824</link>
      <description>&lt;P&gt;So essentially you have three options:&lt;/P&gt;&lt;P&gt;a) you know that you only have data from that day in the temp table. So you can just drop the column completely and do a static partition load specifiyng the date in the format you want. Essentially just delete that one column during the insert.&lt;/P&gt;&lt;P&gt;INSERT OVERWRITE TABLE final partition ( daily_date=20160412 ) select id, name, &amp;lt;all columns but daily_date&amp;gt; from staging;&lt;/P&gt;&lt;P&gt;b) you have one or two days in there ( mid day ) so you just filter on the daily_date column but do the same thing otherwise ( in this case you dont want overwrite since the edge days will be loaded twice&lt;/P&gt;&lt;P&gt;INSERT APPEND TABLE final partition ( daily_date=20160412 ) select id, name, &amp;lt;all columns but daily_date&amp;gt; from staging where daily_date="12/04/2016";&lt;/P&gt;&lt;P&gt;INSERT APPEND TABLE final partition ( daily_date=20160411 ) select id, name, &amp;lt;all columns but daily_date&amp;gt; from staging where daily_date="11/04/2016";&lt;/P&gt;&lt;P&gt;c) you use dynamic partitioning and let hive figure it out&lt;/P&gt;&lt;P&gt;INSERT APPEND TABLE final partition ( daily_date ) select id, name, , all columns but daily_date, regex...  as daily_date&lt;/P&gt;&lt;P&gt; from staging;&lt;/P&gt;&lt;P&gt;The only thing is that your partition column needs to be last ( also I like ints instead of string i.e. 20160412 as an integer makes for easier computations like day + 7 or something ) &lt;/P&gt;&lt;P&gt;You also need to enable some parameters for dynamic partitioning:&lt;/P&gt;&lt;P&gt;&lt;A href="http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data" target="_blank"&gt;http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 21 Apr 2016 22:17:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Partitioning-no-existing-column-suitable/m-p/109898#M25824</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-04-21T22:17:16Z</dc:date>
    </item>
    <item>
      <title>Re: Partitioning - no existing column suitable</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Partitioning-no-existing-column-suitable/m-p/109899#M25825</link>
      <description>&lt;P&gt;Thanks Benjamin, great information. &lt;/P&gt;&lt;P&gt;About option a) - in this case a new (and query-able) meta-column called daily_date in the nice format would be created in the final table, wouldn't it?  [Edit:just done it, yes it is]&lt;/P&gt;&lt;P&gt;To make this work as an automated process where hive -e is called in a shell script, I would just need to set the new daily_date as a variable somewhere before the hive call (I think). &lt;/P&gt;&lt;P&gt;Then:&lt;/P&gt;&lt;PRE&gt;INSERT OVERWRITE TABLE final
PARTITION (daily_date=${nice_date})
SELECT facts, otherFact&amp;lt;exclude daily_date&amp;gt;
FROM staging
;
&lt;/PRE&gt;&lt;P&gt;Should work! &lt;/P&gt;&lt;P&gt;Thanks again.&lt;/P&gt;</description>
      <pubDate>Fri, 22 Apr 2016 23:24:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Partitioning-no-existing-column-suitable/m-p/109899#M25825</guid>
      <dc:creator>rachel_wijsmull</dc:creator>
      <dc:date>2016-04-22T23:24:29Z</dc:date>
    </item>
  </channel>
</rss>

