<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Hive partition - partition column as part of the data? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236009#M84946</link>
    <description>&lt;P&gt;if i partition a table by year - can i further bucket it by month?&lt;/P&gt;&lt;P&gt;so the idea is the year will be the top level and months will be at a level beneath it.&lt;/P&gt;&lt;P&gt;so the directory structure would be :&lt;/P&gt;&lt;P&gt;2018 -&amp;gt; 1, 2, 3 ... 12&lt;/P&gt;&lt;P&gt;2019 -&amp;gt; 1, 2, 3 ... 12&lt;/P&gt;&lt;P&gt;Is this what bucketing is about? Or should i be doing this someway with partitions itself?&lt;/P&gt;&lt;P&gt;Appreciate the insights.&lt;/P&gt;</description>
    <pubDate>Sat, 17 Nov 2018 01:54:58 GMT</pubDate>
    <dc:creator>vtpcnk</dc:creator>
    <dc:date>2018-11-17T01:54:58Z</dc:date>
    <item>
      <title>Hive partition - partition column as part of the data?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236003#M84940</link>
      <description>&lt;P&gt;i have a table with two string columns and one datetime column (which is also defined as string datatype).&lt;/P&gt;&lt;P&gt;I want to partition the table on monthly basis ie month(the datetime column).&lt;/P&gt;&lt;P&gt;So I did the below :&lt;/P&gt;&lt;P&gt;create table test_part_tbl (id string, cd string, dttm string) partitioned by (mth string);&lt;/P&gt;&lt;P&gt;insert into test_part_tbl partition(mth) select id, cd, create_dt, month(create_dt) from real_table;&lt;/P&gt;&lt;P&gt;hive&amp;gt; select * from test_part_tbl ; &lt;/P&gt;&lt;P&gt;OK &lt;/P&gt;&lt;P&gt;test_part_tbl.id        test_part_tbl.cd        test_part_tbl.dttm      test_part_tbl.mth &lt;/P&gt;&lt;P&gt;id1  cd1   2018-10-24   10 &lt;/P&gt;&lt;P&gt;id2  cd1   2018-10-24   10 &lt;/P&gt;&lt;P&gt;Time taken: 0.13 seconds, Fetched: 2 row(s)&lt;/P&gt;&lt;P&gt;So the month ie "10" is actually appearing as part of the table data. Is that correct?&lt;/P&gt;&lt;P&gt;Is it possible to partition the table as above and not have the partition column/value as part of the table data? ie when querying can't I use "month(dttm)" and search based on month?&lt;/P&gt;&lt;P&gt;Appreciate the insights.&lt;/P&gt;</description>
      <pubDate>Thu, 15 Nov 2018 03:13:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236003#M84940</guid>
      <dc:creator>vtpcnk</dc:creator>
      <dc:date>2018-11-15T03:13:55Z</dc:date>
    </item>
    <item>
      <title>Re: Hive partition - partition column as part of the data?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236004#M84941</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/1485/vtpcnk.html" nodeid="1485"&gt;@n c&lt;/A&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;&lt;EM&gt;So the month ie "10" is actually appearing as part of the table data. Is that correct?&lt;/EM&gt;&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Yes this is correct, when we create partition table we are going to have all partition columns at the end of the column list.&lt;/P&gt;&lt;P&gt;Partitions are going to boost the query performance when we are using partition column in out where clause.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;Example:&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;if you want to count number of records are in mth=10 then&lt;/P&gt;&lt;PRE&gt;select count(*) from test_par_tbl where mth=10;&lt;/PRE&gt;&lt;P&gt;Now the above query won't do full table scan as predicate only scan the mth=10 partition and shows up the result. when dealing with 100's of million datasets partitions will be optimization techniques to boost up the query performances by avoiding full table scans.&lt;/P&gt;&lt;P&gt;2.Even with out partition field in where clause you can still able to run the below query but this will do full table scan &lt;/P&gt;&lt;PRE&gt;select count(*) from test_par_tbl where month(create_dt)=10;&lt;/PRE&gt;&lt;P&gt;Both these queries will give you same results but taking performance as consideration on big data sets first query will run more efficiently.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;&lt;EM&gt;Is it possible to partition the table as above and not have the partition column/value as part of the table data?&lt;/EM&gt;&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;This is not possible because if you won't have partition column as part of table data then hive will do full table scan on the entire dataset.&lt;/P&gt;&lt;P&gt;If you still want to take off the partition column from the dataset, then &lt;STRONG&gt;create a view on top&lt;/STRONG&gt; of the partition_table it by excluding the column.&lt;/P&gt;</description>
      <pubDate>Thu, 15 Nov 2018 10:03:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236004#M84941</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2018-11-15T10:03:41Z</dc:date>
    </item>
    <item>
      <title>Re: Hive partition - partition column as part of the data?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236005#M84942</link>
      <description>&lt;P&gt;The below hive performance parameter - is it usually set within a map reduce program to be set at the time of execution :&lt;/P&gt;&lt;P&gt;SET hive.exec.parallel=true &lt;/P&gt;&lt;P&gt;Or can it be set at the global level in Ambari?&lt;/P&gt;&lt;P&gt;Appreciate the feedback.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Nov 2018 00:42:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236005#M84942</guid>
      <dc:creator>vtpcnk</dc:creator>
      <dc:date>2018-11-16T00:42:18Z</dc:date>
    </item>
    <item>
      <title>Re: Hive partition - partition column as part of the data?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236006#M84943</link>
      <description>&lt;P&gt;login to hive server and run the above command &lt;/P&gt;</description>
      <pubDate>Fri, 16 Nov 2018 00:49:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236006#M84943</guid>
      <dc:creator>duraitulasi</dc:creator>
      <dc:date>2018-11-16T00:49:18Z</dc:date>
    </item>
    <item>
      <title>Re: Hive partition - partition column as part of the data?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236007#M84944</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1485/vtpcnk.html" nodeid="1485"&gt;@n c&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Once you white list the param in ambari then you are able to set the parameter in hive cli.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Nov 2018 01:17:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236007#M84944</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2018-11-16T01:17:25Z</dc:date>
    </item>
    <item>
      <title>Re: Hive partition - partition column as part of the data?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236008#M84945</link>
      <description>&lt;P&gt;actually I am working with cloudera now and i dont see hive.exec.parallel as a configurable option in cloudera manager.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Nov 2018 01:53:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236008#M84945</guid>
      <dc:creator>vtpcnk</dc:creator>
      <dc:date>2018-11-16T01:53:34Z</dc:date>
    </item>
    <item>
      <title>Re: Hive partition - partition column as part of the data?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236009#M84946</link>
      <description>&lt;P&gt;if i partition a table by year - can i further bucket it by month?&lt;/P&gt;&lt;P&gt;so the idea is the year will be the top level and months will be at a level beneath it.&lt;/P&gt;&lt;P&gt;so the directory structure would be :&lt;/P&gt;&lt;P&gt;2018 -&amp;gt; 1, 2, 3 ... 12&lt;/P&gt;&lt;P&gt;2019 -&amp;gt; 1, 2, 3 ... 12&lt;/P&gt;&lt;P&gt;Is this what bucketing is about? Or should i be doing this someway with partitions itself?&lt;/P&gt;&lt;P&gt;Appreciate the insights.&lt;/P&gt;</description>
      <pubDate>Sat, 17 Nov 2018 01:54:58 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236009#M84946</guid>
      <dc:creator>vtpcnk</dc:creator>
      <dc:date>2018-11-17T01:54:58Z</dc:date>
    </item>
    <item>
      <title>Re: Hive partition - partition column as part of the data?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236010#M84947</link>
      <description>&lt;P&gt;I tried this but wouldn't work :&lt;/P&gt;&lt;P&gt;create table test_part_bkt_tbl (id string, cd string, dttm string) partitioned by (yr string) clustered by (month(dttm)) into 12 buckets;&lt;/P&gt;</description>
      <pubDate>Sat, 17 Nov 2018 02:11:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236010#M84947</guid>
      <dc:creator>vtpcnk</dc:creator>
      <dc:date>2018-11-17T02:11:23Z</dc:date>
    </item>
    <item>
      <title>Re: Hive partition - partition column as part of the data?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236011#M84948</link>
      <description>&lt;P&gt;ok, this can be done simply as : partitioned by (yr string, mth string). &lt;/P&gt;&lt;P&gt;tks.&lt;/P&gt;</description>
      <pubDate>Sat, 17 Nov 2018 02:18:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-partition-partition-column-as-part-of-the-data/m-p/236011#M84948</guid>
      <dc:creator>vtpcnk</dc:creator>
      <dc:date>2018-11-17T02:18:29Z</dc:date>
    </item>
  </channel>
</rss>

