<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: can we apply the partitioning on the already existing Hive table in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130552#M93238</link>
    <description>&lt;P&gt;Hey Benjamin,&lt;/P&gt;&lt;P&gt;Is it good if i put one extra column on the partitioned on clause like PARTITON ON ( MONTHS INT, DAY INT , LOADBALANCER INT). Loadbalancer column in the source database, which is SQL Server, will have value 1 for the normal load. If source table has more data load then  Loadbalancer column will have more values like 1,2,... We can create a stored procedure in sql server will update the OLTP Loadbalancer column values in case we fell that we need to further partition data beyond month &amp;amp; day. &lt;/P&gt;&lt;P&gt;How it will be in long run instead of drop the existing dataset &amp;amp; recreating it ?&lt;/P&gt;</description>
    <pubDate>Mon, 08 Feb 2016 20:30:44 GMT</pubDate>
    <dc:creator>praveen_bora</dc:creator>
    <dc:date>2016-02-08T20:30:44Z</dc:date>
    <item>
      <title>can we apply the partitioning on the already existing Hive table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130549#M93235</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;We have some fact tables which contains large number of rows. We have partition applied on the month right now. It is most more likely that in coming future  we might need to apply partition by week number. As update command is missing in Hive so whenever there is situation to update the historical data we just drop the partition &amp;amp; create a new partition. So applying partitions is necessary. &lt;/P&gt;&lt;P&gt;I am wondering is applying partitioning on the existing columns in hive table POSSIBLE ?  &lt;/P&gt;&lt;P&gt;How to handle the situation where we have to apply the partitioning dynamically based on the load ?&lt;/P&gt;&lt;P&gt;i think dropping the table &amp;amp; recreating table for most of the requirement is not good thing&lt;/P&gt;</description>
      <pubDate>Mon, 08 Feb 2016 18:44:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130549#M93235</guid>
      <dc:creator>praveen_bora</dc:creator>
      <dc:date>2016-02-08T18:44:49Z</dc:date>
    </item>
    <item>
      <title>Re: can we apply the partitioning on the already existing Hive table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130550#M93236</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/2465/praveenbora.html" nodeid="2465"&gt;@Nirvana India&lt;/A&gt;&lt;P&gt;Please see this link &lt;A href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterPartition" target="_blank"&gt;https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterPartition&lt;/A&gt;&lt;/P&gt;&lt;H3&gt;Alter Partition&lt;/H3&gt;&lt;P&gt;Partitions can be added, renamed, exchanged, dropped, or (un)archived by using the PARTITION clause in an ALTER TABLE statement, as described below. To make the metastore aware of partitions that were added directly to HDFS, you can use the metastore check command (&lt;A href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions%28MSCKREPAIRTABLE%29"&gt;MSCK&lt;/A&gt;) or on Amazon EMR you can use the RECOVER PARTITIONS option of ALTER TABLE. See &lt;A href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterEitherTableorPartition"&gt;Alter Either Table or Partition&lt;/A&gt; below for more ways to alter partitions.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Feb 2016 19:51:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130550#M93236</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-02-08T19:51:30Z</dc:date>
    </item>
    <item>
      <title>Re: can we apply the partitioning on the already existing Hive table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130551#M93237</link>
      <description>&lt;P&gt;You cannot change the partitioning scheme on a table in Hive. This would have to rewrite the complete dataset since partitions are mapped to folders in HDFS.&lt;/P&gt;&lt;P&gt;What you need to do is create a new table with the new partitioning scheme and load the data into it from the old table:&lt;/P&gt;&lt;P&gt;CREATE TABLE NEWPARTITIONING ( COLUMNS ... ) PARTITON ON ( MONTHS INT, DAY INT ) as SELECT * from tablewitholdpartitioning.&lt;/P&gt;&lt;P&gt;Loading a large number of partitions at a time can result in bad loading patterns so be careful and follow the guidelines in my doc:&lt;/P&gt;&lt;P&gt;&lt;A href="http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data" target="_blank"&gt;http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 08 Feb 2016 20:06:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130551#M93237</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-02-08T20:06:17Z</dc:date>
    </item>
    <item>
      <title>Re: can we apply the partitioning on the already existing Hive table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130552#M93238</link>
      <description>&lt;P&gt;Hey Benjamin,&lt;/P&gt;&lt;P&gt;Is it good if i put one extra column on the partitioned on clause like PARTITON ON ( MONTHS INT, DAY INT , LOADBALANCER INT). Loadbalancer column in the source database, which is SQL Server, will have value 1 for the normal load. If source table has more data load then  Loadbalancer column will have more values like 1,2,... We can create a stored procedure in sql server will update the OLTP Loadbalancer column values in case we fell that we need to further partition data beyond month &amp;amp; day. &lt;/P&gt;&lt;P&gt;How it will be in long run instead of drop the existing dataset &amp;amp; recreating it ?&lt;/P&gt;</description>
      <pubDate>Mon, 08 Feb 2016 20:30:44 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130552#M93238</guid>
      <dc:creator>praveen_bora</dc:creator>
      <dc:date>2016-02-08T20:30:44Z</dc:date>
    </item>
    <item>
      <title>Re: can we apply the partitioning on the already existing Hive table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130553#M93239</link>
      <description>&lt;P&gt;How long will a reload take? Depends on your load settings ( how many reducers see ppt ) and the data volume. As I said loading a large partitioned table is a bit tricky:&lt;/P&gt;&lt;P&gt;Regarding the partitioning scheme: You can do month, day or just day. Depends on which columns you want to filter on. ( having just day is a bit easier for rollin, rollout but multi-level partitioning is also possible ) &lt;/P&gt;&lt;P&gt;Loadbalancer idea in a sec&lt;/P&gt;</description>
      <pubDate>Mon, 08 Feb 2016 20:51:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130553#M93239</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-02-08T20:51:50Z</dc:date>
    </item>
    <item>
      <title>Re: can we apply the partitioning on the already existing Hive table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130554#M93240</link>
      <description>&lt;P&gt;I do not understand the LOADBALANCER idea. Partitions only help you if you can filter based on the partition. But you will never run a query SELECT * FROM TABLE WHERE LOADBALANCER = 1 right? Then Hive needs to read all folders anyway which will actually result in worse performance in the HiveServer ( lots of partitions result in overhead in the split generation ). The day partitioning only  helps you if you have queries like SELECT * FROM TABLE WHERE DAY = 3;&lt;/P&gt;</description>
      <pubDate>Mon, 08 Feb 2016 20:55:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130554#M93240</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-02-08T20:55:04Z</dc:date>
    </item>
    <item>
      <title>Re: can we apply the partitioning on the already existing Hive table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130555#M93241</link>
      <description>&lt;P&gt;We are now loading our existing historical data into hive. Major fact tables have around 2 million or more rows. Loading 1.7 million rows took 3 hours in virtual box having 6 cores ,24 GB ram &amp;amp; 128GB disk. &lt;/P&gt;&lt;P&gt;I got your point. load balancer column should be some dimension column which is mostly used in the filtering. &lt;/P&gt;</description>
      <pubDate>Mon, 08 Feb 2016 21:42:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130555#M93241</guid>
      <dc:creator>praveen_bora</dc:creator>
      <dc:date>2016-02-08T21:42:19Z</dc:date>
    </item>
    <item>
      <title>Re: can we apply the partitioning on the already existing Hive table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130556#M93242</link>
      <description>&lt;P&gt;I got your point. Loadbalancer idea is not making sense. I was just thinking to break the data into small datasets so that query just check smaller dataset to prepare the output dataset. I am moving data from hive table (staging &amp;amp; unpartitioned) to another table (production &amp;amp; partitioned). Staging table has 1.7 million rows but query is failing with error &lt;STRONG&gt;Error : Java heap space&lt;/STRONG&gt; . do i need to increase the memory allocated to JVM ? Staging tables might have 5 million &amp;amp; more rows so what should be the most likely value of the memory allocated to the JVM??&lt;/P&gt;</description>
      <pubDate>Mon, 08 Feb 2016 21:51:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130556#M93242</guid>
      <dc:creator>praveen_bora</dc:creator>
      <dc:date>2016-02-08T21:51:14Z</dc:date>
    </item>
    <item>
      <title>Re: can we apply the partitioning on the already existing Hive table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130557#M93243</link>
      <description>&lt;P&gt;Look at the presentation for settings to increase your task RAM. OOM can occur for example in ORC creation. But normally they only occur if a task needs to write to too many partitions at a time therefore using one of the redistribution mechanisms in the ppt is helpful. How do you load them? Sqoop directly? Is your target table ORC? Are you loading one partition (day, month) at a time or a full time range? &lt;/P&gt;&lt;P&gt;As explained in the PPT dynamically loading into dozens of partitions easily leads into OOM exceptions. Redistribute your data to increase load performance and avoid OOM.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Feb 2016 22:07:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130557#M93243</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-02-08T22:07:01Z</dc:date>
    </item>
    <item>
      <title>Re: can we apply the partitioning on the already existing Hive table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130558#M93244</link>
      <description>&lt;P&gt;1. Can you tell me the url of the presentation so that i can increase the RAM ?&lt;/P&gt;&lt;P&gt;2. I have import data from sql server to hive table without specifying the any file format and data import successfully into hive table. Now i am trying to copy data from hive table to another table which as parquet format defined at table creation. I am trying to copy into all the partitions which are possible based on the combination of the three columns.&lt;/P&gt;&lt;P&gt;I have used &lt;STRONG&gt;insert into table t1 partition (c1,c2,c3) select * from t2. &lt;/STRONG&gt; Coping from one table to another (Parquet).&lt;/P&gt;</description>
      <pubDate>Mon, 08 Feb 2016 22:50:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130558#M93244</guid>
      <dc:creator>praveen_bora</dc:creator>
      <dc:date>2016-02-08T22:50:49Z</dc:date>
    </item>
    <item>
      <title>Re: can we apply the partitioning on the already existing Hive table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130559#M93245</link>
      <description>&lt;P&gt;&lt;A href="http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data"&gt;http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Are you using CDH or HDP? In HDP I would propose ORC format. Its very similar to Parquet and just better supported and tested. &lt;/P&gt;&lt;P&gt;If your load from SQL Server is slow its most likely not the hive creation but sqoop. So you could increase the number of mappers but there might not be an easy fix. If you have the problems in the INSERT INTO you can look into the PPT for tips. ( Specifically the distribution methods near the end ) &lt;/P&gt;</description>
      <pubDate>Mon, 08 Feb 2016 23:16:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130559#M93245</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-02-08T23:16:13Z</dc:date>
    </item>
    <item>
      <title>Re: can we apply the partitioning on the already existing Hive table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130560#M93246</link>
      <description>&lt;P&gt;we are on the CDH. I will have a look on the PPT. Can you answer my another comment on &lt;A href="https://community.hortonworks.com/questions/14313/facing-issues-while-ingesting-data-into-hive.html" target="_blank"&gt;https://community.hortonworks.com/questions/14313/facing-issues-while-ingesting-data-into-hive.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 08 Feb 2016 23:26:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/can-we-apply-the-partitioning-on-the-already-existing-Hive/m-p/130560#M93246</guid>
      <dc:creator>praveen_bora</dc:creator>
      <dc:date>2016-02-08T23:26:29Z</dc:date>
    </item>
  </channel>
</rss>

