<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Hive compactions on External table in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142346#M27929</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10386/cmac458.html" nodeid="10386"&gt;@Chris McGuire&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;Make sure the following properties are set to true, as they would merge the small files into one or more larger files.&lt;/P&gt;&lt;P&gt;hive.merge.mapfiles&lt;/P&gt;&lt;P&gt;hive.merge.mapredfiles&lt;/P&gt;&lt;P&gt;hive.merge.tezfiles&lt;/P&gt;</description>
    <pubDate>Thu, 12 May 2016 01:28:43 GMT</pubDate>
    <dc:creator>quadoss</dc:creator>
    <dc:date>2016-05-12T01:28:43Z</dc:date>
    <item>
      <title>Hive compactions on External table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142344#M27927</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am currently using Spark Streaming to write to an external Hive table every 30 minutes.&lt;/P&gt;
&lt;PRE&gt;rdd.toDF().write.partitionBy("dt").options(options).format("orc").mode(SaveMode.Append).saveAsTable("table_name")
&lt;/PRE&gt;&lt;P&gt;The issue is that this creates lots of small files in HDFS, like so:&lt;/P&gt;&lt;PRE&gt;part-00000
part-00000_copy_1&lt;/PRE&gt;&lt;P&gt;My table was created with transactions enabled, and I have enabled ACID transactions on the Hive instance. However, I can't see any compactions running, nor do any get created when I force compaction with the ALTER TABLE command. I would expect compaction to run and merge these files, as they are very small (about 200 KB each).&lt;/P&gt;&lt;P&gt;Any ideas or help greatly appreciated.&lt;/P&gt;</description>
      <pubDate>Wed, 11 May 2016 18:05:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142344#M27927</guid>
      <dc:creator>chrismcg89</dc:creator>
      <dc:date>2016-05-11T18:05:30Z</dc:date>
    </item>
    <item>
      <title>Re: Hive compactions on External table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142345#M27928</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/10386/cmac458.html" nodeid="10386"&gt;@Chris McGuire&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;Can you please provide the output of "hdfs dfs -ls -R &amp;lt;table-folder&amp;gt;"?&lt;/P&gt;&lt;P&gt;Compaction only operates on tables with delta directories. I suspect that the method you're using (SaveMode.Append) is just appending to the existing partition (or adding a new partition) and not actually creating deltas.&lt;/P&gt;&lt;P&gt;Best,&lt;/P&gt;&lt;P&gt;Eric&lt;/P&gt;</description>
      <pubDate>Thu, 12 May 2016 01:11:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142345#M27928</guid>
      <dc:creator>ewalk</dc:creator>
      <dc:date>2016-05-12T01:11:07Z</dc:date>
    </item>
    <item>
      <title>Re: Hive compactions on External table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142346#M27929</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10386/cmac458.html" nodeid="10386"&gt;@Chris McGuire&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;Make sure the following properties are set to true, as they would merge the small files into one or more larger files.&lt;/P&gt;&lt;P&gt;hive.merge.mapfiles&lt;/P&gt;&lt;P&gt;hive.merge.mapredfiles&lt;/P&gt;&lt;P&gt;hive.merge.tezfiles&lt;/P&gt;</description>
      <pubDate>Thu, 12 May 2016 01:28:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142346#M27929</guid>
      <dc:creator>quadoss</dc:creator>
      <dc:date>2016-05-12T01:28:43Z</dc:date>
    </item>
    <item>
      <title>Re: Hive compactions on External table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142347#M27930</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/804/ericwalk.html" nodeid="804"&gt;@Eric Walk&lt;/A&gt; Thanks. Yes, you are correct: Spark isn't writing deltas, it's just adding to the existing partition.&lt;/P&gt;&lt;P&gt;Any idea how to get Spark to write the deltas?&lt;/P&gt;&lt;PRE&gt;-rw-r--r--   3 cmcguire hdfs          0 2016-05-11 16:36 /test_data/test_test_tbl/_SUCCESS
drwxr-xr-x   - cmcguire hdfs          0 2016-05-11 16:40 /test_data/test_tbl/dt=11-05-2016
-rwxr-xr-x   3 cmcguire hdfs       3750 2016-05-11 16:37 /test_data/test_tbl/dt=11-05-2016/part-00000
-rwxr-xr-x   3 cmcguire hdfs       5468 2016-05-11 16:37 /test_data/test_tbl/dt=11-05-2016/part-00000_copy_1
-rwxr-xr-x   3 cmcguire hdfs       8264 2016-05-11 16:38 /test_data/test_tbl/dt=11-05-2016/part-00000_copy_2
-rwxr-xr-x   3 cmcguire hdfs       7068 2016-05-11 16:38 /test_data/test_tbl/dt=11-05-2016/part-00000_copy_3
-rwxr-xr-x   3 cmcguire hdfs       5157 2016-05-11 16:39 /test_data/test_tbl/dt=11-05-2016/part-00000_copy_4
-rwxr-xr-x   3 cmcguire hdfs      10684 2016-05-11 16:39 /test_data/test_tbl/dt=11-05-2016/part-00000_copy_5
-rwxr-xr-x   3 cmcguire hdfs       4796 2016-05-11 16:40 /test_data/test_tbl/dt=11-05-2016/part-00000_copy_6&lt;/PRE&gt;</description>
      <pubDate>Thu, 12 May 2016 01:55:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142347#M27930</guid>
      <dc:creator>chrismcg89</dc:creator>
      <dc:date>2016-05-12T01:55:11Z</dc:date>
    </item>
    <item>
      <title>Re: Hive compactions on External table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142348#M27931</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/716/mbalakrishnan.html" nodeid="716"&gt;@mbalakrishnan&lt;/A&gt;&lt;P&gt;Thanks, yes, those properties are set. I believe it's something to do with how the data is getting written to Hive via Spark Streaming.&lt;/P&gt;</description>
      <pubDate>Thu, 12 May 2016 01:56:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142348#M27931</guid>
      <dc:creator>chrismcg89</dc:creator>
      <dc:date>2016-05-12T01:56:03Z</dc:date>
    </item>
    <item>
      <title>Re: Hive compactions on External table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142349#M27932</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10386/cmac458.html" nodeid="10386"&gt;@Chris McGuire&lt;/A&gt;, I don't think you're using the Hive Streaming API, then. I'm not sure how Spark Streaming is set up to write out to Hive, so it could be behaving correctly.&lt;/P&gt;</description>
      <pubDate>Thu, 12 May 2016 02:09:02 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142349#M27932</guid>
      <dc:creator>ewalk</dc:creator>
      <dc:date>2016-05-12T02:09:02Z</dc:date>
    </item>
    <item>
      <title>Re: Hive compactions on External table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142350#M27933</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/716/mbalakrishnan.html" nodeid="716"&gt;@mbalakrishnan&lt;/A&gt;, do you think the merge might be missing these files because they originate from Spark, not map, mapred or Tez?</description>
      <pubDate>Thu, 12 May 2016 02:09:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142350#M27933</guid>
      <dc:creator>ewalk</dc:creator>
      <dc:date>2016-05-12T02:09:56Z</dc:date>
    </item>
    <item>
      <title>Re: Hive compactions on External table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142351#M27934</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/10386/cmac458.html" nodeid="10386"&gt;@Chris McGuire&lt;/A&gt;, &lt;A rel="user" href="https://community.cloudera.com/users/804/ericwalk.html" nodeid="804"&gt;@Eric Walk&lt;/A&gt;,&lt;P&gt;Yes, that could well be the reason. There is a property for Hive to merge Spark files, called hive.merge.sparkfiles, which is false by default. You may want to enable it and also look at this wiki page for Hive on Spark configuration:&lt;/P&gt;&lt;P&gt;&lt;A href="https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started" target="_blank"&gt;https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 12 May 2016 02:19:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142351#M27934</guid>
      <dc:creator>quadoss</dc:creator>
      <dc:date>2016-05-12T02:19:20Z</dc:date>
    </item>
    <item>
      <title>Re: Hive compactions on External table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142352#M27935</link>
      <description>&lt;P&gt;Thanks, I will make sure the Spark version of the property is set.&lt;/P&gt;&lt;P&gt;Thanks for the help. I wonder if, instead of rdd.toDF().saveAsTable, I should be writing insert statements; this might force the delta files to be created.&lt;/P&gt;</description>
      <pubDate>Thu, 12 May 2016 02:25:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142352#M27935</guid>
      <dc:creator>chrismcg89</dc:creator>
      <dc:date>2016-05-12T02:25:03Z</dc:date>
    </item>
    <item>
      <title>Re: Hive compactions on External table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142353#M27936</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10386/cmac458.html" nodeid="10386"&gt;@Chris McGuire&lt;/A&gt;, that is probably the case; I'm not very familiar with the way Spark is configured. I do know that, generally speaking, unless you explicitly say insert or use Hive streaming, you don't have deltas and don't need to worry about compaction. The partition append merging is a whole different story...&lt;/P&gt;</description>
      <pubDate>Thu, 12 May 2016 02:29:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142353#M27936</guid>
      <dc:creator>ewalk</dc:creator>
      <dc:date>2016-05-12T02:29:41Z</dc:date>
    </item>
    <item>
      <title>Re: Hive compactions on External table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142354#M27937</link>
      <description>&lt;P style="margin-left: 40px;"&gt;Thanks &lt;A rel="user" href="https://community.cloudera.com/users/716/mbalakrishnan.html" nodeid="716"&gt;@mbalakrishnan&lt;/A&gt;, I'm currently running the Spark Streaming job locally, and it writes to the Hive instance deployed on my cluster. I have added the hive.merge.sparkfiles property. Will this work on files written with the saveAsTable command?&lt;/P&gt;</description>
      <pubDate>Thu, 12 May 2016 04:59:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142354#M27937</guid>
      <dc:creator>chrismcg89</dc:creator>
      <dc:date>2016-05-12T04:59:09Z</dc:date>
    </item>
    <item>
      <title>Re: Hive compactions on External table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142355#M27938</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/10386/cmac458.html" nodeid="10386"&gt;@Chris McGuire&lt;/A&gt; I'm not sure whether this would work with the saveAsTable command, since I have little to no knowledge of Spark. I'm hoping that this property works for the Spark Streaming job as well.</description>
      <pubDate>Thu, 12 May 2016 06:06:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142355#M27938</guid>
      <dc:creator>quadoss</dc:creator>
      <dc:date>2016-05-12T06:06:09Z</dc:date>
    </item>
    <item>
      <title>Re: Hive compactions on External table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142356#M27939</link>
      <description>&lt;P&gt;Hive ACID tables are not integrated with Spark. To write to an ACID table in a streaming fashion you could use &lt;A href="https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-StreamingAPIs" target="_blank"&gt;https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-StreamingAPIs&lt;/A&gt;&lt;/P&gt;&lt;P&gt;(The hdfs dfs -ls -R output shows that the table is not in the expected format for an ACID table. You can check the metastore log for errors regarding compaction, but I would not expect it to work.)&lt;/P&gt;</description>
      <pubDate>Fri, 13 May 2016 01:24:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142356#M27939</guid>
      <dc:creator>ekoifman</dc:creator>
      <dc:date>2016-05-13T01:24:50Z</dc:date>
    </item>
    <item>
      <title>Re: Hive compactions on External table</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142357#M27940</link>
      <description>&lt;P&gt;Compaction works only on &lt;STRONG&gt;transactional&lt;/STRONG&gt; tables, and to be transactional a table must meet the following requirements:&lt;/P&gt;&lt;OL&gt;
&lt;LI&gt;It must be an ORC table&lt;/LI&gt;&lt;LI&gt;It must be bucketed&lt;/LI&gt;&lt;LI&gt;It must be a &lt;STRONG&gt;managed table&lt;/STRONG&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Note the last point: you can't run compaction on a non-transactional table. If you try it from Hive you will definitely get an error; I'm not sure about Spark.&lt;/P&gt;</description>
      <pubDate>Fri, 07 Sep 2018 22:08:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-compactions-on-External-table/m-p/142357#M27940</guid>
      <dc:creator>gaurang_n_shah</dc:creator>
      <dc:date>2018-09-07T22:08:48Z</dc:date>
    </item>
  </channel>
</rss>

