<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark Streaming Creating Small files in Hive in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Streaming-Creating-Small-files-in-Hive/m-p/188332#M70654</link>
    <description>&lt;P&gt;Merging is not happening because you are writing with Spark, not through Hive, so these Hive merge configurations don't apply.&lt;/P&gt;&lt;P&gt;There are two likely reasons for the large number of files:&lt;/P&gt;&lt;P&gt; 1 - Spark has a default parallelism of 200 and writes one file per partition, so each Spark minibatch can write 200 files. This is easily solved, especially if each minibatch writes little data, by reducing the parallelism before writing with `coalesce` (possibly coalescing to 1 to write a single file per minibatch).&lt;/P&gt;&lt;P&gt; 2 - Spark will in any case write at least one file per minibatch, so the file count also depends on how frequently the minibatches are scheduled. Here the solution is to periodically schedule a CONCATENATE job (but be careful: you might encounter HIVE-17280 -&amp;gt; HIVE-17403), or to write your own application with your own logic to do the concatenation.&lt;/P&gt;</description>
    <pubDate>Fri, 03 Nov 2017 16:04:37 GMT</pubDate>
    <dc:creator>mgaido1</dc:creator>
    <dc:date>2017-11-03T16:04:37Z</dc:date>
    <item>
      <title>Spark Streaming Creating Small files in Hive</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Streaming-Creating-Small-files-in-Hive/m-p/188331#M70653</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have a Spark Streaming application which analyses log files and processes them, and eventually dumps the processed results into an internal Hive table. The problem is that when Spark loads the data it creates small files, and even though I have set all the merge-related options in the Hive configuration to true, merging still isn't happening. Please check the attached image of the config parameters. Any help will be greatly appreciated.&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Chandra&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="43417-hive-config.png" style="width: 1005px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/18659iE29D816259F7C6F4/image-size/medium?v=v2&amp;amp;px=400" role="button" title="43417-hive-config.png" alt="43417-hive-config.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 18 Aug 2019 07:55:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Streaming-Creating-Small-files-in-Hive/m-p/188331#M70653</guid>
      <dc:creator>Chandra</dc:creator>
      <dc:date>2019-08-18T07:55:05Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming Creating Small files in Hive</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Streaming-Creating-Small-files-in-Hive/m-p/188332#M70654</link>
      <description>&lt;P&gt;Merging is not happening because you are writing with Spark, not through Hive, so these Hive merge configurations don't apply.&lt;/P&gt;&lt;P&gt;There are two likely reasons for the large number of files:&lt;/P&gt;&lt;P&gt; 1 - Spark has a default parallelism of 200 and writes one file per partition, so each Spark minibatch can write 200 files. This is easily solved, especially if each minibatch writes little data, by reducing the parallelism before writing with `coalesce` (possibly coalescing to 1 to write a single file per minibatch).&lt;/P&gt;&lt;P&gt; 2 - Spark will in any case write at least one file per minibatch, so the file count also depends on how frequently the minibatches are scheduled. Here the solution is to periodically schedule a CONCATENATE job (but be careful: you might encounter HIVE-17280 -&amp;gt; HIVE-17403), or to write your own application with your own logic to do the concatenation.&lt;/P&gt;</description>
      <pubDate>Fri, 03 Nov 2017 16:04:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Streaming-Creating-Small-files-in-Hive/m-p/188332#M70654</guid>
      <dc:creator>mgaido1</dc:creator>
      <dc:date>2017-11-03T16:04:37Z</dc:date>
    </item>
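The reply above suggests, as a fallback to Hive's CONCATENATE, writing your own application to merge the small files. A minimal sketch of that idea in plain Python (the function name and directory layout are illustrative, and this crude byte-level merge is only valid for text-format tables; ORC or Parquet files would need a format-aware rewrite, e.g. reading and re-writing them with Spark):

```python
import os

def concatenate_small_files(src_dir, dest_path):
    """Merge every regular file in src_dir into a single file at
    dest_path, then delete the originals. A stand-in for Hive's
    ALTER TABLE ... CONCATENATE, for plain-text data only."""
    with open(dest_path, "wb") as out:
        # Sort for a deterministic merge order (row order is not
        # meaningful for an unordered table, but determinism helps testing).
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if os.path.isfile(path):
                with open(path, "rb") as src:
                    out.write(src.read())
                os.remove(path)
```

Scheduled periodically (e.g. from cron or Oozie) against a partition directory, this keeps the file count bounded regardless of how many minibatches Spark writes; the Spark-side `coalesce` fix from point 1 is still worth applying first so fewer files are created to begin with.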
    <item>
      <title>Re: Spark Streaming Creating Small files in Hive</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Streaming-Creating-Small-files-in-Hive/m-p/188333#M70655</link>
      <description>&lt;P&gt;Thanks very much. I see now what's going on. I tried both of your suggestions and they seem to work well.&lt;/P&gt;</description>
      <pubDate>Fri, 03 Nov 2017 23:17:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Streaming-Creating-Small-files-in-Hive/m-p/188333#M70655</guid>
      <dc:creator>Chandra</dc:creator>
      <dc:date>2017-11-03T23:17:11Z</dc:date>
    </item>
  </channel>
</rss>

