<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How to reduce the small file problem in Spark using coalesce or otherwise? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-reduce-the-small-problem-in-spark-using-coalesce-or/m-p/188321#M65094</link>
    <description>&lt;P&gt;Thanks for the response. I am sorry, I don't really understand what you mean. Could you please provide an example?&lt;/P&gt;</description>
    <pubDate>Wed, 19 Jul 2017 00:08:28 GMT</pubDate>
    <dc:creator>Former Member</dc:creator>
    <dc:date>2017-07-19T00:08:28Z</dc:date>
    <item>
      <title>How to reduce the small file problem in Spark using coalesce or otherwise?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-reduce-the-small-problem-in-spark-using-coalesce-or/m-p/188319#M65092</link>
      <description>&lt;P&gt;When I insert my dataframe into a table, it creates a number of small files.&lt;/P&gt;&lt;P&gt;One solution I had was to coalesce to one file, but this greatly slows down the code. I am looking for a way to speed this up while still coalescing to 1.&lt;/P&gt;&lt;P&gt;Like this: &lt;/P&gt;&lt;PRE&gt;df_expl.coalesce(1)
  .write.mode("append")
  .partitionBy("p_id")
  .parquet(expl_hdfs_loc)&lt;/PRE&gt;&lt;P&gt;Or I am open to another solution.&lt;/P&gt;</description>
      <pubDate>Tue, 18 Jul 2017 04:11:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-reduce-the-small-problem-in-spark-using-coalesce-or/m-p/188319#M65092</guid>
      <dc:creator>Former Member</dc:creator>
      <dc:date>2017-07-18T04:11:30Z</dc:date>
    </item>
    <item>
      <title>Re: How to reduce the small file problem in Spark using coalesce or otherwise?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-reduce-the-small-problem-in-spark-using-coalesce-or/m-p/188320#M65093</link>
      <description>&lt;P&gt;&lt;A href="https://community.hortonworks.com/users/16464/yanks09champs.html"&gt;na&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Also use DISTRIBUTE BY so that data for the same partition goes to the same reducer.&lt;/P&gt;</description>
      <pubDate>Tue, 18 Jul 2017 08:46:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-reduce-the-small-problem-in-spark-using-coalesce-or/m-p/188320#M65093</guid>
      <dc:creator>mqureshi</dc:creator>
      <dc:date>2017-07-18T08:46:40Z</dc:date>
    </item>
    <item>
      <title>Re: How to reduce the small file problem in Spark using coalesce or otherwise?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-reduce-the-small-problem-in-spark-using-coalesce-or/m-p/188321#M65094</link>
      <description>&lt;P&gt;Thanks for the response. I am sorry, I don't really understand what you mean. Could you please provide an example?&lt;/P&gt;</description>
      <pubDate>Wed, 19 Jul 2017 00:08:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-reduce-the-small-problem-in-spark-using-coalesce-or/m-p/188321#M65094</guid>
      <dc:creator>Former Member</dc:creator>
      <dc:date>2017-07-19T00:08:28Z</dc:date>
    </item>
    <item>
      <title>Re: How to reduce the small file problem in Spark using coalesce or otherwise?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-reduce-the-small-problem-in-spark-using-coalesce-or/m-p/188322#M65095</link>
      <description>&lt;P&gt;Please see the following link. In your code, you'll need to do a "repartition". What I am trying to say is that if you force more data to the same reducer, you will create fewer files. Call the repartition function on some key such that all the data for that key lands in the same partition.&lt;/P&gt;&lt;P&gt;&lt;A href="https://dzone.com/articles/optimize-spark-with-distribute-by-cluster-by"&gt;https://dzone.com/articles/optimize-spark-with-distribute-by-cluster-by&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 19 Jul 2017 00:39:32 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-reduce-the-small-problem-in-spark-using-coalesce-or/m-p/188322#M65095</guid>
      <dc:creator>mqureshi</dc:creator>
      <dc:date>2017-07-19T00:39:32Z</dc:date>
    </item>
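    <!--
    The repartition-by-key approach described above can be sketched as follows. This is a
    hypothetical example, not part of the original thread; it reuses the asker's names
    (df_expl, "p_id", expl_hdfs_loc) and assumes an active SparkSession.

    ```scala
    import org.apache.spark.sql.functions.col

    df_expl
      .repartition(col("p_id"))     // shuffle so rows sharing a p_id land in the same task
      .write
      .mode("append")
      .partitionBy("p_id")
      .parquet(expl_hdfs_loc)       // yields roughly one file per p_id output directory
    ```

    Unlike coalesce(1), the shuffle stays distributed: each task writes the file for its own
    key(s) in parallel instead of funnelling all data through a single task.
    -->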
    <item>
      <title>Re: How to reduce the small file problem in Spark using coalesce or otherwise?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-reduce-the-small-problem-in-spark-using-coalesce-or/m-p/188323#M65096</link>
      <description>&lt;P&gt;Thanks for the response; the link is very useful. I had one more question: if I do df.repartition(1).write, will this run on only one node, or will it run in a distributed way but create only one file? That is the problem I face with coalesce: when I write to parquet it writes one file, but only on one node, and I lose all the distributed advantages.&lt;/P&gt;&lt;P&gt;Is there any way I could do something like this: &lt;/P&gt;&lt;P&gt;joinedDF.repartition(col("partitionCol")).coalesce(1).write.mode("append").partitionBy("partitionCol").parquet(esdLocation), but have it coalesce to one file per partition?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Wed, 09 Aug 2017 20:49:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-reduce-the-small-problem-in-spark-using-coalesce-or/m-p/188323#M65096</guid>
      <dc:creator>Former Member</dc:creator>
      <dc:date>2017-08-09T20:49:16Z</dc:date>
    </item>
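    <!--
    A sketch answering the per-partition question above: repartitioning by the partition
    column alone, without coalesce(1), already tends to produce one file per partition value
    while the write remains distributed. Hypothetical; the names (joinedDF, "partitionCol",
    esdLocation) are the asker's, and an active SparkSession is assumed.

    ```scala
    import org.apache.spark.sql.functions.col

    joinedDF
      .repartition(col("partitionCol"))   // all rows for one value go to one shuffle partition
      .write
      .mode("append")
      .partitionBy("partitionCol")
      .parquet(esdLocation)               // roughly one file per partitionCol directory
    ```

    The Spark SQL equivalent of this shuffle is DISTRIBUTE BY partitionCol, as suggested
    earlier in the thread.
    -->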
    <item>
      <title>Re: How to reduce the small file problem in Spark using coalesce or otherwise?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-reduce-the-small-problem-in-spark-using-coalesce-or/m-p/188324#M65097</link>
      <description>&lt;P&gt;&lt;A href="https://community.hortonworks.com/questions/8010/hives-alter-table-partition-concatenate-not-workin.html" target="_blank"&gt;https://community.hortonworks.com/questions/8010/hives-alter-table-partition-concatenate-not-workin.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;This solution seems better.&lt;/P&gt;</description>
      <pubDate>Fri, 25 May 2018 08:04:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-reduce-the-small-problem-in-spark-using-coalesce-or/m-p/188324#M65097</guid>
      <dc:creator>brijesh_jaggi1</dc:creator>
      <dc:date>2018-05-25T08:04:37Z</dc:date>
    </item>
  </channel>
</rss>

