<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Number of intermediate files with Sort shuffle in Spark in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Number-of-intermediate-files-with-Sort-shuffle-in-Spark/m-p/29771#M6650</link>
    <description>&lt;P&gt;Hi everyone!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am trying to understand the sort shuffle in Spark and would really appreciate it if someone could answer a simple question. Let's imagine:&lt;/P&gt;&lt;P&gt;1) I have 600 partitions (HDFS blocks, for simplicity).&lt;/P&gt;&lt;P&gt;2) They are stored on a 6-node cluster.&lt;/P&gt;&lt;P&gt;3) I run Spark with the following parameters:&lt;/P&gt;&lt;PRE&gt;--executor-memory 13G --executor-cores 6 --num-executors 12 --driver-memory 1G --properties-file my-config.conf&lt;/PRE&gt;&lt;P&gt;That means each server will have 2 executors with 6 cores each.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;4) According to my config, the reduce phase has only 3 reducers.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So my question is: how many files will there be on &lt;STRONG&gt;each&lt;/STRONG&gt; server after the sort shuffle?&lt;/P&gt;&lt;P&gt;- 12, the number of concurrently active map tasks (2 executors x 6 cores)&lt;/P&gt;&lt;P&gt;- 2, the number of executors on each server&lt;/P&gt;&lt;P&gt;- 100, the number of partitions stored on this server (for simplicity, I just divide 600 by 6)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My second question: what is the name of the buffer that stores intermediate data on the map stage before it is spilled to disk?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Fri, 16 Sep 2022 09:34:43 GMT</pubDate>
    <dc:creator>fil</dc:creator>
    <dc:date>2022-09-16T09:34:43Z</dc:date>
    <item>
      <title>Number of intermediate files with Sort shuffle in Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Number-of-intermediate-files-with-Sort-shuffle-in-Spark/m-p/29771#M6650</link>
      <description>&lt;P&gt;Hi everyone!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am trying to understand the sort shuffle in Spark and would really appreciate it if someone could answer a simple question. Let's imagine:&lt;/P&gt;&lt;P&gt;1) I have 600 partitions (HDFS blocks, for simplicity).&lt;/P&gt;&lt;P&gt;2) They are stored on a 6-node cluster.&lt;/P&gt;&lt;P&gt;3) I run Spark with the following parameters:&lt;/P&gt;&lt;PRE&gt;--executor-memory 13G --executor-cores 6 --num-executors 12 --driver-memory 1G --properties-file my-config.conf&lt;/PRE&gt;&lt;P&gt;That means each server will have 2 executors with 6 cores each.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;4) According to my config, the reduce phase has only 3 reducers.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So my question is: how many files will there be on &lt;STRONG&gt;each&lt;/STRONG&gt; server after the sort shuffle?&lt;/P&gt;&lt;P&gt;- 12, the number of concurrently active map tasks (2 executors x 6 cores)&lt;/P&gt;&lt;P&gt;- 2, the number of executors on each server&lt;/P&gt;&lt;P&gt;- 100, the number of partitions stored on this server (for simplicity, I just divide 600 by 6)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My second question: what is the name of the buffer that stores intermediate data on the map stage before it is spilled to disk?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 09:34:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Number-of-intermediate-files-with-Sort-shuffle-in-Spark/m-p/29771#M6650</guid>
      <dc:creator>fil</dc:creator>
      <dc:date>2022-09-16T09:34:43Z</dc:date>
    </item>
    <item>
      <title>Re: Number of intermediate files with Sort shuffle in Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Number-of-intermediate-files-with-Sort-shuffle-in-Spark/m-p/30657#M6651</link>
      <description>Hi,

As described in the sort-based shuffle design doc (&lt;A href="https://issues.apache.org/jira/secure/attachment/12655884/Sort-basedshuffledesign.pdf" target="_blank"&gt;https://issues.apache.org/jira/secure/attachment/12655884/Sort-basedshuffledesign.pdf&lt;/A&gt;), each map task should generate 1 shuffle data file + 1 index file. So the number of files on a server follows the number of map tasks that ran there, not the number of executors or reducers.
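
Applied to the numbers in the question, the one-data-file-plus-one-index-file rule can be worked through as below. This is only a sketch: it assumes the 600 map tasks are spread evenly over the 6 servers, and it counts only the final output files (temporary spill files that get merged away are ignored). The function name sort_shuffle_files is made up for illustration and is not a Spark API.

```python
def sort_shuffle_files(map_tasks, files_per_task=2):
    """Final shuffle files written by map_tasks map tasks (1 data + 1 index each)."""
    return map_tasks * files_per_task


total_partitions = 600  # one map task per partition
nodes = 6

# Assumption: map tasks are distributed evenly across the servers.
tasks_per_node = total_partitions // nodes

per_node = sort_shuffle_files(tasks_per_node)
cluster = sort_shuffle_files(total_partitions)

print(per_node, cluster)  # prints: 200 1200
```

Note that the 3 reducers never enter the calculation, since each map task writes a single data file covering all reduce partitions.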

Regarding your second question, the property to specify the buffer for shuffle data is "spark.shuffle.memoryFraction". This is discussed in more detail in the following Cloudera blog:

&lt;A href="http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/" target="_blank"&gt;http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/&lt;/A&gt;

Regards,
Bjorn</description>
      <pubDate>Tue, 11 Aug 2015 00:07:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Number-of-intermediate-files-with-Sort-shuffle-in-Spark/m-p/30657#M6651</guid>
      <dc:creator>bjorn.jonsson</dc:creator>
      <dc:date>2015-08-11T00:07:38Z</dc:date>
    </item>
  </channel>
</rss>

