<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: How to tune Spark for parallel processing when loading small data files in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-tune-Spark-for-parallel-processing-when-loading-small/m-p/170799#M37386</link>
    <description>&lt;P&gt;First, take advantage of splittable storage formats where you can: bzip2 and indexed LZO are splittable, while Snappy is splittable only inside a container format such as SequenceFile or Parquet. If your data is splittable, instruct Spark to split it into multiple partitions on read. In Scala:&lt;/P&gt;&lt;PRE&gt;val file = sc.textFile(path, numPartitions)&lt;/PRE&gt;&lt;P&gt;You will also need to tune your YARN container sizes to match your executor allocation. Make sure the YARN maximum allocation ('yarn.scheduler.maximum-allocation-mb') is larger than what you request per executor, including the default memory overhead (at least 384 MB).&lt;/P&gt;&lt;P&gt;The following parameters allocate Spark executor and driver memory:&lt;/P&gt;&lt;PRE&gt;spark.executor.instances -- number of Spark executors
spark.executor.memory    -- memory per Spark executor (plus the overhead above)
spark.driver.memory      -- memory for the Spark driver&lt;/PRE&gt;&lt;P&gt;A 6 MB file is much smaller than the HDFS block size, so you are probably getting a single partition until you do something to repartition it. I would call one of these repartition methods on your DataFrame:&lt;/P&gt;&lt;PRE&gt;def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame
Returns a new DataFrame partitioned by the given partitioning expressions into numPartitions partitions. The resulting DataFrame is hash partitioned.
&lt;/PRE&gt;&lt;P&gt;Or this:&lt;/P&gt;&lt;PRE&gt;def repartition(numPartitions: Int): DataFrame
Returns a new DataFrame that has exactly numPartitions partitions.
&lt;/PRE&gt;</description>
    <pubDate>Wed, 10 Aug 2016 03:43:48 GMT</pubDate>
    <dc:creator>phargis</dc:creator>
    <dc:date>2016-08-10T03:43:48Z</dc:date>
    <item>
      <title>How to tune Spark for parallel processing when loading small data files</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-tune-Spark-for-parallel-processing-when-loading-small/m-p/170798#M37385</link>
      <description>&lt;P&gt;The issue is that the input data files to Spark are very small, about 6 MB (&amp;lt;100000 records). However, the required processing is computationally heavy and would benefit from running on multiple executors. Currently, all processing runs on a single executor, even when multiple executors are specified to spark-submit.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Aug 2016 03:28:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-tune-Spark-for-parallel-processing-when-loading-small/m-p/170798#M37385</guid>
      <dc:creator>phargis</dc:creator>
      <dc:date>2016-08-10T03:28:48Z</dc:date>
    </item>
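The symptom described above, all work landing on one executor, usually means the input produced a single partition, not that the executor request was ignored. As a point of reference, a minimal spark-submit invocation requesting several executors might look like the sketch below; the application name, memory sizes, and executor count are illustrative assumptions, not values from this thread.

```shell
# Illustrative spark-submit flags (app jar, sizes, and counts are assumptions).
# Each executor request (executor memory plus the YARN overhead) must fit
# under yarn.scheduler.maximum-allocation-mb on the cluster.
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-memory 2g \
  --driver-memory 1g \
  my_app.jar
```

Note that without enough input partitions, these extra executors will simply sit idle.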
    <item>
      <title>Re: How to tune Spark for parallel processing when loading small data files</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-tune-Spark-for-parallel-processing-when-loading-small/m-p/170799#M37386</link>
      <description>&lt;P&gt;First, take advantage of splittable storage formats where you can: bzip2 and indexed LZO are splittable, while Snappy is splittable only inside a container format such as SequenceFile or Parquet. If your data is splittable, instruct Spark to split it into multiple partitions on read. In Scala:&lt;/P&gt;&lt;PRE&gt;val file = sc.textFile(path, numPartitions)&lt;/PRE&gt;&lt;P&gt;You will also need to tune your YARN container sizes to match your executor allocation. Make sure the YARN maximum allocation ('yarn.scheduler.maximum-allocation-mb') is larger than what you request per executor, including the default memory overhead (at least 384 MB).&lt;/P&gt;&lt;P&gt;The following parameters allocate Spark executor and driver memory:&lt;/P&gt;&lt;PRE&gt;spark.executor.instances -- number of Spark executors
spark.executor.memory    -- memory per Spark executor (plus the overhead above)
spark.driver.memory      -- memory for the Spark driver&lt;/PRE&gt;&lt;P&gt;A 6 MB file is much smaller than the HDFS block size, so you are probably getting a single partition until you do something to repartition it. I would call one of these repartition methods on your DataFrame:&lt;/P&gt;&lt;PRE&gt;def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame
Returns a new DataFrame partitioned by the given partitioning expressions into numPartitions partitions. The resulting DataFrame is hash partitioned.
&lt;/PRE&gt;&lt;P&gt;Or this:&lt;/P&gt;&lt;PRE&gt;def repartition(numPartitions: Int): DataFrame
Returns a new DataFrame that has exactly numPartitions partitions.
&lt;/PRE&gt;</description>
      <pubDate>Wed, 10 Aug 2016 03:43:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-tune-Spark-for-parallel-processing-when-loading-small/m-p/170799#M37386</guid>
      <dc:creator>phargis</dc:creator>
      <dc:date>2016-08-10T03:43:48Z</dc:date>
    </item>
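The advice in the answer above can be sketched end to end in Scala. This is a minimal illustration, not code from the thread: the input path and partition counts are assumptions, and it needs a Spark distribution on the classpath to run.

```scala
import org.apache.spark.sql.SparkSession

object RepartitionDemo {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; on a cluster you would submit
    // with --master yarn and the executor settings discussed above.
    val spark = SparkSession.builder()
      .appName("repartition-demo")
      .master("local[4]")
      .getOrCreate()

    // Hint to split the small file into 8 partitions on read (RDD API).
    val rdd = spark.sparkContext.textFile("data/input.txt", 8)

    // Or, with the DataFrame API, force exactly 8 partitions so the heavy
    // per-record work can spread across cores/executors.
    val df = spark.read.text("data/input.txt").repartition(8)
    println(s"partitions = ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

The trade-off: `repartition` triggers a shuffle, but for a 6 MB input that cost is negligible next to the heavy per-record computation it parallelizes.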
  </channel>
</rss>

