<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark performance parameter num-executors has no effect performance impact... in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148080#M35892</link>
    <description>&lt;P&gt;Thanks
&lt;A rel="user" href="https://community.cloudera.com/users/10295/mkumar13.html" nodeid="10295"&gt;@Mukesh Kumar&lt;/A&gt; for adding your answer.
However, I would like to add a note on the general aspects that can be looked at for improving performance.&lt;/P&gt;&lt;P&gt;As you rightly said, &lt;STRONG&gt;Parallelism&lt;/STRONG&gt; is one important aspect.&lt;/P&gt;&lt;P&gt;We can adjust parallelism in code by increasing or decreasing the number of partitions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;If you have a lot of free cores for your application and comparatively few partitions, you could increase the number of partitions.&lt;/LI&gt;&lt;LI&gt;If you derive an RDD that holds only a small fraction of its parent RDD's data, the child may gain nothing from keeping the parent's partition count, so you can try reducing the number of partitions.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Other aspects that you can consider are
&lt;STRONG&gt;Data Locality&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;This refers to how close the data and the code that processes it are.&lt;/P&gt;&lt;P&gt;There could be situations where there are no free CPU cycles to start a task locally – Spark can decide to either&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;wait – no data movement required, or&lt;/LI&gt;&lt;LI&gt;move over to a free CPU and start the task there – the data needs to be moved.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The wait time can be configured via the &lt;EM&gt;spark.locality.wait*&lt;/EM&gt; properties. Based on the application, we can decide whether waiting longer saves us more time than shuffling the data across the network.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Data Serialization&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;There are situations where the framework may need to ship data over the network or persist it.
In such scenarios, the objects are serialized. Java serialization is used by default; however, serialization frameworks
like Kryo have shown better results than the default. The serializer can be set via &lt;EM&gt;spark.serializer&lt;/EM&gt;.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Memory Management&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Another important aspect, which you have already touched on, is memory management.
If the application does not need to persist/cache data, you could try reducing the storage memory fraction and thereby increasing the execution memory.
This works because execution memory can evict storage memory down to the configured storage threshold, while the reverse is not true.
We can adjust these values by changing &lt;EM&gt;spark.memory.fraction&lt;/EM&gt; and &lt;EM&gt;spark.memory.storageFraction&lt;/EM&gt;.&lt;/P&gt;&lt;P&gt;On a side note, all Java &lt;EM&gt;&lt;STRONG&gt;GC tuning&lt;/STRONG&gt;&lt;/EM&gt; methods can be applied to Spark applications as well.
We can collect GC statistics using the Java options &lt;EM&gt;-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps&lt;/EM&gt;.&lt;/P&gt;&lt;P&gt;Also note that serialization helps reduce the GC overhead of managing a large number of small objects.&lt;/P&gt;&lt;P&gt;Finally, and most importantly, we could look at our &lt;STRONG&gt;code&lt;/STRONG&gt;. Some points to consider:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Carry forward only data that is worth keeping,
i.e. consider filtering, schema validation on structured datasets, etc. upfront, before propagating them to downstream logic/aggregations.&lt;/LI&gt;&lt;LI&gt;Know the data size – think before calling collect; use a sample/subset for debugging/testing.&lt;/LI&gt;&lt;LI&gt;Target the low-hanging fruit – consider reduceByKey over groupByKey, which helps avoid a lot of data shuffling over the network.&lt;/LI&gt;&lt;LI&gt;Consider using broadcast variables for caching large read-only variables.&lt;/LI&gt;&lt;/UL&gt;</description>
    <pubDate>Wed, 27 Jul 2016 01:49:41 GMT</pubDate>
    <dc:creator>arunak</dc:creator>
    <dc:date>2016-07-27T01:49:41Z</dc:date>
    <item>
      <title>Spark performance parameter num-executors has no effect performance impact...</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148077#M35889</link>
      <description>&lt;P&gt;I have an 8-node Amazon cluster and I am trying to optimize my Spark job, but I am unable to bring program execution below 15 minutes.&lt;/P&gt;&lt;P&gt;I have tried executing my Spark job with different memory parameters, but they are not accepted and it always executes with 16 executors, even when I supply 21 or 33.&lt;/P&gt;&lt;P&gt;Please help me understand the possible reasons; my command is below:&lt;/P&gt;&lt;P&gt;nohup hadoop jar
/var/lib/aws/emr/myjar.jar spark-submit
--deploy-mode cluster &lt;STRONG&gt;--num-executors 17 --executor-cores 5 --driver-cores 2
--driver-memory 4g&lt;/STRONG&gt;  --class
class_name s3:validator.jar
-e runtime -v true -t true -r true &amp;amp;&lt;/P&gt;&lt;P&gt;Observation: when I pass 3 executors, it defaults to 4 and execution is longer, but the other parameters have no effect.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jul 2016 15:58:32 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148077#M35889</guid>
      <dc:creator>mkumar13</dc:creator>
      <dc:date>2016-07-26T15:58:32Z</dc:date>
    </item>
    <item>
      <title>Re: Spark performance parameter num-executors has no effect performance impact...</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148078#M35890</link>
      <description>&lt;P&gt;I think applying different
memory parameter sizes relative to the file size is the best we can do to
optimize Spark performance, unless the underlying program has already been tuned.&lt;/P&gt;&lt;P&gt;I don't know the exact operations my team is performing in the program, but I have suggested they verify the following:&lt;/P&gt;&lt;P&gt;We can set parallelism at the RDD level like this:&lt;/P&gt;&lt;P&gt;val rdd
= sc.textFile("somefile", 8)&lt;/P&gt;&lt;P&gt;A second major factor in
performance is security: wire encryption has roughly 2x overhead, and
data encryption (Ranger KMS) can cause 15 to 20% overhead.&lt;/P&gt;&lt;P&gt;Note: Kerberos has no impact.&lt;/P&gt;&lt;P&gt;Another parameter to look at
is the default queue for your spark-submit job; if it is going to the
default queue, override it to a more specialized queue with the
parameter below:&lt;/P&gt;&lt;P&gt;--queue &amp;lt;if you have queue's
setup&amp;gt;&lt;/P&gt;&lt;P&gt;Please let me know if there is anything else we can check to gain performance.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jul 2016 18:25:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148078#M35890</guid>
      <dc:creator>mkumar13</dc:creator>
      <dc:date>2016-07-26T18:25:04Z</dc:date>
    </item>
    <item>
      <title>Re: Spark performance parameter num-executors has no effect performance impact...</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148079#M35891</link>
      <description>&lt;P&gt;I just received feedback from the developers that, using the above approach, they are able to utilize 61 of the
64 virtual cores.&lt;/P&gt;&lt;P&gt;But performance is still the bottleneck – the file still takes the same time to process. Does anybody have an idea of what is going wrong?&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jul 2016 21:36:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148079#M35891</guid>
      <dc:creator>mkumar13</dc:creator>
      <dc:date>2016-07-26T21:36:09Z</dc:date>
    </item>
    <item>
      <title>Re: Spark performance parameter num-executors has no effect performance impact...</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148080#M35892</link>
      <description>&lt;P&gt;Thanks
&lt;A rel="user" href="https://community.cloudera.com/users/10295/mkumar13.html" nodeid="10295"&gt;@Mukesh Kumar&lt;/A&gt; for adding your answer.
However, I would like to add a note on the general aspects that can be looked at for improving performance.&lt;/P&gt;&lt;P&gt;As you rightly said, &lt;STRONG&gt;Parallelism&lt;/STRONG&gt; is one important aspect.&lt;/P&gt;&lt;P&gt;We can adjust parallelism in code by increasing or decreasing the number of partitions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;If you have a lot of free cores for your application and comparatively few partitions, you could increase the number of partitions.&lt;/LI&gt;&lt;LI&gt;If you derive an RDD that holds only a small fraction of its parent RDD's data, the child may gain nothing from keeping the parent's partition count, so you can try reducing the number of partitions.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Other aspects that you can consider are
&lt;STRONG&gt;Data Locality&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;This refers to how close the data and the code that processes it are.&lt;/P&gt;&lt;P&gt;There could be situations where there are no free CPU cycles to start a task locally – Spark can decide to either&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;wait – no data movement required, or&lt;/LI&gt;&lt;LI&gt;move over to a free CPU and start the task there – the data needs to be moved.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The wait time can be configured via the &lt;EM&gt;spark.locality.wait*&lt;/EM&gt; properties. Based on the application, we can decide whether waiting longer saves us more time than shuffling the data across the network.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Data Serialization&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;There are situations where the framework may need to ship data over the network or persist it.
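&lt;/P&gt;&lt;P&gt;As a hedged sketch (the property and class names below are the standard Spark ones; &lt;EM&gt;MyRecord&lt;/EM&gt; is a hypothetical placeholder for your own classes), Kryo serialization could be enabled like this:&lt;/P&gt;&lt;PRE&gt;// Switch the serializer from the Java default to Kryo
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // optionally register your classes to shrink the serialized output
  .registerKryoClasses(Array(classOf[MyRecord]))&lt;/PRE&gt;&lt;P&gt;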
In such scenarios, the objects are serialized. Java serialization is used by default; however, serialization frameworks
like Kryo have shown better results than the default. The serializer can be set via &lt;EM&gt;spark.serializer&lt;/EM&gt;.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Memory Management&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Another important aspect, which you have already touched on, is memory management.
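&lt;/P&gt;&lt;P&gt;As a rough sketch (the two property names are the standard unified-memory settings; the values here are arbitrary assumptions, not recommendations), the fractions discussed below can be set at submit time:&lt;/P&gt;&lt;PRE&gt;spark-submit \
  --conf spark.memory.fraction=0.6 \
  --conf spark.memory.storageFraction=0.3 \
  ...&lt;/PRE&gt;&lt;P&gt;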
If the application does not need to persist/cache data, you could try reducing the storage memory fraction and thereby increasing the execution memory.
This works because execution memory can evict storage memory down to the configured storage threshold, while the reverse is not true.
We can adjust these values by changing &lt;EM&gt;spark.memory.fraction&lt;/EM&gt; and &lt;EM&gt;spark.memory.storageFraction&lt;/EM&gt;.&lt;/P&gt;&lt;P&gt;On a side note, all Java &lt;EM&gt;&lt;STRONG&gt;GC tuning&lt;/STRONG&gt;&lt;/EM&gt; methods can be applied to Spark applications as well.
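&lt;/P&gt;&lt;P&gt;For instance (a sketch; &lt;EM&gt;spark.executor.extraJavaOptions&lt;/EM&gt; is the standard property for passing JVM flags to executors), GC logging flags can be supplied to the executors like this:&lt;/P&gt;&lt;PRE&gt;spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  ...&lt;/PRE&gt;&lt;P&gt;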
We can collect GC statistics using the Java options &lt;EM&gt;-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps&lt;/EM&gt;.&lt;/P&gt;&lt;P&gt;Also note that serialization helps reduce the GC overhead of managing a large number of small objects.&lt;/P&gt;&lt;P&gt;Finally, and most importantly, we could look at our &lt;STRONG&gt;code&lt;/STRONG&gt;. Some points to consider:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Carry forward only data that is worth keeping,
i.e. consider filtering, schema validation on structured datasets, etc. upfront, before propagating them to downstream logic/aggregations.&lt;/LI&gt;&lt;LI&gt;Know the data size – think before calling collect; use a sample/subset for debugging/testing.&lt;/LI&gt;&lt;LI&gt;Target the low-hanging fruit – consider reduceByKey over groupByKey, which helps avoid a lot of data shuffling over the network.&lt;/LI&gt;&lt;LI&gt;Consider using broadcast variables for caching large read-only variables.&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Wed, 27 Jul 2016 01:49:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148080#M35892</guid>
      <dc:creator>arunak</dc:creator>
      <dc:date>2016-07-27T01:49:41Z</dc:date>
    </item>
    <item>
      <title>Re: Spark performance parameter num-executors has no effect performance impact...</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148081#M35893</link>
      <description>&lt;P&gt;Just wanted to know if you have dynamic allocation enabled and, if so, what the values for initial and max executors are. Could I also ask how many &lt;STRONG&gt;nodes&lt;/STRONG&gt; you have, and how many &lt;STRONG&gt;cores&lt;/STRONG&gt; and how much &lt;STRONG&gt;RAM&lt;/STRONG&gt; per node?&lt;/P&gt;</description>
      <pubDate>Wed, 27 Jul 2016 02:07:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148081#M35893</guid>
      <dc:creator>arunak</dc:creator>
      <dc:date>2016-07-27T02:07:57Z</dc:date>
    </item>
    <item>
      <title>Re: Spark performance parameter num-executors has no effect performance impact...</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148082#M35894</link>
      <description>&lt;P&gt;Thanks &lt;A rel="user" href="https://community.cloudera.com/users/10529/akeezhadath.html" nodeid="10529"&gt;@Arun A K&lt;/A&gt;, I'll verify the suggestions on my test case and let you know if I make progress.&lt;/P&gt;</description>
      <pubDate>Wed, 27 Jul 2016 17:18:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148082#M35894</guid>
      <dc:creator>mkumar13</dc:creator>
      <dc:date>2016-07-27T17:18:51Z</dc:date>
    </item>
    <item>
      <title>Re: Spark performance parameter num-executors has no effect performance impact...</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148083#M35895</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/10529/akeezhadath.html" nodeid="10529"&gt;@Arun A K&lt;/A&gt;, we have observed that most of the time is consumed when we write data downstream to Cassandra, as a single node is serving the Cassandra cluster. We are now planning to create multiple Cassandra nodes inside the Hadoop cluster for faster writes. I'll keep you updated on progress.&lt;/P&gt;</description>
      <pubDate>Wed, 27 Jul 2016 17:39:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148083#M35895</guid>
      <dc:creator>mkumar13</dc:creator>
      <dc:date>2016-07-27T17:39:51Z</dc:date>
    </item>
    <item>
      <title>Re: Spark performance parameter num-executors has no effect performance impact...</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148084#M35896</link>
      <description>&lt;P&gt;Thanks for the update &lt;A rel="user" href="https://community.cloudera.com/users/10295/mkumar13.html" nodeid="10295"&gt;@Mukesh Kumar&lt;/A&gt;. Is it worth doing a 1-1 write or do you want to explore the BulkLoad option in Cassandra? &lt;/P&gt;</description>
      <pubDate>Wed, 27 Jul 2016 20:17:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-performance-parameter-num-executors-has-no-effect/m-p/148084#M35896</guid>
      <dc:creator>arunak</dc:creator>
      <dc:date>2016-07-27T20:17:41Z</dc:date>
    </item>
  </channel>
</rss>

