<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark Deployment and Hardware Provisioning in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Deployment-and-Hardware-Provisioning/m-p/98264#M11730</link>
    <description>&lt;P&gt;Just to add: the Spark History Server needs a machine to run on, either a dedicated one or one shared with other master components such as the MR History Server.&lt;/P&gt;</description>
    <pubDate>Wed, 16 Dec 2015 07:35:16 GMT</pubDate>
    <dc:creator>pardeep_kumar</dc:creator>
    <dc:date>2015-12-16T07:35:16Z</dc:date>
    <item>
      <title>Spark Deployment and Hardware Provisioning</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Deployment-and-Hardware-Provisioning/m-p/98259#M11725</link>
      <description>&lt;P&gt;Is it a good idea to dedicate separate nodes to Spark and apply node labels to them? I'm not sure whether this is best practice.

  I can see why, since according to &lt;A href="http://spark.apache.org/docs/latest/hardware-provisioning.html"&gt;http://spark.apache.org/docs/latest/hardware-prov...&lt;/A&gt; Spark should be allocated at most 75% of a machine's memory. But with YARN this may not be needed, right?&lt;/P&gt;</description>
      <pubDate>Wed, 09 Dec 2015 21:38:00 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Deployment-and-Hardware-Provisioning/m-p/98259#M11725</guid>
      <dc:creator>amcbarnett</dc:creator>
      <dc:date>2015-12-09T21:38:00Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Deployment and Hardware Provisioning</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Deployment-and-Hardware-Provisioning/m-p/98260#M11726</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/369/amcbarnett.html" nodeid="369"&gt;@Ancil McBarnett&lt;/A&gt;&lt;P&gt;If the cluster is going to be heavily used for Spark, then it is definitely a good idea to allocate dedicated resources to the Spark components. Also, make sure that the Spark client has enough memory.&lt;/P&gt;</description>
      <pubDate>Wed, 09 Dec 2015 23:00:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Deployment-and-Hardware-Provisioning/m-p/98260#M11726</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2015-12-09T23:00:25Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Deployment and Hardware Provisioning</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Deployment-and-Hardware-Provisioning/m-p/98261#M11727</link>
      <description>&lt;P&gt;If you are running Spark applications on a YARN cluster, you do not need to directly allocate memory or machines to Spark.&lt;/P&gt;&lt;P&gt;You can dedicate machines via labels, either for exclusive workloads or to handle heterogeneous hardware better. If there is an application where latency and the ability to respond immediately to spikes in load matter, dedicated labels work; HBase in interactive applications is one example. If different parts of the cluster have different hardware configurations (for example RAM, GPU, or SSD for local storage), labels help you schedule jobs that need those features so they run only on those machines.&lt;/P&gt;&lt;P&gt;Once you start using labels, the labelled hosts will be underutilized whenever that specific work isn't running: that is the permanent tradeoff.&lt;/P&gt;&lt;P&gt;If you are just running queries on a cluster where latency isn't so critical that you want to pre-allocate capacity on isolated machines, then using queues is more efficient.&lt;/P&gt;&lt;P&gt;You can also set up queue priorities and pre-emption, so your important Spark queries can actually pre-empt (i.e. kill) ongoing work from lower-priority applications.
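&lt;/P&gt;&lt;P&gt;As a rough sketch of the queue approach, the Python snippet below only assembles a spark-submit command line that targets a named YARN queue. The queue name "analytics", the script "job.py", and the helper build_spark_submit are hypothetical; --master, --deploy-mode and --queue are standard spark-submit options, and the queue itself is assumed to already exist in the scheduler configuration.&lt;/P&gt;&lt;PRE&gt;
```python
def build_spark_submit(app, queue, deploy_mode="cluster"):
    """Return a spark-submit argument list that targets one YARN queue.

    `app` and `queue` are illustrative values supplied by the caller;
    this helper does not check that the queue exists on the cluster.
    """
    return [
        "spark-submit",
        "--master", "yarn",            # let YARN place the containers
        "--deploy-mode", deploy_mode,  # "cluster" or "client"
        "--queue", queue,              # YARN queue the job is charged to
        app,
    ]


# Hypothetical queue and application names, for illustration only.
cmd = build_spark_submit("job.py", queue="analytics")
print(" ".join(cmd))
```
&lt;/PRE&gt;&lt;P&gt;Nothing here reserves machines: the scheduler decides where the containers run, and the queue's capacity and pre-emption settings decide how much of the cluster the job can claim.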
&lt;/P&gt;&lt;P&gt;What is important for Spark is having your jobs ask for the memory they really need: Spark likes a lot of memory, and if the Spark JVM or Python code consumes more than was allocated in its YARN container request, the process may get killed.&lt;/P&gt;</description>
      <pubDate>Sun, 13 Dec 2015 23:35:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Deployment-and-Hardware-Provisioning/m-p/98261#M11727</guid>
      <dc:creator>stevel</dc:creator>
      <dc:date>2015-12-13T23:35:55Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Deployment and Hardware Provisioning</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Deployment-and-Hardware-Provisioning/m-p/98262#M11728</link>
      <description>&lt;P&gt;I must disagree. Dedicating machines via labels is not always the right choice. Imagine you give 20 nodes in a 100-node cluster the label "spark", with only spark-queue work able to run on them. When there's no work in that queue, the machines are idle. When there is work in the queue, it will only run on those 20 nodes.&lt;/P&gt;&lt;P&gt;There's also replication and data locality to consider: if the data you need isn't on one of those 20 nodes, it will be read remotely, which can also hurt performance.&lt;/P&gt;&lt;P&gt;You really need to look at the cluster and workload to make a good choice.&lt;/P&gt;</description>
      <pubDate>Sun, 13 Dec 2015 23:41:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Deployment-and-Hardware-Provisioning/m-p/98262#M11728</guid>
      <dc:creator>stevel</dc:creator>
      <dc:date>2015-12-13T23:41:41Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Deployment and Hardware Provisioning</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Deployment-and-Hardware-Provisioning/m-p/98263#M11729</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/264/stevel.html" nodeid="264"&gt;@stevel&lt;/A&gt; I agree with you and that's why I did not mention labeling. &lt;/P&gt;</description>
      <pubDate>Mon, 14 Dec 2015 00:17:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Deployment-and-Hardware-Provisioning/m-p/98263#M11729</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2015-12-14T00:17:22Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Deployment and Hardware Provisioning</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Deployment-and-Hardware-Provisioning/m-p/98264#M11730</link>
      <description>&lt;P&gt;Just to add: the Spark History Server needs a machine to run on, either a dedicated one or one shared with other master components such as the MR History Server.&lt;/P&gt;</description>
      <pubDate>Wed, 16 Dec 2015 07:35:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Deployment-and-Hardware-Provisioning/m-p/98264#M11730</guid>
      <dc:creator>pardeep_kumar</dc:creator>
      <dc:date>2015-12-16T07:35:16Z</dc:date>
    </item>
  </channel>
</rss>

