Created 08-17-2016 03:28 AM
Hi,
Can anyone please clarify my understanding of the use-case difference between 'Spark Standalone' and 'Spark on YARN' clusters?
Spark Standalone Cluster:
If we do not have a huge volume of data to process, and the number of nodes required to process the data is fewer than about 10, then it is good to go with a Standalone cluster.
Spark on YARN Cluster:
If you have a huge volume of data to process, need a larger number of nodes, and therefore need a better cluster manager to manage those nodes, then it is good to go with a Spark on YARN cluster.
Also, can anyone please let me know the infrastructure specifications required for a 'Spark Standalone' cluster?
For example, in the case of 'Spark Standalone' with a 10-node Spark cluster:
Can we have just one reliable machine acting as the cluster manager (master node) and the remaining 9 machines as worker (slave) nodes?
Created 08-17-2016 04:31 AM
Spark Standalone mode is Spark’s own built-in clustered environment. The Standalone Master is the resource manager for the Spark Standalone cluster, and the Standalone Workers are the workers in that cluster. To install Spark in Standalone mode, you simply place a compiled version of Spark on each node in the cluster. You can launch the standalone cluster either manually, by starting the master and workers by hand, or by using the provided launch scripts.
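As a minimal sketch (the master hostname spark-master-host, the class com.example.MyApp, and my-app.jar are placeholders, not from this thread), bringing up a standalone cluster by hand looks roughly like this:

# On the machine chosen as master (default RPC port 7077, web UI on port 8080)
$SPARK_HOME/sbin/start-master.sh

# On each worker machine, register the worker with the master
$SPARK_HOME/sbin/start-slave.sh spark://spark-master-host:7077

# Submit an application against the standalone master
$SPARK_HOME/bin/spark-submit \
  --master spark://spark-master-host:7077 \
  --class com.example.MyApp \
  my-app.jar

(In Spark releases of that era the worker script is start-slave.sh; newer releases rename it to start-worker.sh. Listing the worker hosts in conf/slaves and running sbin/start-all.sh from the master achieves the same thing.)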
In most enterprises, you already have a Hadoop cluster running YARN and want to leverage it for resource management instead of additionally running Spark Standalone mode. When using YARN, a Spark application's executors (and, in cluster deploy mode, its driver) run inside YARN containers.
Irrespective of the deployment mode, a Spark application will consume the same resources it requires to process the data. In the case of YARN, you have to be aware of what other workloads (MapReduce, Tez, etc.) will be running on the cluster at the same time the Spark application is executing, and size your machines accordingly.
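For example, on YARN the footprint is requested explicitly at submit time, so it can be sized around those other workloads. A hedged sketch (the executor counts, memory sizes, class, and jar are illustrative only):

# Ask YARN for 10 executors of 4 cores / 8 GB each; in cluster deploy mode the driver runs in a YARN container too
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --class com.example.MyApp \
  my-app.jar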
Created 08-17-2016 12:44 PM
@Rahul I am asking about the use-case difference, i.e. when to use 'Spark Standalone' and when to use 'Spark on YARN'?
Created 08-18-2016 02:49 AM
Use Spark Standalone if you are a Spark-only shop and you don't care about resource contention with other services from the Hadoop ecosystem; Spark uses all the resources of your cluster.
If Spark is part of a Hortonworks Data Platform deployment and shares resources such as HDFS with other services, use Spark on YARN. That will allow you to allocate proper resources to Spark, avoid resource contention with other services, and meet your SLAs.
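One common way to do that is to carve out a dedicated CapacityScheduler queue for Spark in YARN. A minimal sketch of capacity-scheduler.xml, with queue names and percentages that are purely illustrative:

<!-- capacity-scheduler.xml: give Spark its own queue with a guaranteed share -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,spark</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.spark.capacity</name>
  <value>40</value>  <!-- 40% of cluster resources guaranteed to the spark queue -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>60</value>  <!-- sibling queue capacities under root must sum to 100 -->
</property>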
I hope this answer helps.
Created 08-18-2016 03:26 AM
@Constantin
So can I say that a Spark Standalone cluster is good for a smaller number of nodes (maybe fewer than 10), because resource-management performance decreases as the node count increases in Spark Standalone mode?
Created 08-18-2016 07:49 PM
There is no demonstrated correlation to support that statement. The number of nodes does not matter; what matters more is how resources are used. You can say that in a complex environment where multiple applications and users access resources and SLAs are important (jobs need to complete by a given time, users expect a response time under x seconds, etc.), a resource manager is a must. As such, running Spark over YARN just makes sense; it is more dependable in an environment where resources are used competitively.
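In that kind of shared environment, the application would typically be pointed at its queue and allowed to scale its executor count with demand. A hedged sketch, assuming the illustrative 'spark' queue above and made-up executor bounds (dynamic allocation also requires the Spark external shuffle service to be set up on the NodeManagers):

# Submit to the dedicated queue; YARN grows and shrinks the executor count between the bounds as demand changes
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue spark \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --class com.example.MyApp \
  my-app.jar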
Created 09-21-2016 02:28 PM
If the response was helpful, please vote and accept it as the best answer.