Member since 08-13-2016 · 9 Posts · 2 Kudos Received · 0 Solutions
08-18-2016
03:26 AM
@Constantin So can I say a Spark Standalone cluster is a good fit for clusters with a small number of nodes (maybe fewer than 10), because resource-management performance degrades as the node count grows under the Standalone cluster manager?
08-17-2016
12:44 PM
@Rahul I am asking about the use-case difference, i.e., when to use Spark Standalone and when to use Spark on YARN?
08-17-2016
03:28 AM
1 Kudo
Hi, can anyone please check my understanding of the use-case difference between Spark Standalone and Spark on YARN clusters?

Spark Standalone cluster: if we do not have a huge volume of data to process, and the number of nodes needed is small (say, fewer than 10), then a Standalone cluster is a good choice.

Spark on YARN cluster: if we have a huge volume of data to process and therefore need many more nodes, we need a stronger cluster manager to manage them, so Spark on YARN is the better choice.

Also, can anyone tell me the infrastructure specification required for a Spark Standalone cluster? For example, in a 10-node Spark Standalone cluster, can we use just one reliable machine as the master node running the cluster manager, and the remaining 9 machines as worker (slave) nodes?
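In case it helps, the cluster-manager choice shows up in application code only as the master URL. A minimal Scala sketch, assuming the Spark 2.x session API; the host name below is a placeholder, not something from this thread:

```scala
import org.apache.spark.sql.SparkSession

object ClusterManagerDemo {
  // Build a session against a given cluster manager; only the master URL differs.
  def session(masterUrl: String): SparkSession =
    SparkSession.builder()
      .appName("cluster-manager-demo")
      .master(masterUrl)
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    // Standalone: point at the standalone master's spark:// URL
    // ("master-host" is a placeholder).
    val spark = session("spark://master-host:7077")

    // On YARN you would instead pass "yarn"; the ResourceManager is located
    // via the Hadoop configuration on the classpath (HADOOP_CONF_DIR).
    // val spark = session("yarn")

    spark.stop()
  }
}
```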
Labels:
- Apache Spark
08-13-2016
06:28 PM
Here is my understanding of when HDFS is not required for Spark: if we are migrating structured data from a database such as Oracle to a NoSQL database such as Cassandra using a Spark/Spark SQL job, then we do not need any storage layer like HDFS. Please correct me if I am wrong. Thanks.
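As a rough illustration of that pipeline, a Scala sketch assuming the Oracle JDBC driver and the DataStax spark-cassandra-connector are on the classpath; all connection details, table, and keyspace names below are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object OracleToCassandra {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("oracle-to-cassandra").getOrCreate()

    // Read the structured source table over JDBC -- no HDFS involved.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL") // placeholder URL
      .option("dbtable", "CUSTOMERS")                         // placeholder table
      .option("user", "scott")
      .option("password", "tiger")
      .load()

    // Write straight to Cassandra; rows flow through executor memory
    // without ever landing on HDFS.
    df.write.format("org.apache.spark.sql.cassandra")
      .option("keyspace", "crm")     // placeholder keyspace
      .option("table", "customers")  // placeholder table
      .mode("append")
      .save()

    spark.stop()
  }
}
```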
08-13-2016
06:14 PM
1 Kudo
Hi, does Apache Spark Standalone need HDFS? If it is required, how does Spark use the HDFS block size during application execution?
I am trying to understand what role HDFS plays while a Spark application runs. The Spark documentation says that processing parallelism is controlled through RDD partitions and the executors/cores. Can anyone please help me understand?
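To illustrate the parallelism point, a small Scala sketch (the file paths are placeholders): when reading a splittable file from HDFS, Spark by default creates one partition per HDFS block, and you can widen parallelism with a minimum partition count or repartition regardless of the storage layer:

```scala
import org.apache.spark.sql.SparkSession

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-demo").getOrCreate()
    val sc = spark.sparkContext

    // Reading from HDFS: one partition per HDFS block by default, so a
    // 1 GB file with 128 MB blocks yields about 8 partitions.
    val fromHdfs = sc.textFile("hdfs:///data/input.txt")
    println(s"partitions from HDFS blocks: ${fromHdfs.getNumPartitions}")

    // Reading from a non-HDFS source: request a minimum partition count
    // explicitly, since there are no blocks to derive it from.
    val fromLocal = sc.textFile("file:///tmp/input.txt", 16)

    // Or reshape the parallelism after the fact.
    val widened = fromHdfs.repartition(32)
    println(s"after repartition: ${widened.getNumPartitions}")

    spark.stop()
  }
}
```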
Labels:
- Apache Hadoop
- Apache Spark