Member since: 08-13-2016
Posts: 9
Kudos Received: 2
Solutions: 0
09-07-2016
12:20 PM
I would also like to know how Spark decides the number of partitions for a DataFrame.
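For reference, my current understanding (a sketch only, not an authoritative answer): the initial partition count comes from the source (input splits for files, the underlying RDD's partitioning for a local collection), and after a shuffle it is governed by spark.sql.shuffle.partitions, which defaults to 200. A small way to inspect both:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("df-partition-inspection")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// A DataFrame built from a local collection keeps the partition count of the
// underlying RDD (sc.defaultParallelism unless specified otherwise).
val df = sc.parallelize(1 to 1000).toDF("id")
println(s"initial partitions: ${df.rdd.partitions.length}")

// After a shuffle (groupBy, join, ...), the partition count is driven by
// spark.sql.shuffle.partitions (default 200); here it is lowered to 50.
sqlContext.setConf("spark.sql.shuffle.partitions", "50")
val grouped = df.groupBy($"id" % 10).count()
println(s"post-shuffle partitions: ${grouped.rdd.partitions.length}")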
09-07-2016
12:17 PM
How can we specify the number of partitions while creating a Spark DataFrame? Using repartition we can change the number of partitions of an existing DataFrame, but there seems to be no option to specify it at creation time. When creating an RDD we can specify the number of partitions, and I would like to know the equivalent for a Spark DataFrame. Can anyone please assist me with this?
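For what it's worth, here is a sketch of the two workarounds I am aware of (names and sizes are placeholders): fix the partitioning on the underlying RDD before converting it to a DataFrame, or adjust it immediately after creation with repartition.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("df-partitions-at-creation")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Option 1: set the partition count on the RDD, then convert to a DataFrame.
// The resulting DataFrame keeps the RDD's 8 partitions.
val df1 = sc.parallelize(1 to 1000, 8).toDF("id")
println(df1.rdd.partitions.length)   // 8

// Option 2: create the DataFrame first, then repartition it right away.
val df2 = sqlContext.range(0, 1000).repartition(8)
println(df2.rdd.partitions.length)   // 8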
08-28-2016
05:42 PM
I have set up a Spark project in IntelliJ IDEA. If I execute the main method through IntelliJ just to print some text, it prints the text fine, meaning the class is found (no issue there). In the same class I then added the two statements below to initialize the Spark context:

val conf = new SparkConf().setAppName(appName).setMaster(sparkMaster)
val sc = new SparkContext(conf)

After this change, executing the same way (through IntelliJ) fails with the error below.

Error log:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
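For context, a minimal build.sbt sketch of the setup I would expect to work (the versions are placeholders, and the diagnosis is only an assumption since my build file is not shown here): if spark-core were missing or marked % "provided", IntelliJ would not put it on the run classpath, which is one common cause of this NoClassDefFoundError.

// build.sbt -- minimal sketch; adjust the Spark and Scala versions to your cluster
name := "spark-intellij-demo"

scalaVersion := "2.11.8"

// Default "compile" scope so IntelliJ's Run configuration puts spark-core on
// the classpath. If the dependency is marked % "provided" for packaging,
// running the main method directly from the IDE can fail with
// NoClassDefFoundError: org/apache/spark/SparkConf.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2"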
08-18-2016
03:26 AM
@Constantin So can I say that a Spark Standalone cluster is a good fit for smaller clusters (maybe fewer than 10 nodes), because resource-management performance degrades as the node count grows in Spark Standalone mode?
08-17-2016
12:44 PM
@Rahul I am asking about the use-case difference. I mean, when should we use Spark Standalone and when should we use Spark on YARN?
08-17-2016
03:28 AM
1 Kudo
Hi, can anyone please clarify my understanding of the use-case difference between a Spark Standalone cluster and Spark on YARN?

Spark Standalone cluster: if we do not have a huge volume of data to process and the number of nodes required is relatively small (fewer than about 10), then a Standalone cluster is a good fit.

Spark on YARN cluster: if we have a huge volume of data to process, need many more nodes, and therefore need a better cluster manager to manage them, then Spark on YARN is the better choice.

Also, can anyone please let me know the infrastructure specifications required for a Spark Standalone cluster? For example, for a 10-node Spark Standalone cluster, can we have just one reliable machine as the master node (running the cluster manager) and the remaining 9 machines as worker/slave nodes?
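For reference, my understanding of how the two modes differ in the application code itself (a sketch only; the host name is a placeholder, and on YARN the ResourceManager address normally comes from the Hadoop configuration rather than the code):

import org.apache.spark.SparkConf

// Standalone mode: connect directly to the Spark master process.
// "spark-master-host" is a placeholder for the standalone master's host name.
val standaloneConf = new SparkConf()
  .setAppName("standalone-example")
  .setMaster("spark://spark-master-host:7077")

// YARN mode: the master URL is simply "yarn"; the ResourceManager address is
// taken from HADOOP_CONF_DIR/YARN_CONF_DIR, not hard-coded here.
// (In Spark 1.x this was written as "yarn-client" or "yarn-cluster".)
val yarnConf = new SparkConf()
  .setAppName("yarn-example")
  .setMaster("yarn")

// Whichever conf is passed to new SparkContext(...) determines which
// cluster manager the application runs on.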
08-13-2016
06:28 PM
Here is my understanding of a case where HDFS is not required for Spark: if we are migrating structured data from a database like Oracle to a NoSQL database like Cassandra using a Spark/Spark SQL job, then we do not need any storage layer like HDFS. Please correct me if I am wrong. Thanks.
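As a rough sketch of the kind of job I mean (the JDBC URL, credentials, table and keyspace names are all placeholders, and it assumes the DataStax spark-cassandra-connector is on the classpath): the data flows from Oracle over JDBC straight into Cassandra without touching HDFS.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("oracle-to-cassandra-migration")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Read the source table from Oracle over JDBC -- no HDFS involved.
val sourceDf = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")   // placeholder
  .option("dbtable", "SCHEMA.CUSTOMERS")                        // placeholder
  .option("user", "spark_user")
  .option("password", "secret")
  .load()

// Write the rows to Cassandra via the spark-cassandra-connector data source.
sourceDf.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "migration", "table" -> "customers"))   // placeholders
  .save()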
08-13-2016
06:14 PM
1 Kudo
Hi, does Apache Spark in standalone mode need HDFS? If it is required, how does Spark use the HDFS block size during application execution?
I am trying to understand what role HDFS plays while a Spark application runs. The Spark documentation says that processing parallelism is controlled through RDD partitions and the executors/cores. Can anyone please help me understand this?
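To illustrate the parallelism part as I currently understand it (the HDFS path is a placeholder): when an RDD is created from a file on HDFS, the initial number of partitions usually follows the number of input splits (typically one per HDFS block), and the executors/cores then determine how many of those partitions run concurrently.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("partition-count-example")
val sc = new SparkContext(conf)

// Reading a file stored on HDFS: one partition per input split, which usually
// corresponds to one HDFS block (e.g. 128 MB each).
val lines = sc.textFile("hdfs:///data/events.log")   // placeholder path
println(s"partitions from HDFS blocks: ${lines.partitions.length}")

// A minimum number of partitions can be requested explicitly
// (Spark may still create more than this):
val moreParallel = sc.textFile("hdfs:///data/events.log", 64)
println(s"partitions with minPartitions = 64: ${moreParallel.partitions.length}")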