Hi all,
I am creating the SparkContext in the driver. How does the executor get the SparkContext?
If anyone can share a link on this topic, it would help me understand the system much better.
The driver program is the process that runs the main() function of the application and creates the SparkContext. The cluster manager then acquires resources on the cluster. After this, an executor process is launched on the resources acquired by the cluster manager. The tasks are then sent to the individual executors for execution.
Well, maybe my question is misleading you; let me elaborate.
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
A simple word-count problem. This piece of code is given to the driver program, which creates the DAG and stages and hands tasks to the respective worker nodes, where the actual operations happen.
Now, let's look at the first line of the program. An RDD is generated from the file (the SparkContext implements the textFile() function, which generates an RDD from a file). The file resides on the worker nodes, and we need to get the RDD out of the worker nodes.
In order to achieve that, the worker node (or executor) needs to have the SparkContext, doesn't it?
My question is: how does the executor get the SparkContext?
The driver will create the SparkContext. The driver can be the spark-shell, Zeppelin, or a standalone Spark application (see http://spark.apache.org/docs/1.6.2/quick-start.html#self-contained-applications to learn how to create a SparkContext in an application).
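For reference, a minimal self-contained application (a sketch in the Spark 1.x Scala API that the linked quick-start uses; the object name is illustrative, and the "hdfs://..." paths are placeholders you'd fill in) that creates its own SparkContext in the driver could look like this:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // The driver builds the configuration and creates the SparkContext.
    // The master (set here or via spark-submit) decides where executors run.
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    // textFile() returns an RDD whose partitions live on the executors;
    // the transformations below are shipped to the executors as tasks.
    val counts = sc.textFile("hdfs://...")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs://...")
    sc.stop()
  }
}
```

Note that the SparkContext object itself only ever lives in this driver process; the executors just receive serialized tasks (the closures passed to flatMap, map, etc.).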
To distribute the execution you need to choose "yarn client" or "yarn cluster" mode (not local, which is the default), for example:
spark-shell --master yarn --deploy-mode client --num-executors 3
This will create a driver with a SparkContext that controls 3 executors (they could be on 1, 2, or 3 machines; check "ps ax | grep Coarse").
When you now call sc.textFile(...), Spark creates an RDD by loading partitions of the file on each executor (hence the RDD is already partitioned afterwards).
Every further command will then run distributed across those executors.
So it is not you who brings the SparkContext to the executors; rather, the SparkContext is used by the driver to distribute the load across all started executors.
That's why I linked the Spark docs above. You first need to understand the Spark cluster modes. If you just start "spark-shell", it will run in local mode and not be distributed. Only in "yarn client" and "yarn cluster" mode will it be distributed.
Once a distributed SparkContext has been created in the driver, there is no need to care about it any more: just use the context, and Spark will distribute your load.
Spark Standalone: http://spark.apache.org/docs/latest/spark-standalone.html
Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
You are looking at it the wrong way. The SparkContext is the main entry point into Spark: it is the connection to a Spark cluster and can be used to create RDDs, accumulators, etc. on that cluster. You can run in both cluster and local mode, and if you have to choose one, you define that in the SparkContext. The workers don't get the SparkContext per se; rather, if you package your program into a jar, the cluster manager is responsible for copying the jar file to the workers before it allocates tasks.
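That packaging-and-shipping flow can be sketched as the following commands (a sketch only; the build tool, class name, and jar path are illustrative and depend on your project):

```
# Package the application into a jar (here assuming an sbt project).
sbt package

# Submit it: spark-submit uploads the jar, the cluster manager starts
# executors, and the driver's SparkContext schedules tasks on them.
spark-submit \
  --class WordCount \
  --master yarn \
  --deploy-mode client \
  --num-executors 3 \
  target/scala-2.10/wordcount_2.10-1.0.jar
```

The executors never receive the SparkContext object itself, only the application jar and the serialized tasks the driver sends them.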