I am trying to write a Spark application in Python. I know that most Spark transformations (such as map, distinct, etc.) run in parallel on RDDs, which are distributed across the cluster.
text_file = sc.textFile("source.txt")
# This code executes in parallel
lines = text_file.map(lambda line: line.split("|"))
If I write some Python code that does not use any Spark APIs, does it execute in parallel? For instance, suppose I do some manipulation using plain Python functions on the "lines" RDD.
Is there any way to figure out how RDDs are distributed, i.e., which node holds which partitions of an RDD?
When you called .textFile, an RDD was already created for you. A map is just a transformation that returns another RDD; nothing is computed yet.
You can look up the distinction between Spark transformations and actions.
Only when you issue an action (for example .collect()) are the distributed partitions processed in parallel: lines.collect()
You can see the distribution by calling .glom() on an RDD, which returns your data as a list of lists, one inner list per partition: lines.glom().collect()
If you don't create an RDD and instead use plain Python objects inside PySpark, that code won't be executed in parallel; it runs serially on the driver.
You can use sc.parallelize() to turn Python collections into RDDs.