How does a persisted RDD compute its data?

In Spark, an RDD can be persisted when it is used in multiple steps of the code. In general, an RDD holds a lineage graph and, because of lazy evaluation, is computed only when it is needed. If I want to persist the RDD, I can use the persist option to store its data. I believe that the persisted RDD data is stored on the node where it is computed. If that is the case, then all the RDD data resides on a single node. If I then use the persisted RDD elsewhere in the code, does it really use distributed computing (assuming all the data is stored on a single node)? Is my understanding right? If it is wrong, could someone help me understand?


The persisted RDD data is stored in memory, on disk, or both, according to the specified storage level. Persistence works per partition: every partition of the RDD is stored on the executor where it is computed, so an RDD whose partitions are computed on several executors is cached across those executors, not on one node. If all the partitions are on a single executor, then all the RDD data is cached on it. In that case, unless you cause a shuffle, subsequent operations are normally scheduled on that executor, because Spark prefers to run tasks where the cached data lives. A shuffle can be caused explicitly (using repartition, for instance) or implicitly (some operations, such as groupBy, trigger one).

In any case, you can check this from the Spark UI (port 4040 on the node where the driver is running; if you are using YARN, you can reach it from the ResourceManager UI through the "Application Master" link of your Spark application). The Executors tab shows where your data is stored, and the stage page of the relevant job shows whether subsequent operations all run on the same executor.