How does Spark work?

Expert Contributor
  1. If I create 10 RDDs in my pySpark shell, does that mean the data of all 10 RDDs will reside in Spark memory?
  2. If I do not delete an RDD, will it stay in memory forever?
  3. If my dataset size exceeds the available RAM, where will the data be stored?
3 REPLIES

@heta desai

1. Yes. RDDs use lazy evaluation, which means an RDD is only loaded into memory when an action is performed on it. If you have 10 RDDs with actions performed on them, they will all be loaded into Spark memory.
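
For example, here is a minimal sketch of lazy evaluation in the pyspark shell (sc is the shell's built-in SparkContext; the file path is hypothetical):

    # Transformations alone do not load anything into memory.
    rdd = sc.textFile("/tmp/example.txt")            # hypothetical path
    words = rdd.flatMap(lambda line: line.split())

    # Nothing is materialized until an action runs.
    words.cache()           # mark the RDD to be kept in memory
    print(words.count())    # this action triggers loading and caching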

2. No, Spark will remove data from memory if it is no longer used. However, if you want to force a removal/purge, you can use sqlContext.uncacheTable() or RDD.unpersist().
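
A small sketch of such a manual purge, assuming the pyspark shell where sc and sqlContext are already defined ("my_table" is a hypothetical table name):

    rdd = sc.parallelize(range(1000000)).cache()
    rdd.count()          # the action fills the cache
    rdd.unpersist()      # remove the cached partitions from memory

    sqlContext.cacheTable("my_table")      # hypothetical cached table
    sqlContext.uncacheTable("my_table")    # drop it from the in-memory cache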

3. Spark will load what it can into memory; the rest will remain on disk. The downside is that the partitions left on disk have to be re-read and recalculated every time an action touches them. Those computations will be slower because of that, but the jobs/actions will not fail.
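
If you would rather have Spark spill the partitions that do not fit to local disk (and re-read them instead of recomputing them), you can choose a storage level explicitly. A minimal sketch, with a hypothetical input path:

    from pyspark import StorageLevel

    rdd = sc.textFile("/data/large_dataset")        # hypothetical path
    # Partitions that do not fit in memory are written to local disk
    # and re-read from there on later actions instead of being recomputed.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()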

As always, if you find this post helpful, don't forget to accept the answer.

Expert Contributor

"If my dataset size exceeds available RAM size,The rest will remain on disk." To process the data that resides on disk needs to bring in memory ?

"If I do not delete RDD,Spark will remove data from memory if it is no longer used." The data will be deleted permenantly or it will be stored on stable storage ? and do we need to specify time for how much time RDD will reside in memory ?

When the data source is a real-world source like Twitter, how is the data stored? Does Spark first store the data on stable storage and then create an RDD to access it, or something else?

Does the data that resides on disk need to be brought into memory to be processed?

Yes, all processing has to happen in memory. Spark reserves a portion of the allocated memory specifically for processing, so there is always memory available for execution even if the memory used to cache RDDs is full. To get a better understanding of how memory and disk persistence relate, take a look at the link below.

https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
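
The split between execution memory (processing) and storage memory (cached RDDs) is configurable. A hedged sketch for a standalone script (not the shell, where sc already exists); the values shown are simply the defaults, not recommendations:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("memory-demo")                    # hypothetical app name
            .set("spark.memory.fraction", "0.6")          # share of heap used for execution + storage
            .set("spark.memory.storageFraction", "0.5"))  # portion of that protected for cached data
    sc = SparkContext(conf=conf)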

Will the data be deleted permanently, or will it be stored on stable storage?

Partitions are deleted permanently and will be recalculated every time they are needed, unless you specifically instruct Spark to persist them to disk.

https://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
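
For example, to instruct Spark to keep partitions on the executors' local disks (so they are read back rather than recomputed), or to write the data out to stable storage, here is a short sketch with hypothetical paths:

    from pyspark import StorageLevel

    rdd = sc.textFile("/data/events")          # hypothetical path
    rdd.persist(StorageLevel.DISK_ONLY)        # keep partitions on local disk only
    rdd.count()                                # materializes the partitions to disk

    # For durable storage that outlives the application, write the RDD out:
    rdd.saveAsTextFile("/data/events_out")     # hypothetical output path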

Do we need to specify how long an RDD will reside in memory?

Spark automatically monitors cache usage on each node and drops old data partitions in a least-recently-used (LRU) fashion. If you would like to remove an RDD manually instead of waiting for it to fall out of the cache, use the RDD.unpersist() or sqlContext.uncacheTable() method.

https://spark.apache.org/docs/latest/programming-guide.html#removing-data
