If I create 10 RDDs in my pySpark shell, does it mean all these 10 RDDs' data will reside in Spark memory? If I do not delete the RDD, will it be in memory forever? And what happens if my dataset size exceeds the available RAM size?
1. Yes. RDDs use lazy evaluation. That means an RDD is only loaded into memory when an action is performed on it. If you have 10 RDDs with actions performed on them, they will all be loaded into Spark memory.
2. No, Spark will remove data from memory when it is no longer used. However, if you want to force a removal/purge, you can use sqlContext.uncacheTable() or RDD.unpersist().
3. Spark will load what it can into memory; the rest will remain on disk. The downside is that for the data on disk, the RDD has to be read and recalculated every time there is an action against it. Computations will be slower because of that, but the jobs/actions will not fail.
As always, if you find this post helpful, don't forget to "accept" the answer.
"If my dataset size exceeds available RAM size, the rest will remain on disk." To process the data that resides on disk, does it need to be brought into memory?
"If I do not delete the RDD, Spark will remove data from memory if it is no longer used." Will the data be deleted permanently, or will it be stored on stable storage? And do we need to specify how long an RDD will reside in memory?
When the data source is from the real world, like Twitter, how is data storage performed? Does Spark first store the data on stable storage and then create an RDD to access that data, or something else?
To process the data that resides on disk, does it need to be brought into memory?
Yes, all processing has to happen in memory. Spark reserves a portion of the allocated memory specifically for processing, so there is always memory available for execution even when the memory used to load RDDs is full. To get a better understanding of how memory vs. disk persistence relate, take a look at the link below.
Will the data be deleted permanently, or will it be stored on stable storage?
Partitions will be deleted permanently and recalculated every time they are needed, unless you specifically instruct Spark to persist them to disk.
Do we need to specify how long an RDD will reside in memory?
Spark automatically monitors cache usage on each node and drops old data partitions in a least-recently-used (LRU) fashion. If you would like to remove an RDD manually instead of waiting for it to fall out of the cache, use the RDD.unpersist() or sqlContext.uncacheTable() method.