How does Spark work?


Expert Contributor
  1. If I create 10 RDDs in my pySpark shell, does that mean all 10 RDDs' data will reside in Spark memory?
  2. If I do not delete an RDD, will it stay in memory forever?
  3. If my dataset size exceeds the available RAM, where will the data be stored?

Re: How does Spark work?

@heta desai

1. Yes. RDDs use lazy evaluation. That means an RDD is only materialized in memory when an action is performed on it. If you have 10 RDDs with actions performed on them, they will all be loaded into Spark memory.
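To make "lazy" concrete, here is a tiny pure-Python sketch of the idea (this is not the real PySpark API; the `LazyRDD` class is invented for illustration). Transformations only record work to do; the action is what actually runs the pipeline:

```python
# Pure-Python sketch of lazy evaluation (NOT the real PySpark API):
# transformations only record work; an action triggers computation.
class LazyRDD:
    def __init__(self, data, transforms=None):
        self._data = data
        self._transforms = transforms or []

    def map(self, fn):                      # transformation: nothing runs yet
        return LazyRDD(self._data, self._transforms + [fn])

    def collect(self):                      # action: the pipeline runs now
        out = list(self._data)
        for fn in self._transforms:
            out = [fn(x) for x in out]
        return out

rdd = LazyRDD(range(3)).map(lambda x: x * 10)   # no computation happens here
print(rdd.collect())                            # [0, 10, 20]
```

In real PySpark the shape is the same: `sc.parallelize(...).map(...)` builds a plan, and nothing is computed until an action such as `collect()` or `count()` is called.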

2. No. Spark will remove data from memory if it is no longer used. However, if you want to force a removal, you can use RDD.unpersist() or sqlContext.uncacheTable().

3. Spark will load what it can into memory. Partitions that do not fit are not cached; they have to be re-read from the source and recalculated every time there is an action against them (unless you persist them with a disk-backed storage level). So computations will be slower because of that, but the jobs/actions will not fail.
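The cost of not fitting in memory can be sketched in plain Python (this is an analogy, not Spark code; `make_partition` stands in for recomputing a partition from its lineage). A bounded cache holds what fits; everything else is recomputed on every access:

```python
# Pure-Python analogy (not Spark itself): a bounded "memory" cache where
# partitions that do not fit are recomputed from lineage on every access.
def make_partition(i):
    return [i * 10 + j for j in range(3)]   # stands in for lineage recompute

class BoundedCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.mem = {}
        self.recomputes = 0

    def get(self, i):
        if i in self.mem:                   # cache hit: served from memory
            return self.mem[i]
        part = make_partition(i)            # cache miss: recompute the partition
        self.recomputes += 1
        if len(self.mem) < self.capacity:   # cache it only if room remains
            self.mem[i] = part
        return part

cache = BoundedCache(capacity=2)
for i in range(4):          # first pass: partitions 2 and 3 do not fit
    cache.get(i)
for i in range(4):          # second pass: 2 and 3 are recomputed again
    cache.get(i)
print(cache.recomputes)     # 6: four first-pass misses plus two repeats
```

This is why a disk-backed storage level like MEMORY_AND_DISK can be faster for large datasets: reading a spilled partition back is usually cheaper than recomputing it.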

As always, if you find this post helpful, don't forget to "accept" the answer.


Re: How does Spark work?

Expert Contributor

"If my dataset size exceeds available RAM size,The rest will remain on disk." To process the data that resides on disk needs to bring in memory ?

"If I do not delete RDD,Spark will remove data from memory if it is no longer used." The data will be deleted permenantly or it will be stored on stable storage ? and do we need to specify time for how much time RDD will reside in memory ?

When the data source is a real-world stream such as Twitter, how is data storage performed? Does Spark first store the data on stable storage and then create an RDD to access it, or something else?


Re: How does Spark work?

Does the data that resides on disk need to be brought into memory to be processed?

Yes, all processing has to happen in memory. Spark reserves a portion of the allocated memory specifically for processing (execution), so there is always memory available for processing even if the memory used to cache RDDs is full. To get a better understanding of how memory and disk persistence relate, take a look at the link below.

https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
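As a rough back-of-the-envelope illustration of that split, here is a small calculation assuming Spark's unified memory model (Spark 1.6+) with the commonly cited defaults spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5, and a 300 MB reserved region; these exact numbers can vary by version, so treat this as a sketch, not a reference:

```python
# Sketch of how an executor heap is split under Spark's unified memory
# model. Assumed defaults: spark.memory.fraction = 0.6,
# spark.memory.storageFraction = 0.5, ~300 MB reserved (version-dependent).
def memory_pools(executor_mem_mb, fraction=0.6, storage_fraction=0.5):
    usable = (executor_mem_mb - 300) * fraction   # unified memory region
    storage = usable * storage_fraction           # cached RDDs live here
    execution = usable - storage                  # shuffles, joins, sorts
    return storage, execution

storage, execution = memory_pools(4096)           # a 4 GB executor
print(round(storage), round(execution))           # ~1139 MB each
```

Execution can also borrow from the storage pool (and evict cached blocks) when it runs short, which is why processing does not fail just because the cache is full.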

Will the data be deleted permanently, or will it be stored on stable storage?

Partitions are deleted permanently and will be recalculated every time they are needed, unless you specifically instruct Spark to persist them to disk.

https://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose

Do we need to specify how long an RDD will reside in memory?

No. Spark automatically monitors cache usage on each node and drops old data partitions in a least-recently-used (LRU) fashion. If you would like to remove an RDD manually instead of waiting for it to fall out of the cache, use the RDD.unpersist() or sqlContext.uncacheTable() method.

https://spark.apache.org/docs/latest/programming-guide.html#removing-data
