Spark RDD/DataFrame caching

Contributor

Suppose I have the following piece of code:

val a = sc.textFile("path/to/file")
val b = a.filter(<something..>).groupBy(<something..>)
val c = b.filter(<something..>).groupBy(<something..>)
val d = c.<some transform>
val e = d.<some transform>
val sum1 = e.reduce(<reduce func>)
val sum2 = b.reduce(<reduce func>)

Note that I have not used any cache/persist command.

Since the RDD b is used again in the last action, will Spark automatically cache it, or will it be recomputed from the source file?
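
To make the question concrete, this is roughly what I would write if it turns out that b has to be cached explicitly (just a sketch; the filter and grouping functions, and the storage level, are placeholders I made up):

import org.apache.spark.storage.StorageLevel

// sc is the SparkContext from spark-shell; the logic below is illustrative only
val a = sc.textFile("path/to/file")
val b = a.filter(_.nonEmpty)
         .groupBy(line => line.take(1))
         .persist(StorageLevel.MEMORY_ONLY)           // or simply .cache()

val sum1 = b.mapValues(_.size).values.reduce(_ + _)   // first action: computes and caches b
val sum2 = b.count()                                   // second action: reads the cached b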

Will the behaviour be the same if I use DataFrames for the above steps?
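
The DataFrame variant I have in mind would look something like this (again just a sketch, assuming a Spark 2.x SparkSession; the input format and the "key"/"value" column names are invented, and the cache() call is the part I am asking about):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("caching-question").getOrCreate()
import spark.implicits._

val dfA = spark.read.option("header", "true").csv("path/to/file")
val dfB = dfA.filter(length($"value") > 0).groupBy($"key").count().cache()   // explicit cache

val dfE = dfB.filter($"count" > 1).withColumn("doubled", $"count" * 2)
val sum1 = dfE.agg(sum($"doubled")).first().getLong(0)    // first action
val sum2 = dfB.agg(sum($"count")).first().getLong(0)      // would this reuse the cached dfB?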

Lastly, will the RDDs c or d exist at any point in time? Or will Spark look ahead, see that they are not used in any action, chain the transformations for c and d onto b, and directly calculate e?
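
My current understanding is that transformations are lazy, so I would expect c and d to exist only as entries in e's lineage until an action runs; for example, continuing the sketch above, printing the lineage should not trigger any job (the extra transformations are placeholders):

// continuing from the RDD sketch above
val c = b.filter { case (_, lines) => lines.nonEmpty }
         .groupBy { case (key, _) => key.length }
val d = c.mapValues(_.size)
val e = d.values

println(e.toDebugString)   // prints e's full lineage (including the RDDs behind c and d) without running a job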

I am new to Spark and am trying to understand the basics.

Regards,

Anirban

1 ACCEPTED SOLUTION

New Contributor
2 REPLIES


Contributor

Hmm, understood.

Thanks @Jan Rock