Created 10-16-2017 09:43 AM
Suppose I have the following piece of code:
val a = sc.textFile("path/to/file")
val b = a.filter(<something..>).groupBy(<something..>)
val c = b.filter(<something..>).groupBy(<something..>)
val d = c.<some transform>
val e = d.<some transform>
val sum1 = e.reduce(<reduce func>)
val sum2 = b.reduce(<reduce func>)
Note that I have not used any cache/persist command.
Since the RDD b is used again in the last action, will Spark automatically cache it? Or will it be recomputed from the source file?
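For context, here is a minimal sketch of what explicit caching would look like, assuming a live SparkContext `sc`; the filter/groupBy bodies are hypothetical stand-ins for the `<something..>` placeholders above:

val a = sc.textFile("path/to/file")
val b = a.filter(line => line.nonEmpty).groupBy(line => line.length)

// Without this call, each action below re-runs b's whole lineage
// from the text file; cache() marks b for in-memory reuse instead.
b.cache()

val sum1 = b.mapValues(_.size).values.reduce(_ + _) // first action: computes and caches b
val sum2 = b.count()                                // second action: reads b from the cache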
Would the behaviour be the same if I used DataFrames for the above steps?
Lastly, at any point in time will the RDDs c or d exist? Or will Spark look ahead, see that they are not used in any action, and chain the transformations for c and d into b so that e is computed directly?
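One way to see this is to inspect the lineage before any action runs; a hedged sketch, again with hypothetical transforms in place of the placeholders:

// Transformations are lazy, so c and d exist only as lineage
// metadata until an action forces computation.
val c = b.filter { case (_, vs) => vs.size > 1 }
val d = c.mapValues(_.toList.sorted)

// toDebugString prints the recursive lineage (d -> c -> b -> a)
// without computing any data.
println(d.toDebugString)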
I am new to Spark and am trying to understand the basics.
Regards,
Anirban
Created 10-16-2017 10:16 AM
Created 10-17-2017 10:38 AM
hmmm.. understood.
thanks @Jan Rock