Choosing RDD Persistence and Caching with Spark on YARN

Expert Contributor

1) How should we approach the choice between persist() and cache() when running Spark on YARN? For example, how should a Spark developer estimate approximately how much memory will be available to their YARN queue, and then use that number to guide their persist() choice? Or should they use some other technique?

http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

2) With Spark on YARN, does the RDD exist only as long as the Spark driver lives, only as long as the RDD's related Spark worker containers live, or for some other time frame?

Thanks!

1 ACCEPTED SOLUTION

New Contributor

The RDD exists only as long as the Spark driver lives. If one or more of the Spark worker containers die, the lost portions of the RDD will be recomputed and re-cached.

persist() and cache() at the RDD level are effectively the same: cache() is just persist() with the default storage level, StorageLevel.MEMORY_ONLY.

persist() is more flexible, though, since it lets you choose among several other storage levels.

http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
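A minimal sketch of the difference in Scala, assuming an existing SparkContext named sc (the input path and variable names are hypothetical):

import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
// partitions that do not fit in memory are simply not cached and
// are recomputed from lineage when they are needed again.
val lines = sc.textFile("hdfs:///tmp/events.log")  // hypothetical path
lines.cache()

// persist() accepts other storage levels, e.g. spilling to disk
// instead of recomputing, or storing serialized to save heap.
val words = lines.flatMap(_.split("\\s+"))
words.persist(StorageLevel.MEMORY_AND_DISK)

words.count()      // the first action materializes and caches the data
words.unpersist()  // explicitly frees the cached blocks when done

Either way, the cached blocks live inside the executors for the lifetime of the application, which is why they disappear once the driver exits.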


4 REPLIES

Expert Contributor

There is also an article, "How to size memory-only RDDs," which references setting spark.executor.memory and spark.yarn.executor.memoryOverhead. Should we use these as well when planning memory/RDD usage?

https://www.altiscale.com/blog/tips-and-tricks-for-running-spark-on-hadoop-part-3-rdd-persistence/
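For illustration, a minimal sketch of setting those two properties programmatically; the values are assumptions, not recommendations (in Spark-on-YARN versions that use this property name, spark.yarn.executor.memoryOverhead defaults to roughly 10% of executor memory, with a 384 MB floor):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sizing only: the right numbers depend on how much
// memory your YARN queue can grant per container.
val conf = new SparkConf()
  .setAppName("rdd-persistence-sizing")              // hypothetical name
  .set("spark.executor.memory", "4g")                // executor heap
  .set("spark.yarn.executor.memoryOverhead", "512")  // off-heap overhead, MB

// Each executor container then asks YARN for roughly 4g + 512m,
// so the queue needs headroom for (number of executors x ~4.5g).
val sc = new SparkContext(conf)

The same settings are more commonly passed on the spark-submit command line (--executor-memory 4g --conf spark.yarn.executor.memoryOverhead=512); the sizing arithmetic is the same either way.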

Master Mentor

Thanks @Ram Sriharsha for chiming in. Really appreciate it.