Created 11-09-2015 03:52 PM
1) How should we approach the question of persist() or cache() when running Spark on YARN? For example, how should a Spark developer know approximately how much memory will be available to their YARN queue, and use that number to guide their persist() choice? Or should they use some other technique?
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
2) With Spark on YARN, does the RDD exist only as long as the Spark driver lives, only as long as the RDD's related Spark worker containers live, or for some other time frame?
Thanks!
Created 11-09-2015 07:05 PM
The RDD exists only as long as the Spark driver lives. If one or more of the Spark worker containers die, the lost portions of the RDD will be recomputed and re-cached.
persist() and cache() at the RDD level are actually the same: cache() simply uses persist()'s default behavior, StorageLevel.MEMORY_ONLY.
persist() has more options, though, and lets you choose among several different storage levels.
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
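For illustration, here is a minimal Scala sketch of that equivalence. It assumes an existing SparkContext `sc` and a placeholder input path; both are hypothetical, not from the thread.

import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc`; the input path is a placeholder.
val lines = sc.textFile("hdfs:///data/input.txt")

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val cached = lines.map(_.toUpperCase).cache()

// persist() lets you pick a different level, e.g. serialized storage that
// spills to disk when partitions do not fit in executor storage memory.
val persisted = lines.map(_.length).persist(StorageLevel.MEMORY_AND_DISK_SER)

cached.count()     // the first action materializes and caches the RDD
persisted.count()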
Created 11-09-2015 03:54 PM
There is also an article on "How to size memory-only RDDs" which references setting spark.executor.memory and spark.yarn.executor.memoryOverhead. Should we use these as well when planning memory/RDD usage?
https://www.altiscale.com/blog/tips-and-tricks-for-running-spark-on-hadoop-part-3-rdd-persistence/
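As a rough sketch of how those two settings relate to YARN sizing (the values below are made up for illustration): YARN requests a container of roughly spark.executor.memory plus spark.yarn.executor.memoryOverhead per executor, so dividing the queue's capacity by that total gives an approximate bound on how many executors (and how much cache space) you can expect.

import org.apache.spark.SparkConf

// Hypothetical sizing: a 4 GB heap per executor plus 512 MB off-heap overhead.
// YARN will request a container of roughly 4096 MB + 512 MB per executor, so
// dividing your queue's capacity by ~4.5 GB estimates how many executors fit.
val conf = new SparkConf()
  .setAppName("rdd-persistence-sizing-example")
  .set("spark.executor.memory", "4g")
  .set("spark.yarn.executor.memoryOverhead", "512") // value in MB for Spark 1.x on YARN

// The master and deploy mode would normally come from spark-submit,
// e.g. spark-submit --master yarn-client ... , with the same --conf settings.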
Created 11-09-2015 07:10 PM
Thanks @Ram Sriharsha for chiming in. Really appreciate it.