Created 11-09-2015 03:52 PM
1) How should we approach the question of persist() or cache() when running Spark on YARN? For example, how should a Spark developer know approximately how much memory will be available to their YARN queue, and use that number to guide their persist() choice? Or should they use some other technique?
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
2) With Spark on YARN, does the RDD exist only as long as the Spark driver lives, only as long as the RDD's related Spark worker containers live, or for some other time frame?
Thanks!
Created 11-09-2015 07:05 PM
The RDD exists only as long as the Spark driver lives. If one or more of the Spark worker containers die, the lost portions of the RDD will be recomputed and re-cached.
persist() and cache() at the RDD level are actually the same: cache() simply uses persist()'s default behavior, StorageLevel.MEMORY_ONLY.
persist() has more options, though, and lets you choose among several different storage levels.
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
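For illustration, here is a minimal Scala sketch of that equivalence. It assumes an existing SparkContext `sc` and a placeholder input path; both are hypothetical, not from the thread.

import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc`; the input path is a placeholder.
val lines = sc.textFile("hdfs:///data/input.txt")

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val cached = lines.map(_.toUpperCase).cache()

// persist() lets you pick a different level, e.g. serialized storage that
// spills to disk when partitions do not fit in executor storage memory.
val persisted = lines.map(_.length).persist(StorageLevel.MEMORY_AND_DISK_SER)

cached.count()     // the first action materializes and caches the RDD
persisted.count()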
Created 11-09-2015 03:54 PM
There is also an article on "How to size memory-only RDDs" which references setting spark.executor.memory and spark.yarn.executor.memoryOverhead. Should we use these as well when planning memory/RDD usage?
https://www.altiscale.com/blog/tips-and-tricks-for-running-spark-on-hadoop-part-3-rdd-persistence/
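As a rough sketch of how those two settings relate to YARN sizing (the values below are made up for illustration): YARN requests a container of roughly spark.executor.memory plus spark.yarn.executor.memoryOverhead per executor, so dividing the queue's capacity by that total gives an approximate bound on how many executors (and how much cache space) you can expect.

import org.apache.spark.SparkConf

// Hypothetical sizing: a 4 GB heap per executor plus 512 MB off-heap overhead.
// YARN will request a container of roughly 4096 MB + 512 MB per executor, so
// dividing your queue's capacity by ~4.5 GB estimates how many executors fit.
val conf = new SparkConf()
  .setAppName("rdd-persistence-sizing-example")
  .set("spark.executor.memory", "4g")
  .set("spark.yarn.executor.memoryOverhead", "512") // value in MB for Spark 1.x on YARN

// The master and deploy mode would normally come from spark-submit,
// e.g. spark-submit --master yarn-client ... , with the same --conf settings.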
Created 11-09-2015 07:10 PM
Thanks @Ram Sriharsha for chiming in. Really appreciate it.