Choosing RDD Persistence and Caching with Spark on YARN
Labels: Apache Spark, Apache YARN
Created 11-09-2015 03:52 PM
1) How should we approach the question of persist() vs. cache() when running Spark on YARN? For example, how should a Spark developer know approximately how much memory will be available to their YARN queue, and use that number to guide their choice of persist() storage level? Or should they use some other technique?
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
2) With Spark on YARN, does the RDD exist only as long as the Spark driver lives, only as long as the RDD's Spark worker containers live, or for some other time frame?
Thanks!
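Not from this thread, but one way to answer the "how much memory is actually available" part at runtime is SparkContext.getExecutorMemoryStatus, which reports per-executor storage memory. A minimal sketch; the app name, input path, and 8 GB threshold are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Sketch only: ask the running application how much storage memory its
// executors actually received from YARN, and let that guide persist().
val sc = new SparkContext(new SparkConf().setAppName("persist-sizing-check"))

// executor -> (max memory available for caching, remaining memory), in bytes
val memStatus = sc.getExecutorMemoryStatus
memStatus.foreach { case (executor, (maxMem, remaining)) =>
  println(s"$executor: max=${maxMem / 1024 / 1024} MB, free=${remaining / 1024 / 1024} MB")
}

val totalFreeMB = memStatus.values.map(_._2).sum / 1024 / 1024
val rdd = sc.textFile("hdfs:///data/events")  // hypothetical input path

// If the data is unlikely to fit in the remaining storage memory,
// pick a level that can spill to disk instead of plain MEMORY_ONLY.
val level =
  if (totalFreeMB > 8 * 1024) StorageLevel.MEMORY_ONLY
  else StorageLevel.MEMORY_AND_DISK_SER
rdd.persist(level)
```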
Created 11-09-2015 07:05 PM
The RDD exists only as long as the Spark driver lives. If one or more of the Spark worker containers die, the lost portions of the RDD will be recomputed and re-cached.
persist() and cache() at the RDD level are actually the same. persist() has more options, though: the default behavior of persist() is StorageLevel.MEMORY_ONLY, but you can persist at various other storage levels.
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
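A minimal sketch of that equivalence, assuming an existing SparkContext named sc (the paths are placeholders):

```scala
import org.apache.spark.storage.StorageLevel

// cache() is just shorthand for persist(StorageLevel.MEMORY_ONLY).
val lines = sc.textFile("hdfs:///data/sample.txt")  // placeholder path
lines.cache()
// equivalent: lines.persist(StorageLevel.MEMORY_ONLY)

// persist() also accepts other storage levels, e.g. allow spilling
// partitions to disk when they do not fit in executor memory.
val words = lines.flatMap(_.split("\\s+"))
words.persist(StorageLevel.MEMORY_AND_DISK)

// Drop the cached blocks once they are no longer needed.
words.unpersist()
```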
Created 11-09-2015 03:54 PM
There is also an article on how to size memory-only RDDs, which references setting spark.executor.memory and spark.yarn.executor.memoryOverhead. Should we use these as well when planning memory/RDD usage?
https://www.altiscale.com/blog/tips-and-tricks-for-running-spark-on-hadoop-part-3-rdd-persistence/
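For context, both settings go on the Spark configuration (or the spark-submit command line); YARN sizes each executor container at roughly spark.executor.memory + spark.yarn.executor.memoryOverhead, so together they bound how many executors fit in the queue. A sketch with illustrative values:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only: 4 GB executor heap plus 512 MB of off-heap
// headroom that YARN adds to each executor container request.
val conf = new SparkConf()
  .setAppName("memory-sizing-example")
  .set("spark.executor.memory", "4g")
  .set("spark.yarn.executor.memoryOverhead", "512")  // value is in MB (Spark 1.x)
val sc = new SparkContext(conf)
```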
Created 11-09-2015 07:10 PM
Thanks @Ram Sriharsha for chiming in. Really appreciate it.
