Benefit of DISK_ONLY persistence
Labels:
- Apache Hive
- Apache Spark
- HDFS
Created on ‎07-26-2015 04:47 PM - edited ‎09-16-2022 02:35 AM
Hi dear experts!
I am exploring Spark's persist capabilities and noticed an interesting behaviour of DISK_ONLY persistence.
As far as I understand, the main goal is to store reusable and intermediate RDDs that were produced from permanent data (which lives on HDFS):
import org.apache.spark.storage.StorageLevel
val input = sc.textFile("/user/hive/warehouse/big_table")
val result = input.coalesce(600).persist(StorageLevel.DISK_ONLY)

scala> result.count()
…… // and repeat command ……..
scala> result.count()
So I was surprised to see that the second run was significantly faster...
Could anybody explain why?
Thanks!
Created ‎07-26-2015 11:53 PM
Hm, is that surprising? You described why it is faster in your message: the second time, "result" does not have to be recomputed, since it is available on disk. It is the result of a potentially expensive shuffle operation (coalesce).
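If you want to confirm that the second count() is served from the persisted copy, here is a minimal sketch (assuming the same sc and result from the question; toDebugString and getRDDStorageInfo are standard RDD/SparkContext methods):

println(result.toDebugString)   // persisted RDDs are annotated with their storage level in the lineage
sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, ${info.diskSize} bytes on disk")
}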
Created ‎07-27-2015 02:53 AM
But in the second case I read the whole dataset, just as in the first case (without any map operation).
So in both cases I read the whole dataset...
Regarding the shuffle: I use coalesce instead of repartition, so it is supposed to avoid shuffle operations...
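For what it's worth, repartition(n) is implemented as coalesce(n, shuffle = true), so whether a shuffle happens hinges on that flag and on whether you are shrinking the partition count. A quick sketch on toy data (not the original table) to compare the lineages:

val toy = sc.parallelize(1 to 100000, 1000)    // 1000 partitions to start with
val narrow = toy.coalesce(600)                 // shuffle = false by default: merges partitions, no shuffle
val wide = toy.coalesce(600, shuffle = true)   // same as toy.repartition(600): forces a shuffle
println(narrow.toDebugString)                  // no ShuffledRDD in this lineage
println(wide.toDebugString)                    // lineage contains a ShuffledRDD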
Created ‎07-27-2015 04:16 AM
The first case is: read - shuffle - persist - count
The second case is: read (from persisted copy) - count
You are right that coalesce does not always shuffle, but it may in this case. It depends on whether you started with more or fewer partitions. You should look at the Spark UI to see whether a shuffle occurred.
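One quick way to check, besides the Spark UI (Stages tab, "Shuffle Write" column, typically at http://<driver>:4040), is to compare partition counts before and after the coalesce. A hedged sketch, assuming the input and result RDDs from the question (the source partition count depends on the HDFS block layout of big_table):

println(input.partitions.length)    // partitions of the source text file
println(result.partitions.length)   // 600 after coalesce(600)
// If input had more than 600 partitions, coalesce(600) merges them without a shuffle;
// if it had fewer, coalesce cannot raise the count unless shuffle = true is passed,
// and the partition count simply stays where it was.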
