
Benefit of DISK_ONLY persists

Rising Star

Hi dear experts!

I am exploring Spark's persist capabilities and noticed some interesting behaviour with DISK_ONLY persistence.

As far as I understand, its main goal is to store reusable, intermediate RDDs that were produced from permanent data (data that lives on HDFS).

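import org.apache.spark.storage.StorageLevel
// Read the table files from HDFS as lines of text
val input = sc.textFile("/user/hive/warehouse/big_table")
// Reduce to 600 partitions and mark the RDD to be kept on local disk
val result = input.coalesce(600).persist(StorageLevel.DISK_ONLY)
// First action: computes the RDD and writes the persisted partitions
result.count()
// ... then repeat the same command
result.count()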

So I was surprised to see that the second run was significantly faster...

Could anybody explain why?

[Screenshot attached: Untitled.jpg]

Thanks!


Re: Benefit of DISK_ONLY persists

Master Collaborator

Hm, is that surprising? You described why it is faster in your own message. The second time, "result" does not have to be recomputed, since it is already available on disk. It is the output of a potentially expensive shuffle operation (coalesce).
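To see the effect in isolation, here is a minimal sketch that times two counts on a DISK_ONLY RDD. It assumes a small synthetic dataset instead of big_table; the `timed` helper, the app name and the data are made up for illustration, and the actual timings will depend on your cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Reuses the shell's context if one exists, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("disk-only-demo"))

// Hypothetical helper: runs `body`, prints the elapsed wall-clock time, returns the result.
def timed[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val value = body
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  value
}

// Synthetic stand-in for the table; map() plays the role of the work to be reused.
val result = sc.parallelize(1 to 1000000, 8)
  .map(_ * 2)
  .persist(StorageLevel.DISK_ONLY)

val first  = timed("first count")(result.count())   // computes and writes blocks to disk
val second = timed("second count")(result.count())  // reads the persisted blocks back

val level = result.getStorageLevel  // storage level assigned by persist()
result.unpersist()                  // drop the persisted blocks when done
```

The persisted blocks should also show up under the Storage tab of the Spark UI after the first count.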

Re: Benefit of DISK_ONLY persists

Rising Star

But in the second case I read the whole dataset, just as in the first case (without any map operation).

So in both cases I read the whole dataset...

Regarding the shuffle: I use coalesce instead of repartition, so it is supposed to avoid shuffle operations...

Accepted Solution

Re: Benefit of DISK_ONLY persists

Master Collaborator

The first case is: read - shuffle - persist - count

The second case is: read (from persisted copy) - count

 

You are right that coalesce does not always shuffle, but it may in this case. It depends on whether you started with more or fewer partitions. You should look at the Spark UI to see whether a shuffle occurred.
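Besides the UI, you can check this from the shell by looking at partition counts and the RDD lineage. A minimal sketch, assuming a small synthetic RDD in place of big_table (the data, partition counts and app name are made up for illustration). As far as I know, coalesce with its default shuffle = false only merges partitions through a narrow dependency, and asking for more partitions than the parent has leaves the count unchanged. A ShuffledRDD should appear in the lineage only when shuffle = true is passed, which is what repartition does.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("coalesce-check"))

// Synthetic stand-in for the input, split into 8 partitions.
val input = sc.parallelize(1 to 1000, 8)

val narrowed = input.coalesce(4)                  // default shuffle = false
val shuffled = input.coalesce(4, shuffle = true)  // equivalent to repartition(4)
val widened  = input.coalesce(16)                 // more partitions than the parent, no shuffle

println(s"input: ${input.getNumPartitions}, narrowed: ${narrowed.getNumPartitions}, " +
  s"widened: ${widened.getNumPartitions}")

// The lineage shows whether a shuffle stage is involved.
println(narrowed.toDebugString)
println(shuffled.toDebugString)

val narrowedHasShuffle = narrowed.toDebugString.contains("ShuffledRDD")
val shuffledHasShuffle = shuffled.toDebugString.contains("ShuffledRDD")
```

Running the same two checks (getNumPartitions and toDebugString) on the real input and on result should tell you which situation applies to your job.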
