<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Benefit of DISK_ONLY persists in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Benefit-of-DISK-ONLY-persists/m-p/30082#M6749</link>
    <description>&lt;P&gt;But in the second case I read the whole dataset, just as in the first case (without any map operation).&lt;/P&gt;&lt;P&gt;So in both cases I read the whole dataset...&lt;/P&gt;&lt;P&gt;Regarding the shuffle: I use&amp;nbsp;&lt;SPAN&gt;coalesce instead of repartition, so it is supposed to avoid shuffle operations...&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 27 Jul 2015 09:53:54 GMT</pubDate>
    <dc:creator>fil</dc:creator>
    <dc:date>2015-07-27T09:53:54Z</dc:date>
    <item>
      <title>Benefit of DISK_ONLY persists</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Benefit-of-DISK-ONLY-persists/m-p/30071#M6747</link>
      <description>&lt;P&gt;Hi dear experts!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am exploring Spark's persist capabilities and noticed an interesting behaviour of&amp;nbsp;DISK_ONLY persistence.&lt;/P&gt;&lt;P&gt;As far as I understand, its main goal is to store reusable, intermediate RDDs that were produced from permanent data (which lives&amp;nbsp;on HDFS).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;import org.apache.spark.storage.StorageLevel
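// Read the table files from HDFS as an RDD of text lines
val input = sc.textFile("/user/hive/warehouse/big_table")
// Coalesce to 600 partitions and mark the RDD to be persisted on disk only
val result = input.coalesce(600).persist(StorageLevel.DISK_ONLY)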
scala&amp;gt; result.count()
……
// and repeat the command
……..
scala&amp;gt; result.count()&lt;/PRE&gt;&lt;P&gt;So I was surprised to see that the second iteration was significantly faster...&lt;/P&gt;&lt;P&gt;Could anybody explain why?&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Untitled.jpg" style="width: 999px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/1058iD0C972F7C3173456/image-size/large?v=v2&amp;amp;px=999" role="button" title="Untitled.jpg" alt="Untitled.jpg" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 09:35:36 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Benefit-of-DISK-ONLY-persists/m-p/30071#M6747</guid>
      <dc:creator>fil</dc:creator>
      <dc:date>2022-09-16T09:35:36Z</dc:date>
    </item>
    <item>
      <title>Re: Benefit of DISK_ONLY persists</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Benefit-of-DISK-ONLY-persists/m-p/30077#M6748</link>
      <description>&lt;P&gt;Hm, is that surprising? You described why it is faster in your message. The second time, "result" does not have to be recomputed, since it is already available on disk. It is the result of a potentially expensive shuffle operation (coalesce).&lt;/P&gt;</description>
      <pubDate>Mon, 27 Jul 2015 06:53:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Benefit-of-DISK-ONLY-persists/m-p/30077#M6748</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2015-07-27T06:53:01Z</dc:date>
    </item>
    <item>
      <title>Re: Benefit of DISK_ONLY persists</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Benefit-of-DISK-ONLY-persists/m-p/30082#M6749</link>
      <description>&lt;P&gt;But in the second case I read the whole dataset, just as in the first case (without any map operation).&lt;/P&gt;&lt;P&gt;So in both cases I read the whole dataset...&lt;/P&gt;&lt;P&gt;Regarding the shuffle: I use&amp;nbsp;&lt;SPAN&gt;coalesce instead of repartition, so it is supposed to avoid shuffle operations...&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 27 Jul 2015 09:53:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Benefit-of-DISK-ONLY-persists/m-p/30082#M6749</guid>
      <dc:creator>fil</dc:creator>
      <dc:date>2015-07-27T09:53:54Z</dc:date>
    </item>
    <item>
      <title>Re: Benefit of DISK_ONLY persists</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Benefit-of-DISK-ONLY-persists/m-p/30092#M6750</link>
      <description>&lt;P&gt;The first case is: read - shuffle - persist - count&lt;/P&gt;&lt;P&gt;The second case is: read (from persisted copy) - count&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You are right that coalesce does not always shuffle, but it may in this case. It depends on whether you started with more or fewer partitions than the 600 you asked for. You should look at the Spark UI to see whether a shuffle occurred.&lt;/P&gt;</description>
      <pubDate>Mon, 27 Jul 2015 11:16:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Benefit-of-DISK-ONLY-persists/m-p/30092#M6750</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2015-07-27T11:16:41Z</dc:date>
    </item>
  </channel>
</rss>

