<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark RDD/Dataframe caching in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Spark-RDD-Dataframe-caching/m-p/221106#M182980</link>
    <description>&lt;P&gt;Hi Anirban,&lt;/P&gt;&lt;P&gt;All transformations in Spark are &lt;EM&gt;lazy&lt;/EM&gt;, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, Spark can recognize that a dataset created through &lt;CODE&gt;map&lt;/CODE&gt; will be used in a &lt;CODE&gt;reduce&lt;/CODE&gt; and return only the result of the &lt;CODE&gt;reduce&lt;/CODE&gt; to the driver, rather than the larger mapped dataset.&lt;/P&gt;&lt;P&gt;By default, each transformed RDD may be recomputed each time you run an action on it. In your example that means &lt;STRONG&gt;b&lt;/STRONG&gt; is not cached automatically and may be recomputed for the second &lt;CODE&gt;reduce&lt;/CODE&gt;. To avoid that, you can &lt;EM&gt;persist&lt;/EM&gt; an RDD in memory using the &lt;CODE&gt;persist&lt;/CODE&gt; (or &lt;CODE&gt;cache&lt;/CODE&gt;) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.&lt;/P&gt;&lt;P&gt;More: &lt;A href="http://spark.apache.org/docs/2.1.1/programming-guide.html" target="_blank"&gt;http://spark.apache.org/docs/2.1.1/programming-guide.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Jan&lt;/P&gt;</description>
    <pubDate>Mon, 16 Oct 2017 17:16:30 GMT</pubDate>
    <dc:creator>YOTTALABS</dc:creator>
    <dc:date>2017-10-16T17:16:30Z</dc:date>
    <item>
      <title>Spark RDD/Dataframe caching</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-RDD-Dataframe-caching/m-p/221105#M182979</link>
      <description>&lt;P&gt;Suppose I have the following piece of code: &lt;/P&gt;&lt;PRE&gt;val a = sc.textFile("path/to/file")
val b = a.filter(&amp;lt;something..&amp;gt;).groupBy(&amp;lt;something..&amp;gt;)
val c = b.filter(&amp;lt;something..&amp;gt;).groupBy(&amp;lt;something..&amp;gt;)
val d = c.&amp;lt;some transform&amp;gt;
val e = d.&amp;lt;some transform&amp;gt;
val sum1 = e.reduce(&amp;lt;reduce func&amp;gt;)
val sum2 = b.reduce(&amp;lt;reduce func&amp;gt;)
&lt;/PRE&gt;&lt;P&gt;Note that I have not used any cache/persist command. &lt;BR /&gt;&lt;BR /&gt;Since the RDD &lt;STRONG&gt;b &lt;/STRONG&gt;is used again in the last action, will Spark automatically cache it? Or will it be recalculated from the dataset? &lt;BR /&gt;&lt;BR /&gt;Will the behaviour be the same if I use DataFrames for the above steps?&lt;/P&gt;&lt;P&gt;Lastly, will the RDDs &lt;STRONG&gt;c &lt;/STRONG&gt;or &lt;STRONG&gt;d &lt;/STRONG&gt;exist at any point in time? Or will Spark look ahead, see that they are not used in any actions, and consequently chain the transformations for &lt;STRONG&gt;c &lt;/STRONG&gt;and &lt;STRONG&gt;d &lt;/STRONG&gt;onto &lt;STRONG&gt;b &lt;/STRONG&gt;and directly calculate &lt;STRONG&gt;e&lt;/STRONG&gt;?&lt;/P&gt;&lt;P&gt;I am new to Spark and am trying to understand the basics. &lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Anirban&lt;/P&gt;</description>
      <pubDate>Mon, 16 Oct 2017 16:43:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-RDD-Dataframe-caching/m-p/221105#M182979</guid>
      <dc:creator>nrbndsdb0509</dc:creator>
      <dc:date>2017-10-16T16:43:05Z</dc:date>
    </item>
    <item>
      <title>Re: Spark RDD/Dataframe caching</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-RDD-Dataframe-caching/m-p/221106#M182980</link>
      <description>&lt;P&gt;Hi Anirban,&lt;/P&gt;&lt;P&gt;All transformations in Spark are &lt;EM&gt;lazy&lt;/EM&gt;, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, Spark can recognize that a dataset created through &lt;CODE&gt;map&lt;/CODE&gt; will be used in a &lt;CODE&gt;reduce&lt;/CODE&gt; and return only the result of the &lt;CODE&gt;reduce&lt;/CODE&gt; to the driver, rather than the larger mapped dataset.&lt;/P&gt;&lt;P&gt;By default, each transformed RDD may be recomputed each time you run an action on it. In your example that means &lt;STRONG&gt;b&lt;/STRONG&gt; is not cached automatically and may be recomputed for the second &lt;CODE&gt;reduce&lt;/CODE&gt;. To avoid that, you can &lt;EM&gt;persist&lt;/EM&gt; an RDD in memory using the &lt;CODE&gt;persist&lt;/CODE&gt; (or &lt;CODE&gt;cache&lt;/CODE&gt;) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.&lt;/P&gt;&lt;P&gt;More: &lt;A href="http://spark.apache.org/docs/2.1.1/programming-guide.html" target="_blank"&gt;http://spark.apache.org/docs/2.1.1/programming-guide.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Jan&lt;/P&gt;</description>
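      <!--
      The reply above says a transformed RDD may be recomputed for each action
      unless it is persisted. A minimal Scala sketch of that behaviour, assuming
      a local Spark 2.x setup: the object name CachingSketch, the helper
      filterCalls, the local[*] master and the in-memory input from
      sc.parallelize (standing in for the file) are illustrative assumptions and
      not part of the original thread. It counts filter invocations with an
      accumulator, which is only a rough probe because task retries can inflate
      the count.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  // Runs two actions over the same filtered RDD and returns how many times
  // the filter function was invoked in total (hypothetical helper).
  def filterCalls(sc: SparkContext, persistB: Boolean): Long = {
    val calls = sc.longAccumulator("filterCalls")

    // Stand-in for sc.textFile(...): a small in-memory dataset (assumption).
    val a = sc.parallelize(1 to 100)
    val b = a.filter { n => calls.add(1L); n % 2 == 0 }

    // persist() with no arguments is equivalent to cache() (MEMORY_ONLY).
    if (persistB) b.persist()

    val e = b.map(_ * 2)       // still lazy: nothing has been computed yet
    val sum1 = e.reduce(_ + _) // first action: computes b, stores it if persisted
    val sum2 = b.reduce(_ + _) // second action: reads stored b or recomputes it

    // Sanity check on the results themselves (sum of evens in 1..100 is 2550).
    require(sum1 == 5100 && sum2 == 2550)

    b.unpersist()              // no-op when b was never persisted
    calls.value
  }

  def main(args: Array[String]): Unit = {
    // App name and master are assumptions for a self-contained local run.
    val conf = new SparkConf().setAppName("caching-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(s"filter calls without persist: ${filterCalls(sc, persistB = false)}")
      println(s"filter calls with persist:    ${filterCalls(sc, persistB = true)}")
    } finally sc.stop()
  }
}
```

      In a local run without task retries, the first call is expected to report
      200 (the filter runs once per element for each action) and the second 100
      (the filter runs only for the first action). The sketch deliberately uses
      filter and map rather than groupBy, since a shuffle adds stage boundaries
      that are outside what it is meant to show.
      -->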
      <pubDate>Mon, 16 Oct 2017 17:16:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-RDD-Dataframe-caching/m-p/221106#M182980</guid>
      <dc:creator>YOTTALABS</dc:creator>
      <dc:date>2017-10-16T17:16:30Z</dc:date>
    </item>
    <item>
      <title>Re: Spark RDD/Dataframe caching</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-RDD-Dataframe-caching/m-p/221107#M182981</link>
      <description>&lt;P&gt;hmmm.. understood. &lt;/P&gt;&lt;P&gt;thanks &lt;A rel="user" href="https://community.cloudera.com/users/44447/rock.html" nodeid="44447"&gt;@Jan Rock&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 17 Oct 2017 17:38:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-RDD-Dataframe-caching/m-p/221107#M182981</guid>
      <dc:creator>nrbndsdb0509</dc:creator>
      <dc:date>2017-10-17T17:38:12Z</dc:date>
    </item>
  </channel>
</rss>

