<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: DataFrames with Kryo serialization in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/DataFrames-with-Kryo-serialization/m-p/167791#M37144</link>
    <description>&lt;P&gt;When you use RDDs in your Java or Scala Spark code, Spark distributes the data to nodes within the cluster using Java serialization by default. For Java and Scala objects, Spark has to send both the data and its structure between nodes. Java serialization produces relatively large byte arrays, whereas Kryo serialization produces smaller ones. Thus, you can store more data in the same amount of memory when using Kryo. Furthermore, you can also add compression such as Snappy.&lt;/P&gt;&lt;P&gt;With RDDs and Java serialization there is also additional garbage-collection overhead.&lt;/P&gt;&lt;P&gt;If you're working with RDDs, use Kryo serialization.&lt;/P&gt;&lt;P&gt;With DataFrames, a schema describes the data, so Spark passes only the data between nodes, not the structure. Thus, for certain types of computation on specific file formats you can expect faster performance.&lt;/P&gt;&lt;P&gt;It's not 100% true that DataFrames always outperform RDDs. Please see my post here:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/content/kbentry/42027/rdd-vs-dataframe-vs-sparksql.html" target="_blank"&gt;https://community.hortonworks.com/content/kbentry/42027/rdd-vs-dataframe-vs-sparksql.html&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 09 Aug 2016 03:45:05 GMT</pubDate>
    <dc:creator>bmathew</dc:creator>
    <dc:date>2016-08-09T03:45:05Z</dc:date>
    <item>
      <title>DataFrames with Kryo serialization</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/DataFrames-with-Kryo-serialization/m-p/167790#M37143</link>
      <description>&lt;P&gt;When using DataFrames (Dataset&amp;lt;Row&amp;gt;), there's no option for an Encoder. Does that mean DataFrames (since they build on top of RDDs) use Java serialization? Does using Kryo make sense as an optimization here? If not, what's the difference between Java/Kryo serialization, Tungsten, and Encoders?
Thank you!&lt;/P&gt;</description>
      <pubDate>Mon, 08 Aug 2016 05:48:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/DataFrames-with-Kryo-serialization/m-p/167790#M37143</guid>
      <dc:creator>jestinm</dc:creator>
      <dc:date>2016-08-08T05:48:07Z</dc:date>
    </item>
    <item>
      <title>Re: DataFrames with Kryo serialization</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/DataFrames-with-Kryo-serialization/m-p/167791#M37144</link>
      <description>&lt;P&gt;When you use RDDs in your Java or Scala Spark code, Spark distributes the data to nodes within the cluster using Java serialization by default. For Java and Scala objects, Spark has to send both the data and its structure between nodes. Java serialization produces relatively large byte arrays, whereas Kryo serialization produces smaller ones. Thus, you can store more data in the same amount of memory when using Kryo. Furthermore, you can also add compression such as Snappy.&lt;/P&gt;&lt;P&gt;With RDDs and Java serialization there is also additional garbage-collection overhead.&lt;/P&gt;&lt;P&gt;If you're working with RDDs, use Kryo serialization.&lt;/P&gt;&lt;P&gt;With DataFrames, a schema describes the data, so Spark passes only the data between nodes, not the structure. Thus, for certain types of computation on specific file formats you can expect faster performance.&lt;/P&gt;&lt;P&gt;It's not 100% true that DataFrames always outperform RDDs. Please see my post here:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/content/kbentry/42027/rdd-vs-dataframe-vs-sparksql.html" target="_blank"&gt;https://community.hortonworks.com/content/kbentry/42027/rdd-vs-dataframe-vs-sparksql.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 09 Aug 2016 03:45:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/DataFrames-with-Kryo-serialization/m-p/167791#M37144</guid>
      <dc:creator>bmathew</dc:creator>
      <dc:date>2016-08-09T03:45:05Z</dc:date>
    </item>
    <item>
      <title>Re: DataFrames with Kryo serialization</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/DataFrames-with-Kryo-serialization/m-p/167792#M37145</link>
      <description>&lt;P&gt;Hi Binu, thanks for the answer, but since for DataFrames, Spark still passes data between nodes, does Kryo still make sense as an optimization?&lt;/P&gt;</description>
      <pubDate>Tue, 09 Aug 2016 03:58:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/DataFrames-with-Kryo-serialization/m-p/167792#M37145</guid>
      <dc:creator>jestinm</dc:creator>
      <dc:date>2016-08-09T03:58:16Z</dc:date>
    </item>
    <item>
      <title>Re: DataFrames with Kryo serialization</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/DataFrames-with-Kryo-serialization/m-p/167793#M37146</link>
      <description>&lt;P&gt;Use Kryo when working with RDDs. It probably won't help with DataFrames. I have never used Kryo with DataFrames; maybe you can test it and post your results.&lt;/P&gt;</description>
      <pubDate>Tue, 09 Aug 2016 04:15:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/DataFrames-with-Kryo-serialization/m-p/167793#M37146</guid>
      <dc:creator>bmathew</dc:creator>
      <dc:date>2016-08-09T04:15:56Z</dc:date>
    </item>
  </channel>
</rss>

