<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Joining large tables (dataframe and sql), but only want few columns: select before or after join? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Joining-large-tables-dataframe-and-sql-but-only-want-few/m-p/109916#M33725</link>
    <description>&lt;P&gt;Concerning memory usage and efficiency: when joining two large tables that each have many columns, but I only want a few columns from each of them, is it better to select() before or after the join()? My instinct tells me to select() before the join, but some other voices would be very helpful.&lt;/P&gt;</description>
    <pubDate>Sat, 02 Jul 2016 20:47:46 GMT</pubDate>
    <dc:creator>jestinm</dc:creator>
    <dc:date>2016-07-02T20:47:46Z</dc:date>
    <item>
      <title>Joining large tables (dataframe and sql), but only want few columns: select before or after join?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Joining-large-tables-dataframe-and-sql-but-only-want-few/m-p/109916#M33725</link>
      <description>&lt;P&gt;Concerning memory usage and efficiency: when joining two large tables that each have many columns, but I only want a few columns from each of them, is it better to select() before or after the join()? My instinct tells me to select() before the join, but some other voices would be very helpful.&lt;/P&gt;</description>
      <pubDate>Sat, 02 Jul 2016 20:47:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Joining-large-tables-dataframe-and-sql-but-only-want-few/m-p/109916#M33725</guid>
      <dc:creator>jestinm</dc:creator>
      <dc:date>2016-07-02T20:47:46Z</dc:date>
    </item>
    <item>
      <title>Re: Joining large tables (dataframe and sql), but only want few columns: select before or after join?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Joining-large-tables-dataframe-and-sql-but-only-want-few/m-p/109917#M33726</link>
      <description>&lt;P&gt;Yes. Reducing the size of the dataset before the JOIN would surely help, rather than the other way round.&lt;/P&gt;</description>
      <pubDate>Sat, 02 Jul 2016 22:31:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Joining-large-tables-dataframe-and-sql-but-only-want-few/m-p/109917#M33726</guid>
      <dc:creator>psingh15</dc:creator>
      <dc:date>2016-07-02T22:31:51Z</dc:date>
    </item>
    <item>
      <title>Re: Joining large tables (dataframe and sql), but only want few columns: select before or after join?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Joining-large-tables-dataframe-and-sql-but-only-want-few/m-p/109918#M33727</link>
      <description>&lt;P&gt;Does this also hold for other methods besides JOIN? E.g., I want to do a groupBy. Should I select() before the groupBy()?&lt;/P&gt;</description>
      <pubDate>Sun, 03 Jul 2016 00:07:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Joining-large-tables-dataframe-and-sql-but-only-want-few/m-p/109918#M33727</guid>
      <dc:creator>jestinm</dc:creator>
      <dc:date>2016-07-03T00:07:41Z</dc:date>
    </item>
    <item>
      <title>Re: Joining large tables (dataframe and sql), but only want few columns: select before or after join?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Joining-large-tables-dataframe-and-sql-but-only-want-few/m-p/109919#M33728</link>
      <description>&lt;P&gt;Yes. You can think of select() as a "filter" for columns, where filter() filters rows. You want to reduce the impact of the shuffle as much as possible, so perform both of these as early as possible. The groupBy() will most likely cause a shuffle by key, so be careful with it. If you can accomplish what you need with reduceByKey(), you should use that instead, since it combines values within each partition before the shuffle.&lt;/P&gt;&lt;P&gt;If you mean DataFrame rather than dataset, Spark SQL will handle much of this optimization for you. But if you are using plain RDDs, you will have to handle these kinds of optimizations on your own.&lt;/P&gt;</description>
      <pubDate>Sun, 03 Jul 2016 09:51:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Joining-large-tables-dataframe-and-sql-but-only-want-few/m-p/109919#M33728</guid>
      <dc:creator>don_jernigan</dc:creator>
      <dc:date>2016-07-03T09:51:06Z</dc:date>
    </item>
    <item>
      <title>Re: Joining large tables (dataframe and sql), but only want few columns: select before or after join?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Joining-large-tables-dataframe-and-sql-but-only-want-few/m-p/109920#M33729</link>
      <description>&lt;P&gt;Yes. A projection before any sort of transformation or action would help reduce both computation time and storage.&lt;/P&gt;</description>
      <pubDate>Mon, 04 Jul 2016 12:03:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Joining-large-tables-dataframe-and-sql-but-only-want-few/m-p/109920#M33729</guid>
      <dc:creator>psingh15</dc:creator>
      <dc:date>2016-07-04T12:03:41Z</dc:date>
    </item>
  </channel>
</rss>

