<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark Sql for ETL performance tuning in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Spark-Sql-for-ETL-performance-tuning/m-p/238617#M200428</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/112382/barath51777.html" nodeid="112382"&gt;@Barath Natarajan&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Check how many executors and how much memory the &lt;STRONG&gt;&lt;U&gt;spark-sql CLI&lt;/U&gt;&lt;/STRONG&gt; was initialized with (it appears to be running in local mode with a single executor).&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;To debug the query, run an &lt;STRONG&gt;explain plan&lt;/STRONG&gt; on it.&lt;/LI&gt;&lt;LI&gt;Check how &lt;STRONG&gt;many files&lt;/STRONG&gt; each table has in its HDFS directory; if there are too many, consolidate them into a smaller number of larger files.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;&lt;EM&gt;Another approach would be:&lt;/EM&gt;&lt;/U&gt;&lt;/STRONG&gt;&lt;BR /&gt;-&amp;gt; Run &lt;STRONG&gt;spark-shell (or) pyspark&lt;/STRONG&gt; in &lt;STRONG&gt;local mode/yarn-client&lt;/STRONG&gt; mode with more executors/more memory.&lt;BR /&gt;-&amp;gt; Load the tables into &lt;STRONG&gt;DataFrames&lt;/STRONG&gt; and then &lt;STRONG&gt;registerTempTable (Spark 1.x) / createOrReplaceTempView (Spark 2)&lt;/STRONG&gt;.&lt;BR /&gt;-&amp;gt; Run your join with spark.sql("&amp;lt;join query&amp;gt;").&lt;BR /&gt;-&amp;gt; Check the performance of the query.&lt;/P&gt;</description>
    <pubDate>Fri, 19 Apr 2019 08:56:41 GMT</pubDate>
    <dc:creator>Shu_ashu</dc:creator>
    <dc:date>2019-04-19T08:56:41Z</dc:date>
  </channel>
</rss>

