<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Spark Sql for ETL performance tuning in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Spark-Sql-for-ETL-performance-tuning/m-p/238616#M200427</link>
    <description>&lt;P&gt;I am using the Spark SQL CLI to perform ETL operations on Hive tables.&lt;/P&gt;&lt;P&gt;One SQL script contains a query with more than four joins across different tables, with a WHERE condition on each join to filter rows before the result is inserted into a new, large table.&lt;/P&gt;&lt;P&gt;The performance of this query is really poor, even with a very small amount of data in each table.&lt;/P&gt;&lt;P&gt;I have tried setting various properties to improve the performance of the query:&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;SET spark.max.fetch.failure.per.stage = 10;&lt;/P&gt;&lt;P&gt;SET spark.rpc.io.serverThreads = 64;&lt;/P&gt;&lt;P&gt;SET spark.memory.fraction = 0.8;&lt;/P&gt;&lt;P&gt;SET spark.memory.offHeap.enabled = true;&lt;/P&gt;&lt;P&gt;SET spark.memory.offHeap.size = 3g;&lt;/P&gt;&lt;P&gt;SET spark.shuffle.file.buffer = 1 MB;&lt;/P&gt;&lt;P&gt;SET spark.unsafe.sorter.spill.reader.buffer.size = 1 MB;&lt;/P&gt;&lt;P&gt;SET spark.file.transferTo = false;&lt;/P&gt;&lt;P&gt;SET spark.shuffle.unsafe.file.output.buffer = 5 MB;&lt;/P&gt;&lt;P&gt;SET spark.io.compression.lz4.blockSize=512KB;&lt;/P&gt;&lt;P&gt;SET spark.shuffle.service.index.cache.entries = 2048;&lt;/P&gt;&lt;P&gt;SET spark.shuffle.io.serverThreads = 128;&lt;/P&gt;&lt;P&gt;SET spark.shuffle.io.backLog = 8192;&lt;/P&gt;&lt;P&gt;SET spark.shuffle.registration.timeout = 2m;&lt;/P&gt;&lt;P&gt;SET spark.shuffle.registration.maxAttempt = 5;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;But the query still runs for hours.&lt;/P&gt;&lt;P&gt;These are the questions that come to mind:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Are there any other ways to optimize and troubleshoot the query?&lt;/LI&gt;&lt;LI&gt;Should I conclude that Spark SQL is not meant to handle queries with many complex joins?&lt;/LI&gt;&lt;LI&gt;Should I break the script into multiple scripts, each with fewer joins?&lt;/LI&gt;&lt;/OL&gt;</description>
    <pubDate>Thu, 18 Apr 2019 22:06:40 GMT</pubDate>
    <dc:creator>barath51777</dc:creator>
    <dc:date>2019-04-18T22:06:40Z</dc:date>
  </channel>
</rss>