I am using the Spark SQL CLI to perform ETL operations on Hive tables. One SQL script contains a query that joins more than 4 different tables, with WHERE conditions on each join to filter rows before inserting the result into a new, large table (see the first sketch below for the general shape). The performance of this query is really poor, even with a very small amount of data in each table.

I have tried setting various properties to improve the performance of the query:

SET spark.max.fetch.failure.per.stage = 10;
SET spark.rpc.io.serverThreads = 64;
SET spark.memory.fraction = 0.8;
SET spark.memory.offHeap.enabled = true;
SET spark.memory.offHeap.size = 3g;
SET spark.shuffle.file.buffer = 1m;
SET spark.unsafe.sorter.spill.reader.buffer.size = 1m;
SET spark.file.transferTo = false;
SET spark.shuffle.unsafe.file.output.buffer = 5m;
SET spark.io.compression.lz4.blockSize = 512k;
SET spark.shuffle.service.index.cache.entries = 2048;
SET spark.shuffle.io.serverThreads = 128;
SET spark.shuffle.io.backLog = 8192;
SET spark.shuffle.registration.timeout = 2m;
SET spark.shuffle.registration.maxAttempts = 5;

But the query still runs for hours. These are the questions on my mind:

Are there any other ways to optimize and troubleshoot this query?
Can we assume that Spark SQL is not meant to handle queries with this many joins?
Should I break the script into multiple scripts with fewer joins each (see the second sketch below)?
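First sketch, for reference: this is a minimal illustration of the shape of the query. All table names, columns, and filter values here are made up; the real script joins more tables and applies more conditions:

INSERT INTO TABLE target_big_table
SELECT a.id, b.attr1, c.attr2, d.attr3, e.attr4
FROM table_a a
JOIN table_b b ON a.id = b.a_id AND b.status = 'ACTIVE'      -- a filter condition on each join
JOIN table_c c ON b.id = c.b_id AND c.region = 'US'
JOIN table_d d ON c.id = d.c_id AND d.flag = 1
JOIN table_e e ON d.id = e.d_id AND e.created_date >= '2019-01-01';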
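Second sketch: by "breaking the script into multiple scripts", I mean staging intermediate results into temporary tables, roughly like this (again, all names are hypothetical):

-- Stage the first join into an intermediate table
CREATE TABLE stage_ab AS
SELECT a.id AS a_id, b.id AS b_id, b.attr1
FROM table_a a
JOIN table_b b ON a.id = b.a_id AND b.status = 'ACTIVE';

-- Join the staged result against the next table
CREATE TABLE stage_abc AS
SELECT s.a_id, c.id AS c_id, s.attr1, c.attr2
FROM stage_ab s
JOIN table_c c ON s.b_id = c.b_id AND c.region = 'US';

-- ...and so on, one or two joins per step, with a final INSERT into the big table.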