I am using the Spark SQL CLI to perform ETL operations on Hive tables.
One SQL script contains a query with more than 4 joins across different tables, each join carrying WHERE conditions for filtering, before the result is inserted into a new large table.
The performance of this query is very poor, even with a very small amount of data in each table.
I have tried various properties to improve the performance of the query:
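A simplified sketch of the shape of the query (table and column names here are hypothetical, not the real ones):

```sql
-- Simplified, hypothetical version of the real query:
-- several joins, each with its own filter, feeding an INSERT
INSERT INTO TABLE target_big_table
SELECT a.id, b.val1, c.val2, d.val3
FROM table_a a
JOIN table_b b ON a.id = b.a_id AND b.status = 'ACTIVE'
JOIN table_c c ON a.id = c.a_id AND c.dt = '2019-01-01'
JOIN table_d d ON b.k  = d.k    AND d.flag = 1
WHERE a.dt = '2019-01-01';
```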
SET spark.max.fetch.failure.per.stage = 10;
SET spark.rpc.io.serverThreads = 64;
SET spark.memory.fraction = 0.8;
SET spark.memory.offHeap.enabled = true;
SET spark.memory.offHeap.size = 3g;
SET spark.shuffle.file.buffer = 1m;
SET spark.unsafe.sorter.spill.reader.buffer.size = 1m;
SET spark.file.transferTo = false;
SET spark.shuffle.unsafe.file.output.buffer = 5m;
SET spark.io.compression.lz4.blockSize = 512k;
SET spark.shuffle.service.index.cache.entries = 2048;
SET spark.shuffle.io.serverThreads = 128;
SET spark.shuffle.io.backLog = 8192;
SET spark.shuffle.registration.timeout = 2m;
SET spark.shuffle.registration.maxAttempts = 5;
The query still runs for hours.
My questions are:
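For troubleshooting, the join strategy Spark actually picks can be inspected directly from the CLI by prefixing the query with `EXPLAIN` (table names hypothetical):

```sql
-- Prefix the query with EXPLAIN to print the physical plan;
-- look for SortMergeJoin vs BroadcastHashJoin nodes in the output
EXPLAIN
SELECT a.id, b.val1
FROM table_a a
JOIN table_b b ON a.id = b.a_id;
```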
- Are there any other ways to optimize and troubleshoot the query?
- Can we assume that Spark SQL is not meant to handle a query with this many joins?
- Should I break the script into multiple scripts, each with fewer joins?
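One option I have not tried yet is forcing broadcast joins, since every table involved is small; in Spark SQL this can be done with a hint (a sketch, table names hypothetical):

```sql
-- Hint Spark to broadcast the small tables instead of
-- shuffling both sides into a sort-merge join (names hypothetical)
SELECT /*+ BROADCAST(b), BROADCAST(c) */ a.id, b.val1, c.val2
FROM table_a a
JOIN table_b b ON a.id = b.a_id
JOIN table_c c ON a.id = c.a_id;
```

Raising `spark.sql.autoBroadcastJoinThreshold` should have a similar effect without touching the query, if the table statistics are available.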