Spark SQL for ETL performance tuning

New Contributor

I am using the Spark SQL CLI to perform ETL operations on Hive tables.

There is a SQL script with a query that involves more than four joins across different tables, with WHERE conditions on each join for filtering, before the result is inserted into a new, large table.

The performance of this query is very poor, even with a very small amount of data in each table.

I have tried setting various properties to improve the performance of the query:


SET spark.max.fetch.failure.per.stage = 10;

SET spark.rpc.io.serverThreads = 64;

SET spark.memory.fraction = 0.8;

SET spark.memory.offHeap.enabled = true;

SET spark.memory.offHeap.size = 3g;

SET spark.shuffle.file.buffer = 1 MB;

SET spark.unsafe.sorter.spill.reader.buffer.size = 1 MB;

SET spark.file.transferTo = false;

SET spark.shuffle.unsafe.file.output.buffer = 5 MB;

SET spark.io.compression.lz4.blockSize=512KB;

SET spark.shuffle.service.index.cache.entries = 2048;

SET spark.shuffle.io.serverThreads = 128;

SET spark.shuffle.io.backLog = 8192;

SET spark.shuffle.registration.timeout = 2m;

SET spark.shuffle.registration.maxAttempt = 5;


But the query still runs for hours.

These are the questions on my mind:

  1. Are there any other ways to optimize and troubleshoot the query?
  2. Can we assume that Spark SQL is not meant to handle queries with more complex joins?
  3. Should I break the script into multiple scripts with fewer joins each?
1 REPLY

Master Guru

@Barath Natarajan

Check how many executors, and how much memory, the spark-sql CLI has been initialized with (it appears to be running in local mode with a single executor).
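For example, you could launch the CLI with explicit resources on YARN instead of the defaults. This is a minimal sketch; the resource numbers and the script name etl_script.sql are placeholders you would adjust for your cluster:

spark-sql --master yarn --num-executors 4 --executor-memory 4g --executor-cores 2 -f etl_script.sql

These are the standard spark-submit options, which the spark-sql script passes through.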

  1. To debug the query, run an EXPLAIN plan on it (see the sketch after this list).
  2. Check how many files each table's HDFS directory contains; if there are too many small files, consolidate them into a smaller number.
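In the spark-sql CLI you can prefix the query with EXPLAIN EXTENDED to see the parsed, optimized, and physical plans. A minimal sketch; the table and column names are placeholders for your own:

EXPLAIN EXTENDED
SELECT a.id, b.amount
FROM db.table_a a
JOIN db.table_b b ON a.id = b.id
WHERE b.amount > 0;

In the physical plan, check whether the small tables come through as broadcast joins rather than sort-merge joins; if not, spark.sql.autoBroadcastJoinThreshold may need adjusting.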

Another approach would be:
-> Run spark-shell (or pyspark) in local or yarn-client mode with more executors and more memory.
-> Load the tables into DataFrames and register them with registerTempTable (Spark 1.x) or createOrReplaceTempView (Spark 2.x).
-> Run your join using spark.sql("<join query>").
-> Check the performance of the query (a sketch of this approach follows below).
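A minimal sketch of that approach in spark-shell (Scala, Spark 2.x); the database, table, and column names are placeholders for your own:

// Placeholder Hive tables; assumes spark-shell was launched with enough executors/memory
// and that the SparkSession has Hive support enabled.
val orders    = spark.table("db.orders")
val customers = spark.table("db.customers")

orders.createOrReplaceTempView("orders_v")
customers.createOrReplaceTempView("customers_v")

// Run the join through Spark SQL.
val joined = spark.sql("""
  SELECT o.order_id, c.customer_name, o.amount
  FROM orders_v o
  JOIN customers_v c ON o.customer_id = c.customer_id
  WHERE o.amount > 0
""")

joined.explain(true)   // inspect the logical and physical plans before writing anything
joined.write.mode("overwrite").saveAsTable("db.orders_enriched")

Compare the runtime of this against the spark-sql CLI run; if it improves with more executors and memory, the original problem is resource allocation rather than the query itself.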