Spark SQL for ETL performance tuning
Labels: Apache Spark
Created 04-18-2019 03:06 PM
I am using the Spark SQL CLI to perform ETL operations on Hive tables.
There is a SQL script with a query that involves more than 4 joins across different tables, along with WHERE conditions in each join for filtering, before the result is inserted into a new big table.
The performance of this query is really poor, even with a very small amount of data in each table.
I have tried various properties to improve the performance of the query:
SET spark.max.fetch.failure.per.stage = 10;
SET spark.rpc.io.serverThreads = 64;
SET spark.memory.fraction = 0.8;
SET spark.memory.offHeap.enabled = true;
SET spark.memory.offHeap.size = 3g;
SET spark.shuffle.file.buffer = 1 MB;
SET spark.unsafe.sorter.spill.reader.buffer.size = 1 MB;
SET spark.file.transferTo = false;
SET spark.shuffle.unsafe.file.output.buffer = 5 MB;
SET spark.io.compression.lz4.blockSize=512KB;
SET spark.shuffle.service.index.cache.entries = 2048;
SET spark.shuffle.io.serverThreads = 128;
SET spark.shuffle.io.backLog = 8192;
SET spark.shuffle.registration.timeout = 2m;
SET spark.shuffle.registration.maxAttempt = 5;
But the query still runs for hours.
These are the questions that come to mind:
- Are there any other ways to optimize and troubleshoot the query?
- Can we assume that Spark SQL is not meant to handle a query with more complex joins?
- Should I break the script into multiple scripts with fewer joins each?
Created 04-19-2019 01:56 AM
Check how many executors and how much memory the spark-sql CLI was initialized with (it appears to be running in local mode with a single executor).
- To debug the query, run an EXPLAIN plan on it (see the sketch after this list).
- Check how many files sit in the HDFS directory for each table; if there are too many small files, consolidate them into a smaller number of larger files.
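For illustration, a minimal PySpark sketch of the first two checks; the database, table and column names (db.table_a, a.id, and so on) are placeholders, not the poster's actual schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# See which resource settings the session actually picked up;
# "<not set>" usually means the default (often local mode, one executor).
for key in ("spark.master", "spark.executor.instances",
            "spark.executor.memory", "spark.executor.cores"):
    print(key, "=", spark.conf.get(key, "<not set>"))

# Print the parsed, analyzed, optimized and physical plans for the join.
# Look for SortMergeJoin vs. BroadcastHashJoin and for full table scans.
query = """
    SELECT a.id, b.amount, c.region
    FROM db.table_a a
    JOIN db.table_b b ON a.id = b.a_id AND b.status = 'ACTIVE'
    JOIN db.table_c c ON a.region_id = c.id
"""
spark.sql(query).explain(True)
```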
Another approach would be:
-> Run spark-shell (or) pyspark in local mode or yarn-client mode with more executors and more memory.
-> Load the tables into DataFrames and register them with registerTempTable (Spark 1.x) or createOrReplaceTempView (Spark 2.x).
-> Run your join using spark.sql("<join query>").
-> Check the performance of the query (a sketch follows below).
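As a rough sketch of those steps in PySpark (Spark 2.x), again with placeholder database/table/column names rather than the poster's real ones:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("etl-join-test")
         .enableHiveSupport()
         .getOrCreate())

# Load each Hive table into a DataFrame and register a temp view
# (Spark 2.x: createOrReplaceTempView; Spark 1.x: registerTempTable).
for name in ("table_a", "table_b", "table_c", "table_d"):
    spark.table("db." + name).createOrReplaceTempView(name)

result = spark.sql("""
    SELECT a.id, b.amount, c.region, d.category
    FROM table_a a
    JOIN table_b b ON a.id = b.a_id AND b.status = 'ACTIVE'
    JOIN table_c c ON a.region_id = c.id
    JOIN table_d d ON b.cat_id = d.id
""")

# Materialize the result and time it, then check the Spark UI for
# stages that spill, shuffle heavily, or are skewed.
result.write.mode("overwrite").saveAsTable("db.target_table")
```

Launching pyspark with explicit resources (for example --master yarn --num-executors 4 --executor-memory 4g) also makes it easier to compare against the spark-sql CLI run.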