Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Spark Sql for ETL performance tuning

Highlighted

Spark Sql for ETL performance tuning

New Contributor

I am using spark sql cli for performing ETL operations on hive tables.

There is a sql script query which involves more than 4 joins into different tables along with where conditions in each joins for filtering before inserting it to a new big table.

The performance of this query is really poor even with really small amount of data in each tables.

I have tried using various properties for improvising the performance of the query.


SET spark.max.fetch.failure.per.stage = 10;

SET spark.rpc.io.serverThreads = 64;

SET spark.memory.fraction = 0.8;

SET spark.memory.offHeap.enabled = true;

SET spark.memory.offHeap.size = 3g;

SET spark.shuffle.file.buffer = 1 MB;

SET spark.unsafe.sorter.spill.reader.buffer.size = 1 MB;

SET spark.file.transferTo = false;

SET spark.shuffle.file.buffer = 1 MB;

SET spark.shuffle.unsafe.file.output.buffer = 5 MB;

SET spark.io.compression.lz4.blockSize=512KB;

SET spark.shuffle.service.index.cache.entries = 2048;

SET spark.shuffle.io.serverThreads = 128;

SET spark.shuffle.io.backLog = 8192;

SET spark.shuffle.registration.timeout = 2m;

SET spark.shuffle.registration.maxAttempt = 5;


But still the query runs for hours.

These are the questions that pops out from my mind:

  1. Is there any other ways to optimize and troubleshoot the query?
  2. Can we assume that Spark sql is not meant to handle a query with more complex joins?
  3. Should I break the script to multiple scripts with less number of Joins?
1 REPLY 1

Re: Spark Sql for ETL performance tuning

Super Guru

@Barath Natarajan

Check out how many executors and memory that spark-sql cli has been initialized(it seems to be running on local mode with one executor).

  1. To debug the query run an explain plan on the query.
  2. Check out how many files in hdfs directory for each table, if too many files then consolidate them to smaller number.

Another approach would be:
-> Run spark-shell (or) pyspark with local mode/yarn-client mode with more number of executors/more memory
-> Then load the tables into dataframe and then registerTempTable(spark1.X)/createOrReplaceTempView(if using spark2)
-> Run your join using spark.sql("<join query>")
-> Check out the performance of the query.

Don't have an account?
Coming from Hortonworks? Activate your account here