
How to set multiple spark configurations for single spark job


New Contributor

I am dealing with a tricky situation: I have both small tables and big tables to process with Spark, and it all has to happen in a single Spark job.

To hit my performance targets, I need to set the property spark.sql.shuffle.partitions to different values:

spark.sql.shuffle.partitions = 12 for the small tables
spark.sql.shuffle.partitions = 500 for the bigger tables

Even with dynamic allocation enabled, the number of stages and tasks keeps growing, and a setting tuned for the smaller tables gives poor performance on the bigger tables.

How can I change these properties dynamically within a Spark job? Can I have multiple configuration files and load them from within the program?

1 REPLY

Re: How to set multiple spark configurations for single spark job

You have various options:
1. Before processing each table, get its size programmatically (for example with "hdfs dfs -du -s <hdfsPath>" or the Hadoop FileSystem API). If the size is above some threshold X, set the configs in your code accordingly; see the sketch after this list.
2. If using spark-shell or spark-submit, you can pass --conf key=value to set the parameters explicitly.
3. From Oozie, do the same as step 1, but through an Oozie shell action with capture-output, and set the parameter from there.
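
For option 1, here is a minimal sketch in Scala of how the size check and the runtime change could look. The paths, the 1 GB threshold, the Parquet format, and the aggregation are only illustrative assumptions; the relevant point is that spark.sql.shuffle.partitions can be changed through spark.conf.set() and is picked up by shuffles planned after the call, so it can differ per table inside one job.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-table-job").getOrCreate()

// Equivalent of "hdfs dfs -du -s <path>", done through the FileSystem API.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
def dirSizeBytes(path: String): Long =
  fs.getContentSummary(new Path(path)).getLength

// Choose the shuffle-partition count from the size and set it at runtime.
// spark.sql.shuffle.partitions is read when a query is planned, so changing
// it between tables takes effect for the next DataFrame action.
def processTable(path: String): Unit = {
  val partitions = if (dirSizeBytes(path) < 1L * 1024 * 1024 * 1024) 12 else 500
  spark.conf.set("spark.sql.shuffle.partitions", partitions.toString)

  val df = spark.read.parquet(path)              // assumes Parquet input
  df.groupBy("some_key").count()                 // hypothetical aggregation
    .write.mode("overwrite").parquet(path + "_agg")
}

Seq("/data/small_table", "/data/big_table").foreach(processTable)

In spark-shell the SparkSession already exists as spark, so only the size check and the spark.conf.set() call are needed there.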
