I am dealing with a tricky situation where I have both small tables and big tables to process using Spark, and it must be a single Spark job.
To meet my performance targets, I need to set <code>spark.sql.shuffle.partitions = 12</code> for the small tables and <code>spark.sql.shuffle.partitions = 500</code> for the bigger tables.
Despite having dynamic allocation enabled, a partition count tuned for the bigger tables produces far too many stages and tasks when processing the smaller tables, and performance suffers.
How can I change these properties dynamically in Spark? Can I have multiple configuration files and load the right one from within the program?
You have several options.
1. Before running the query, get the size of the table programmatically (e.g. with <code>hdfs dfs -du -s hdfsPath</code>). If the size is above some threshold X, set the config accordingly in your code.
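A minimal sketch of option 1. Note that <code>spark.sql.shuffle.partitions</code> is a runtime SQL conf, so it can be changed with <code>spark.conf.set</code> between queries on the same <code>SparkSession</code> — no restart needed. The 10 GB threshold and the paths here are hypothetical; tune them for your cluster.

```python
import subprocess

# Hypothetical cutoff between "small" and "big" tables -- tune for your data.
SIZE_THRESHOLD_BYTES = 10 * 1024**3  # 10 GB

def hdfs_size_bytes(path: str) -> int:
    """Total size of an HDFS path, read from the first column of `hdfs dfs -du -s`."""
    out = subprocess.check_output(["hdfs", "dfs", "-du", "-s", path]).decode()
    return int(out.split()[0])

def shuffle_partitions_for(size_bytes: int) -> int:
    """Pick the partition count from the table size (12 small / 500 big, per the question)."""
    return 500 if size_bytes >= SIZE_THRESHOLD_BYTES else 12

# Inside the job, flip the conf per table before each shuffle-heavy query, e.g.:
#   n = shuffle_partitions_for(hdfs_size_bytes("/data/my_table"))
#   spark.conf.set("spark.sql.shuffle.partitions", n)
#   result = spark.table("my_table").groupBy("key").count()  # uses the new value
```

Because the setting only affects queries planned after the <code>spark.conf.set</code> call, you can loop over your tables, setting 12 or 500 each time, all within the one job.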
2. If using <code>spark-shell</code> or <code>spark-submit</code>, you can pass <code>--conf key=value</code> and set the parameter explicitly per run.
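For example, a submit-time override (the application file name is just a placeholder):

```shell
# Launch with the "big table" setting; use 12 instead for a small-table run.
spark-submit \
  --conf spark.sql.shuffle.partitions=500 \
  my_job.py
```

This fixes the value for the whole application, so it only helps if each run handles tables of one size class; for mixed workloads in a single job, prefer setting the conf at runtime as in option 1.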
3. From Oozie, do the same as step 1 but through an Oozie shell action: compute the size in a script, use <code>capture-output</code> to expose it, and feed the resulting value to the Spark action as a parameter.
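A rough sketch of the Oozie wiring, assuming a hypothetical <code>get_size.sh</code> that prints a line like <code>partitions=500</code> (Java-properties format, which is what <code>capture-output</code> parses); action names and the schema version are illustrative:

```xml
<action name="get-size">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>get_size.sh</exec>
        <file>get_size.sh</file>
        <capture-output/>
    </shell>
    <ok to="spark-job"/>
    <error to="fail"/>
</action>
```

The downstream action can then read the captured value with <code>${wf:actionData('get-size')['partitions']}</code> and pass it along, e.g. as <code>--conf spark.sql.shuffle.partitions=...</code>.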