
Performance comparison: running multiple Hive queries with one SparkContext vs. running them individually

New Contributor

Is there any performance difference between the approaches below?

1) First read all the queries from a Hive table, then run them all in parallel (with HiveContext) using Java threading

2) Use multiple Oozie Spark actions and run each query individually



How does Spark allocate resources in the first case when running on YARN?


Some thoughts

In 2) we can optimize each job by setting its own config, but how do we optimize all the queries in 1)?


Any thoughts about the internal processing on YARN?


Thanks in advance




Expert Contributor

I would suggest doing some benchmarking, because many variables affect the result, including any resource pools that may be set up.


You may see some improvement from running multiple queries within the same Spark context, since you avoid the overhead of starting a separate driver and separate executors for each query. Some of Spark's performance improvements come from reusing JVMs instead of spinning up new ones. You will need to ensure the same resources are available for each test, though. The startup overhead also becomes less significant as the processing times of your tasks increase.
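For approach 1), a minimal sketch of the threading side might look like the following. It is only an illustration under stated assumptions: the class name `ParallelQueries`, the method `runAll`, and the `runner` parameter are hypothetical, and `runner` stands in for whatever you do with the shared context (for example `q -> hiveContext.sql(q).count()`). No Spark classes are used here, so the sketch compiles on its own.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: runs a list of query strings concurrently on a
// fixed-size thread pool and collects one result per query.
class ParallelQueries {

    // `runner` is a placeholder for the real work. With Spark it would wrap
    // the single shared context, e.g. q -> hiveContext.sql(q).count().
    // Note: results are keyed by query text, so duplicate queries collapse
    // into one map entry.
    static <R> Map<String, R> runAll(List<String> queries,
                                     Function<String, R> runner,
                                     int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every query first so they can run at the same time.
            List<Future<R>> futures = new ArrayList<>();
            for (String q : queries) {
                Callable<R> task = () -> runner.apply(q);
                futures.add(pool.submit(task));
            }
            // Then block until each result is available, keeping input order.
            Map<String, R> results = new LinkedHashMap<>();
            for (int i = 0; i < queries.size(); i++) {
                results.put(queries.get(i), futures.get(i).get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

With a real context, the call would be something like `ParallelQueries.runAll(queries, q -> hiveContext.sql(q).count(), 4)`. Jobs submitted from separate threads share the executors of the one application, and by default Spark schedules them FIFO; setting `spark.scheduler.mode` to `FAIR` lets concurrent jobs share resources more evenly. Which setting works better for your queries is something the benchmark would have to show.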
