New Contributor
Posts: 2
Registered: ‎01-31-2017

Performance comparison: running multiple Hive queries with one SparkContext vs. running them individually

Is there any performance difference between the approaches below?

1) First read all the queries from a Hive table, then run them all in parallel (with HiveContext) using Java threading

2) Use multiple Oozie Spark actions and run each query individually
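To make approach 1) concrete, here is a minimal sketch of the threading pattern in plain Java. It is not taken from a real job: the class name `ParallelQueries`, the `runAll` helper, and the `runQuery` function are hypothetical, and `runQuery` is a stand-in for whatever executes one query on the shared context (for example a lambda that calls `hiveContext.sql(q)`), so the sketch runs without a cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelQueries {

    // Submits every query to a fixed-size thread pool and returns the results
    // in the same order as the input list. runQuery is a hypothetical hook:
    // in a real job it would run the query on the single shared context.
    static <R> List<R> runAll(List<String> queries,
                              Function<String, R> runQuery,
                              int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<R>> tasks = new ArrayList<>();
            for (String q : queries) {
                tasks.add(() -> runQuery.apply(q));
            }
            List<R> results = new ArrayList<>();
            // invokeAll blocks until every task has finished and keeps input order.
            for (Future<R> f : pool.invokeAll(tasks)) {
                results.add(f.get()); // rethrows a failed query as ExecutionException
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // In the real job these strings would be read from the Hive table.
        List<String> queries = Arrays.asList("SELECT 1", "SELECT 2", "SELECT 3");
        // Stand-in for query execution: just returns the length of the query text.
        List<Integer> results = runAll(queries, String::length, 2);
        System.out.println(results);
    }
}
```

Spark allows jobs to be submitted to one SparkContext from several threads, so the pool size here only limits how many queries are in flight at once. How the executors are shared between those jobs depends on the scheduler settings of the application (FIFO by default, with `spark.scheduler.mode=FAIR` as the alternative).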

 

Doubts:

How does Spark allocate resources in the first case when running on YARN?

 

Some thoughts

In 2) we can optimize each job by setting its own config, but how do we optimize all the queries in 1)?

 

Any thoughts on the internal processing on YARN?

 

Thanks in advance

 

 

Cloudera Employee
Posts: 97
Registered: ‎05-10-2016

Re: performance comparison: running multiple hive query with one SparkContext vs running individua

I would suggest doing some benchmarking, but many variables will affect the results, including any resource pools that may be set up.

 

You may see some improvement from running multiple queries within the same Spark context, since you avoid the overhead of starting a driver and a separate set of executors for each query.  Some of Spark's performance improvements come from reusing JVMs instead of spinning up new ones.  You will need to ensure the same resources are available for each test, though.  The overhead also becomes less significant as the processing time of your tasks increases.
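As a rough illustration of that last point, the share of wall-clock time lost to startup is startup / (startup + processing). The sketch below works through that arithmetic; the 30-second startup cost and the processing times are made-up numbers for illustration, not measurements, and the class and method names are hypothetical.

```java
class StartupOverhead {

    // Fraction of total wall-clock time spent on startup rather than processing.
    static double overheadFraction(double startupSeconds, double processingSeconds) {
        return startupSeconds / (startupSeconds + processingSeconds);
    }

    public static void main(String[] args) {
        double startup = 30.0; // assumed cost of launching a driver and executors
        double[] processingTimes = {30.0, 300.0, 3000.0}; // assumed query run times
        for (double processing : processingTimes) {
            System.out.printf("processing=%.0fs overhead=%.1f%%%n",
                    processing, 100 * overheadFraction(startup, processing));
        }
    }
}
```

With a fixed startup cost, the overhead share shrinks as the queries themselves get longer, so the benefit of a shared context matters most for many short queries. Your own benchmark numbers should replace the assumed values.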