
Performance comparison: running multiple Hive queries with one SparkContext vs. running them individually


New Contributor

Is there any performance difference between the approaches below?

1) Read all queries from a Hive table first and run them all in parallel (with HiveContext) using Java threading

2) Use multiple Oozie Spark actions and run each query individually

 

Doubts:

How does Spark allocate resources in the first case when running on YARN?

 

Some thoughts

In 2) we can optimize each job by setting its own configuration, but how do we optimize all of the queries in 1)?

 

Any thoughts about the internal processing on YARN?

 

Thanks in advance

 

 

1 REPLY

Re: performance comparison: running multiple hive query with one SparkContext vs running individua

Expert Contributor

I would suggest doing some benchmarking, but many variables will affect the result, including any resource pools that may be set up.

 

You may see some improvement from running multiple queries within the same Spark context, as you will have less overhead from starting the driver and separate executor nodes. Some of Spark's performance improvements come from reusing JVMs instead of spinning up new ones. You will need to ensure that the same resources are available for each test, though. The overhead becomes less significant as the processing time of your tasks increases.
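
As a starting point for that kind of benchmark, here is a minimal sketch of approach 1) in plain Java. It is an illustration under assumptions, not a tested Spark job: the class name SharedContextBenchmark, the method runInParallel and the sleeping stand-in runner are all hypothetical. In a real job the runQuery function would wrap something like hiveContext.sql(query) followed by an action, with every thread sharing the one context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class SharedContextBenchmark {

    /**
     * Submits every query to a fixed-size thread pool and returns the
     * results in the same order as the input list. runQuery is a
     * hypothetical hook: in a Spark job it would call the shared context.
     */
    public static <R> List<R> runInParallel(List<String> queries,
                                            Function<String, R> runQuery,
                                            int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String query : queries) {
                Callable<R> task = () -> runQuery.apply(query);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // blocks until that query finishes
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> queries = Arrays.asList("query 1", "query 2", "query 3", "query 4");

        // Stand-in for a real query: sleeps to simulate work, returns a dummy value.
        Function<String, Integer> fakeRunner = q -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return q.length();
        };

        // One thread is equivalent to running the queries one after another.
        long start = System.nanoTime();
        runInParallel(queries, fakeRunner, 1);
        long sequentialMs = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        runInParallel(queries, fakeRunner, 4);
        long parallelMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("sequential ms: " + sequentialMs);
        System.out.println("parallel ms:   " + parallelMs);
    }
}
```

Keep in mind that jobs submitted from several threads inside one Spark application are scheduled FIFO by default, so a large query can hold back the others; setting spark.scheduler.mode to FAIR lets them share the executors. To compare fairly with approach 2), the timing for the Oozie runs should include the driver and executor startup of each separate application.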