01-31-2017 05:24 PM
Is there any performance difference between the approaches below?
1) First read all queries from a Hive table, then run all of them in parallel (with HiveContext) using Java threading.
2) Use multiple Oozie Spark actions and run each query individually.
How does Spark allocate resources in the first case when running on YARN?
In 2) we can optimize each job by setting its configuration, but how do we optimize all the queries in 1)?
Any thoughts about the internal processing on YARN?
Thanks in advance.
02-02-2017 06:49 AM
I would suggest doing some benchmarking, since many variables affect this, including any resource pools that may be set up.
You may see some improvement from running multiple queries within the same Spark context, as you will have less overhead from starting the driver and separate executors. Some of Spark's performance improvements come from reusing JVMs instead of spinning up new ones. You will need to ensure the same resources are available for each test, though. This startup overhead becomes less significant as the processing time of your tasks increases.
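
For the first approach, here is a rough sketch of what running several queries from Java threads could look like. It is only an illustration under assumptions: the class name `ParallelQueries`, the method `runAll`, and the `runQuery` function are all made up for this post, and `runQuery` stands in for whatever you actually call on the shared context (for example something like `q -> hiveContext.sql(q).count()`). The sketch uses only the JDK so it can be tried without a cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelQueries {

    // Submits every query to a fixed-size thread pool and returns the results
    // in the same order as the input list.
    // runQuery is a hypothetical stand-in: with a shared HiveContext it could be
    // something like q -> hiveContext.sql(q).count().
    public static <R> List<R> runAll(List<String> queries,
                                     int threads,
                                     Function<String, R> runQuery) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> pending = new ArrayList<>();
            for (String q : queries) {
                Callable<R> task = () -> runQuery.apply(q);
                pending.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : pending) {
                // get() blocks until that query has finished and rethrows its failure, if any
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder queries and a fake runner, just to show the call shape.
        List<String> queries = Arrays.asList("query one", "query two", "query three");
        List<String> out = runAll(queries, 2, q -> "ran: " + q);
        for (String line : out) {
            System.out.println(line);
        }
    }
}
```

Note that all of these queries share one application's executors. By default Spark schedules jobs inside a single application in FIFO order, so if you go this way it may be worth benchmarking with `spark.scheduler.mode` set to `FAIR` as well, since that lets jobs submitted from different threads share the executors more evenly.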