Spark with YARN: small data also takes a long time

Hi All,


While working with Spark 2.3, Hive, and YARN (HDP 3.1), every job works fine and completes gracefully. But overall, a Spark job on very small data takes about as long as the same job on larger data.


For example, scheduling a job on YARN takes some time whether the data is large or small. So simple Spark queries on small data still finish in 45 seconds to over a minute (I guess this includes YARN's scheduling and resource management time), while a database takes only a few seconds to run the same query.


Can we reduce this time with Spark on HDP 3.1 (a 6-machine cluster)? Or is there another mode for running Spark against small data in Hive in less time, at least for testing?




Re: Spark with YARN: small data also takes a long time

Cloudera Employee



If scheduling resources for Spark jobs takes a long time, please check how many jobs are running in parallel on your cluster. Are you submitting all the jobs to the same pool?


Try creating a new pool, allocating enough resources to it, and submitting your small-data job to that pool so that your application is scheduled immediately.
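
As a rough sketch of the submission step, assuming the pool is exposed as a YARN scheduler queue: the helper below builds a spark-submit command that targets that queue through the --queue option. The queue name small_jobs, the script name small_query.py, and the executor settings are hypothetical placeholders, and the queue itself would have to be created by a cluster admin first.

```python
import shlex


def build_spark_submit(app_path, queue="small_jobs", executors=2, executor_memory="1g"):
    """Return a spark-submit argument list targeting a dedicated YARN queue.

    The queue name and resource sizes are illustrative assumptions;
    adjust them to whatever queue exists on your cluster.
    """
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "client",
        "--queue", queue,                      # YARN queue to submit to
        "--num-executors", str(executors),     # keep small for small data
        "--executor-memory", executor_memory,
        app_path,
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it.
    print(shlex.join(build_spark_submit("small_query.py")))
```

The same queue can also be selected from inside the application by setting the spark.yarn.queue property. Requesting only a few small executors may help a small job get its containers sooner, though the effect depends on how busy the cluster is.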