
Spark Streaming: How to distribute RDDs evenly across the cluster

We have a Spark Streaming application running on a cluster managed by YARN. The job is very slow, and I see an uneven distribution, as shown in the picture, where I have highlighted one node that holds most of the RDDs.




From my research, I found that we can use repartition or coalesce, but I am a little lost on which one to use.

I would greatly appreciate it if any of the experts in the community could help out, as we are new to Spark.




Re: Spark Streaming: How to distribute RDDs evenly across the cluster

Can't see the pic yet (mods need to approve it).

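Both methods change the number of partitions. Repartition is usually used to increase the number, while coalesce reduces it. The major difference is in the shuffle. Repartition always induces a full shuffle, whether it increases or decreases the partition count. Coalesce does not shuffle by default: it collapses existing partitions into fewer ones, preferring partitions that sit on the same executor. That makes it cheaper, but it cannot increase the partition count and it will not even out partitions that are already skewed. Coalesce shuffles only if you explicitly ask it to (shuffle = true in the RDD API), at which point it behaves like repartition.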

Shuffle, as always, is an expensive operation, so you must determine whether its cost is worth the gain in overall job performance.
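To make the difference concrete, here is a small toy model in plain Python. It is not Spark code and not how Spark implements either method; the functions `coalesce`, `repartition`, and `sizes` and the `skewed` data are made up for illustration. It only assumes that coalesce without a shuffle merges whole partitions, and that repartition redistributes individual records roughly round-robin.

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], num_partitions: int) -> List[Partition]:
    """Toy model of coalesce without a shuffle: merge whole partitions."""
    if num_partitions >= len(partitions):
        # Without a shuffle the partition count cannot grow, so nothing changes.
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        # Neighbouring source partitions land in the same target partition.
        target = i * num_partitions // len(partitions)
        merged[target].extend(part)
    return merged


def repartition(partitions: List[Partition], num_partitions: int) -> List[Partition]:
    """Toy model of repartition: every record is moved, round-robin."""
    out: List[Partition] = [[] for _ in range(num_partitions)]
    i = 0
    for part in partitions:
        for record in part:
            out[i % num_partitions].append(record)
            i += 1
    return out


def sizes(partitions: List[Partition]) -> List[int]:
    """Number of records in each partition."""
    return [len(p) for p in partitions]


# Hypothetical skewed input: one partition holds 90 of 100 records.
skewed = [list(range(90)), list(range(90, 94)), list(range(94, 97)), list(range(97, 100))]

print(sizes(skewed))                  # [90, 4, 3, 3]
print(sizes(coalesce(skewed, 2)))     # [94, 6]  (still skewed)
print(sizes(repartition(skewed, 4)))  # [25, 25, 25, 25]
```

In this model the skew survives coalesce and disappears after repartition, which is why repartition is usually the one to try when a single node holds most of the data, provided the shuffle cost is acceptable. In Spark itself the corresponding RDD calls are rdd.repartition(n) and rdd.coalesce(n).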

Can you share the number of executors, number of partitions, number of workers, and any other stats around this to help better understand what is happening since the pic is not available yet?