What are the important metrics to notice for each Stage in Spark UI?

Expert Contributor

I am running a Spark job on an HDFS file of 182.4 GB. This is the config I passed to get the fastest computing time, which was around 4 minutes:

spark-submit --master yarn-cluster --executor-memory 64G --num-executors 30 --driver-memory 4g --executor-cores 4 --queue xxx test.jar

The screenshots below show the metrics report from the Spark UI for each stage after job completion. I want to know which factors I should be looking at when comparing these metrics across the Min, 25th percentile, Median, 75th percentile, and Max columns.

@Joe Widen

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

  • STAGE 0

9691-metrics-mem64g-s0.png

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

  • STAGE 1

9692-metrics-mem64g-s1.png

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

  • STAGE 2

9693-metrics-mem64g-s2.png

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

2 REPLIES


Hi @Adnan Alvee. Having already narrowed performance issues down to a particular job, the primary thing you will want to look for within each stage is skew. Skew is when a small number of tasks take significantly longer to execute than the rest. In the summary table you posted, compare the Median task duration to the 75th percentile and Max: if the Max is many times the Median, a handful of partitions are doing most of the work and that stage is skewed.

As you drill down into the slower tasks, focus on where the time is going. Is the slowness in reading data, writing data, or computation? That can narrow things down to a problem with a particular node, or it may indicate you don't have enough disks to handle the scratch space for shuffles.

Keep in mind that Spark scales roughly linearly with resources, so your processing may be slow simply because there isn't enough hardware. Look at how much memory and CPU you've allocated to your executors, as well as how many disks you have in each node. A rough back-of-the-envelope check of your parallelism is below.
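
As a rough illustration using your own numbers (and assuming the default 128 MB HDFS block size, which may not match your cluster): 182.4 GB of input splits into roughly 1,450 map tasks, while 30 executors x 4 cores gives 120 concurrent task slots, so the scan stage runs in about 12 waves of tasks. If the task count in a stage is wildly larger or smaller than your slot count, that is another signal to revisit partitioning and executor counts.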

It also looks as if your number of executors is quite large. Consider having fewer executors with more cores per executor. Also note that --executor-memory is memory per executor, so 64G across 30 executors is asking for a lot in total; a smaller value per executor is usually enough. Try playing with those numbers and see if it makes a difference, starting from something like the sketch below.
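
As a purely illustrative starting point (these specific values are assumptions, not a tested recommendation; the right settings depend on your node sizes and YARN queue limits), a submit with fewer, beefier executors and less memory per executor might look like:

spark-submit --master yarn-cluster --num-executors 15 --executor-cores 8 --executor-memory 32G --driver-memory 4g --queue xxx test.jar

From there, compare the per-stage task-time distributions in the Spark UI against your current run and adjust one knob at a time.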

Hope this helps.
