Thanks, that's very helpful input in terms of debugging performance issues and tuning jobs. So far as I can see there aren't any metrics that provide simple ways to track overall job performance over time on a shared cluster. Aggregated resource usage seems the closest thing, but on a shared cluster I think I will just need to accept that can vary widely depending on the state of the cluster for 2 identical job runs, so while it gives some indication of job performance, there's not really any panacea. I think I was looking for something like I would usually track for web app (e.g. response times of a web server under a given load), which help me to spot when performance regressions happen. I guess not such a straightforward thing to do with the kinds of workload Spark handles!
... View more
I have a Spark job where I am tuning the code and settings to improve performance. However I'm not sure how to track the performance when using a shared cluster to see if I'm making things better or worse! I've considered the following metrics, though with each of them even when running an identical job twice I can get very different values, so it doesn't seem a good way to track the performance of the job. Aggregated resource usage (CPU and Memory) Task time (active and total) Job run time Ideally I'd like to be able to track the performance of the job over time as the codebase evolves
... View more