Created 12-08-2020 09:00 AM
I have a Spark job where I am tuning the code and settings to improve performance. However, I'm not sure how to track the performance when running on a shared cluster, to see whether I'm making things better or worse!
I've considered the following metrics, though with each of them, even when running an identical job twice, I can get very different values, so they don't seem a good way to track the performance of the job.
Ideally I'd like to be able to track the performance of the job over time as the codebase evolves.
Created 12-22-2020 10:15 PM
Hi @TimmehG, Spark has a configurable metrics system based on the Dropwizard Metrics library. It allows users to report Spark metrics to a variety of sinks, including HTTP, JMX, and CSV files. The metrics are generated by sources embedded in the Spark codebase, which provide instrumentation for specific activities and Spark components. The metrics system is configured via a configuration file that Spark expects at $SPARK_HOME/conf/metrics.properties; a custom file location can be specified via the spark.metrics.conf configuration property. Alternatively, instead of using the configuration file, the same settings can be supplied as configuration parameters with the spark.metrics.conf. prefix.
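For example, a minimal metrics.properties that reports driver and executor metrics to CSV files could look like the sketch below (the sink directory and reporting period are just illustrative values):

# Report all metrics to CSV files every 10 seconds
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics

# Also enable the JVM source on the driver and executors
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource

The same settings can be passed without a file, e.g. --conf "spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink" on spark-submit.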
I agree with you: running Spark applications continuously and reliably is a challenging task, and a good performance monitoring system is needed.
Several external tools can also be used to help profile the performance of Spark jobs, such as cluster-wide monitoring systems, OS profiling tools, and JVM utilities; these are covered in the monitoring guide linked below.
For more insights, you can refer to the links below:
https://spark.apache.org/docs/latest/monitoring.html
https://blog.cloudera.com/demystifying-spark-jobs-to-optimize-for-cost-and-performance/
https://www.infoq.com/articles/spark-application-monitoring-influxdb-grafana/
Please accept the answer you found most useful.
Created 01-06-2021 03:26 AM
Thanks, that's very helpful input for debugging performance issues and tuning jobs. As far as I can see, though, there aren't any metrics that provide a simple way to track overall job performance over time on a shared cluster. Aggregated resource usage seems the closest thing, but on a shared cluster I think I'll just have to accept that it can vary widely between two identical runs depending on the state of the cluster, so while it gives some indication of job performance, there's not really any panacea.
I think I was looking for something like I would usually track for a web app (e.g. response times of a web server under a given load), which helps me spot when performance regressions happen. I guess that's not such a straightforward thing to do with the kinds of workload Spark handles!
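In case it helps anyone else, one lower-variance signal I may try is the aggregated task metrics Spark already collects, captured per run with a SparkListener. A rough sketch in Scala (the class and field names are just placeholders; it assumes the standard SparkListener/TaskMetrics API):

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulates aggregated task metrics for the whole run so they can be
// logged (or written to a metrics store) when the job finishes.
class RunMetricsListener extends SparkListener {
  val totalExecutorRunTimeMs = new AtomicLong(0L)
  val totalExecutorCpuTimeNs = new AtomicLong(0L)
  val totalShuffleReadBytes  = new AtomicLong(0L)
  val totalShuffleWriteBytes = new AtomicLong(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {  // metrics can be missing for failed tasks
      totalExecutorRunTimeMs.addAndGet(m.executorRunTime)
      totalExecutorCpuTimeNs.addAndGet(m.executorCpuTime)
      totalShuffleReadBytes.addAndGet(m.shuffleReadMetrics.totalBytesRead)
      totalShuffleWriteBytes.addAndGet(m.shuffleWriteMetrics.bytesWritten)
    }
  }
}

// Usage: register the listener before running the job, then record the totals
// together with a run/commit identifier.
// val listener = new RunMetricsListener()
// spark.sparkContext.addSparkListener(listener)
// ... run the job ...
// println(s"executorRunTimeMs=${listener.totalExecutorRunTimeMs.get()}")

Logging these totals alongside a commit identifier should make regressions easier to spot than wall-clock time, since they are less affected by whatever else is running on the cluster.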