Created 12-08-2020 09:00 AM
I have a Spark job where I am tuning the code and settings to improve performance. However, I'm not sure how to track the performance when running on a shared cluster, to see whether I'm making things better or worse!
I've considered the following metrics, though with each of them, even when running an identical job twice, I can get very different values, so they don't seem a good way to track the performance of the job.
Ideally I'd like to be able to track the performance of the job over time as the codebase evolves.
Created 12-22-2020 10:15 PM
Hi @TimmehG, Spark has a configurable metrics system based on the Dropwizard Metrics library. It allows users to report Spark metrics to a variety of sinks, including HTTP, JMX, and CSV files. The metrics are generated by sources embedded in the Spark codebase, which provide instrumentation for specific activities and Spark components. The metrics system is configured via a configuration file that Spark expects at $SPARK_HOME/conf/metrics.properties; a custom file location can be specified via the spark.metrics.conf configuration property. Alternatively, instead of using the configuration file, the same settings can be supplied as configuration parameters with the spark.metrics.conf. prefix.
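For example, a minimal metrics.properties that reports driver and executor metrics to CSV files could look like the sketch below (the sink directory and reporting period are just illustrative values):

# Report all metrics to CSV files every 10 seconds
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics

# Also enable the JVM source on the driver and executors
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource

The same settings can be passed without a file, e.g. --conf "spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink" on spark-submit.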
I agree with you: running Spark applications continuously and reliably is a challenging task, and a good performance monitoring system is needed.
Several external tools can also be used to help profile the performance of Spark jobs, such as cluster-wide monitoring systems, OS profiling tools, and JVM utilities; these are covered in the monitoring guide linked below.
For more insights, you can refer to the links below:
https://spark.apache.org/docs/latest/monitoring.html
https://blog.cloudera.com/demystifying-spark-jobs-to-optimize-for-cost-and-performance/
https://www.infoq.com/articles/spark-application-monitoring-influxdb-grafana/
Please accept the answer you found most useful.
Created 01-06-2021 03:26 AM
Thanks, that's very helpful input for debugging performance issues and tuning jobs. As far as I can see, though, there aren't any metrics that provide a simple way to track overall job performance over time on a shared cluster. Aggregated resource usage seems the closest thing, but on a shared cluster I think I'll just have to accept that it can vary widely between two identical runs depending on the state of the cluster, so while it gives some indication of job performance, there's not really any panacea.
I think I was looking for something like I would usually track for a web app (e.g. response times of a web server under a given load), which helps me spot when performance regressions happen. I guess that's not such a straightforward thing to do with the kinds of workload Spark handles!
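In case it helps anyone else, one lower-variance signal I may try is the aggregated task metrics Spark already collects, captured per run with a SparkListener. A rough sketch in Scala (the class and field names are just placeholders; it assumes the standard SparkListener/TaskMetrics API):

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulates aggregated task metrics for the whole run so they can be
// logged (or written to a metrics store) when the job finishes.
class RunMetricsListener extends SparkListener {
  val totalExecutorRunTimeMs = new AtomicLong(0L)
  val totalExecutorCpuTimeNs = new AtomicLong(0L)
  val totalShuffleReadBytes  = new AtomicLong(0L)
  val totalShuffleWriteBytes = new AtomicLong(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {  // metrics can be missing for failed tasks
      totalExecutorRunTimeMs.addAndGet(m.executorRunTime)
      totalExecutorCpuTimeNs.addAndGet(m.executorCpuTime)
      totalShuffleReadBytes.addAndGet(m.shuffleReadMetrics.totalBytesRead)
      totalShuffleWriteBytes.addAndGet(m.shuffleWriteMetrics.bytesWritten)
    }
  }
}

// Usage: register the listener before running the job, then record the totals
// together with a run/commit identifier.
// val listener = new RunMetricsListener()
// spark.sparkContext.addSparkListener(listener)
// ... run the job ...
// println(s"executorRunTimeMs=${listener.totalExecutorRunTimeMs.get()}")

Logging these totals alongside a commit identifier should make regressions easier to spot than wall-clock time, since they are less affected by whatever else is running on the cluster.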