Hi @TimmehG, Spark has a configurable metrics system based on the Dropwizard Metrics Library. It lets users report Spark metrics to a variety of sinks, including HTTP, JMX, and CSV files. The metrics are generated by sources embedded in the Spark codebase, which provide instrumentation for specific activities and Spark components. The metrics system is configured through a file that Spark expects at $SPARK_HOME/conf/metrics.properties; a custom file location can be set with the spark.metrics.conf configuration property. Alternatively, instead of using a configuration file, you can set a group of configuration parameters that share the prefix spark.metrics.conf. (the trailing dot is part of the prefix).
I agree with you: running Spark applications continuously and reliably is a challenging task, and a good performance monitoring system is needed.
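As a rough illustration of the two configuration styles, here is a minimal Python sketch. The sink and source class names are the ones shipped in Spark's metrics.properties.template; the helper names, the output directory, and the reporting period are my own assumptions, so adjust them to your setup.

```python
# Minimal sketch: describe a CSV sink, a JMX sink and the JVM source once,
# then render them either as metrics.properties text or as spark.metrics.conf.* keys.
# Helper names, directory and period are illustrative assumptions.

def metrics_props(csv_dir="/tmp/spark-metrics", period_seconds=10):
    """Return metrics settings in metrics.properties key form."""
    return {
        # "*" applies the sink to every instance (driver, executor, ...).
        "*.sink.csv.class": "org.apache.spark.metrics.sink.CsvSink",
        "*.sink.csv.period": str(period_seconds),
        "*.sink.csv.unit": "seconds",
        "*.sink.csv.directory": csv_dir,
        "*.sink.jmx.class": "org.apache.spark.metrics.sink.JmxSink",
        # JVM source for driver and executors (memory, GC, and so on).
        "driver.source.jvm.class": "org.apache.spark.metrics.source.JvmSource",
        "executor.source.jvm.class": "org.apache.spark.metrics.source.JvmSource",
    }


def as_properties_text(props):
    """Render the settings as the body of a metrics.properties file."""
    return "".join(f"{key}={value}\n" for key, value in props.items())


def as_spark_conf(props):
    """Prefix every key so it can be passed as an ordinary Spark config entry."""
    return {"spark.metrics.conf." + key: value for key, value in props.items()}


if __name__ == "__main__":
    props = metrics_props()
    print(as_properties_text(props))
    for key, value in as_spark_conf(props).items():
        print(f"--conf {key}={value}")
```

The first output can be saved as $SPARK_HOME/conf/metrics.properties; the second form can be handed to spark-submit with --conf, or to SparkSession.builder.config(key, value) if you use PySpark.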
Several external tools can be used to help profile the performance of Spark jobs:
- Cluster-wide monitoring tools, such as Ganglia, can provide insight into overall cluster utilization and resource bottlenecks. For instance, a Ganglia dashboard can quickly reveal whether a particular workload is disk-bound, network-bound, or CPU-bound.
- OS profiling tools such as dstat, iostat, and iotop can provide fine-grained profiling on individual nodes.
- JVM utilities are useful for those comfortable with JVM internals: jstack provides stack traces, jmap creates heap dumps, jstat reports time-series statistics, and jconsole lets you visually explore various JVM properties.
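To make the JVM utilities above concrete, here is a small Python sketch that builds typical invocations for a given executor or driver process ID. The helper names, the heap file name, and the sampling interval are assumptions for illustration; the command-line options themselves are standard JDK ones.

```python
import shutil
import subprocess


def jvm_commands(pid, heap_file="heap.hprof", interval_ms=1000, samples=5):
    """Build common JDK diagnostic commands for one JVM process."""
    pid = str(pid)
    return {
        # Thread dump of the running JVM.
        "jstack": ["jstack", pid],
        # Binary heap dump of live objects, written to heap_file.
        "jmap": ["jmap", f"-dump:live,format=b,file={heap_file}", pid],
        # GC utilisation, sampled every interval_ms for `samples` rows.
        "jstat": ["jstat", "-gcutil", pid, str(interval_ms), str(samples)],
    }


def run_tool(cmd):
    """Run a command and return its stdout, or None if the tool is not installed."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # 12345 is a placeholder; use the PID of your driver or executor JVM.
    for name, cmd in jvm_commands(12345).items():
        print(name, "->", " ".join(cmd))
```

These need to be run on the node that hosts the JVM, usually as the same user that owns the Spark process.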
For more insights you can refer to the below links:
Please accept the answer you found most useful.