Spark Memory Monitoring: Need help with Graphite queries to monitor Spark applications


I have configured Graphite and Grafana to monitor Spark applications as per "".

Are the queries below correct?

  • Driver Memory
    • Driver Heap Usage
      aliasByNode($application.driver.jvm.heap.usage, 1)
    • Driver JVM Memory Pools Usage
      aliasByNode($application.driver.jvm.pools.*.used, 4)
  • Executor & Driver Memory Used
    aliasByNode($application.*.jvm.heap.used, 1)
  • Executor Memory Used
    aliasByNode(exclude($application.*.jvm.heap.used, '.driver.jvm.heap'), 1)
    alias(sumSeries(exclude($application.*.jvm.heap.used, '.driver.jvm.heap')), 'total')
  • Task Executor
    • Active Tasks Per Executor
      aliasByNode(summarize($application.*.executor.threadpool.activeTasks, '10s', 'sum', false), 1)
    • Completed Tasks per Executor
      aliasByNode($application.*.executor.threadpool.completeTasks, 1)
    • Completed Tasks/Minute per Executor
      aliasByNode(nonNegativeDerivative(summarize($application.*.executor.threadpool.completeTasks, '1m', 'avg', false)), 1)
  • Read/Write IOPS
    • Read IOPS
      alias(perSecond(sumSeries($application.*.executor.filesystem.hdfs.read_ops)), 'total')
      aliasByNode(perSecond($application.*.executor.filesystem.hdfs.read_ops), 1)
    • Write IOPS
      alias(perSecond(sumSeries($application.*.executor.filesystem.hdfs.write_ops)), 'total')
      aliasByNode(perSecond($application.*.executor.filesystem.hdfs.write_ops), 1)
  • HDFS Bytes Reads/Writes Per Executor
    • Executor HDFS Reads
    • Executor HDFS Bytes Written
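
As a sanity check on the node indexes passed to aliasByNode in the queries above, here is a minimal sketch of how a dot-separated Graphite path maps to zero-based node positions. The application id and pool name below are made up for illustration, and the helper `node` is hypothetical, not part of Graphite.

```python
def node(path: str, index: int) -> str:
    """Return the dot-separated node at a zero-based index,
    which is the part aliasByNode would use as the series label."""
    return path.split(".")[index]


# Hypothetical metric paths; the application id and pool name are made up.
heap = "app-20240101-0001.driver.jvm.heap.usage"
pool = "app-20240101-0001.driver.jvm.pools.PS-Eden-Space.used"
executor = "app-20240101-0001.3.jvm.heap.used"

print(node(heap, 1))      # driver
print(node(pool, 4))      # PS-Eden-Space
print(node(executor, 1))  # 3
```

Under these assumptions, index 1 labels a series by driver or executor id, and index 4 labels a JVM pool series by pool name.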

Also, can Grafana and Graphite provide metrics for the use case below?

  • We have a number of hourly and daily batches on Airflow. These batches use PySpark for data processing.
  • We want to see the historical trend of Spark memory usage for the same batch across runs.
  • So we want to aggregate the Spark applications that belong to the same batch and visualize the historical trend, so we can check how memory usage increases with traffic.
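
One possible approach to the use case above, sketched under assumptions: by default Spark prefixes metric paths with the application id, which changes on every run, so each run lands under a different Graphite path. Spark's `spark.metrics.namespace` setting can replace that prefix with a stable value, so all runs of one batch would report under the same path. The batch name `daily_sales` and both helper functions are hypothetical, and the batch name is assumed to contain no dots so the node indexes stay the same.

```python
def batch_metrics_conf(batch_name: str) -> dict:
    """Spark settings that give every run of a batch the same metrics prefix.
    Assumption: batch_name is a stable, dot-free name chosen per Airflow task."""
    return {"spark.metrics.namespace": batch_name}


def spark_submit_args(conf: dict) -> list:
    """Render the settings as --conf arguments for spark-submit."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def heap_used_target(batch_name: str) -> str:
    """Graphite target for heap usage across all runs of the batch."""
    return f"aliasByNode({batch_name}.*.jvm.heap.used, 1)"


conf = batch_metrics_conf("daily_sales")  # hypothetical batch name
print(spark_submit_args(conf))
print(heap_used_target("daily_sales"))
```

With a stable prefix, the Grafana `$application` variable would select a batch rather than a single run, and the time range of the dashboard would show the trend across runs.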


Could someone please take a look?