VidyaSargur · ‎02-20-2024

What is DataFlint

DataFlint is an open-source D-APM (Data-Application Performance Monitoring) tool for Apache Spark, built for big data engineers. For more information, see dataflint.gitbook.io/dataflint-for-spark or the project's Git repository.

DataFlint supports Spark 3.2+, so you will need a Spark 3 parcel.

To enable DataFlint for a job launched with spark-submit, add the package and plugin options (see the DataFlint documentation):

spark-submit \
--packages io.dataflint:spark_2.12:0.1.4 \
--conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
...
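To avoid retyping these options for every job, they can be wrapped in a small shell function. This is only a sketch: the function name `spark_submit_with_dataflint` and the `SPARK_SUBMIT` and `DATAFLINT_VERSION` variables are hypothetical names introduced here, not part of DataFlint or Spark. The package coordinates and plugin class are the ones shown above.

```shell
# Hypothetical wrapper (not part of DataFlint): prepends the DataFlint
# options from this article to whatever arguments you pass in.
# SPARK_SUBMIT and DATAFLINT_VERSION are assumed variable names; they
# default to the launcher and version used in the article.
spark_submit_with_dataflint() {
  "${SPARK_SUBMIT:-spark-submit}" \
    --packages "io.dataflint:spark_2.12:${DATAFLINT_VERSION:-0.1.4}" \
    --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
    "$@"
}
```

For example, `spark_submit_with_dataflint my_job.py` would launch `my_job.py` with the plugin enabled. If the Spark 3 launcher on your cluster has a different name, set `SPARK_SUBMIT` to it first.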

To install on the Spark History Server:

Download the DataFlint jar for Scala 2.12.
Copy it to every server that runs a Spark History Server role instance, into the Spark 3 parcel jars directory:
```
/opt/cloudera/parcels/SPARK3/lib/spark3/jars
```
Beware: /opt/cloudera/parcels/SPARK3 is a symlink to the currently active SPARK3 parcel and is managed by Cloudera Manager, so the jar may need to be copied again after a parcel upgrade or a change of active parcel.
In Cloudera Manager, go to Cluster > Spark 3 > Configuration and find `History Server Advanced Configuration Snippet (Safety Valve) for spark3-conf/spark-history-server.conf` (spark3-conf/spark-history-server.conf_role_safety_valve).

And add:

spark.history.provider=org.apache.spark.deploy.history.FsDataflintHistoryProvider
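Because the History Server loads this provider class from the jar copied earlier, it can help to confirm on each History Server host that a jar in the parcel directory actually contains the class. This is a minimal sketch, assuming `unzip` and `grep` are installed on the host. Both function names are hypothetical helpers written for this article, not DataFlint or Cloudera tooling.

```shell
# Hypothetical helper: succeeds if the given jar contains the DataFlint
# history provider class named in the configuration above.
jar_has_dataflint_provider() {
  # $1: path to a jar file
  unzip -l "$1" 2>/dev/null |
    grep -q 'org/apache/spark/deploy/history/FsDataflintHistoryProvider\.class'
}

# Hypothetical helper: prints the first jar in a directory that ships the
# provider class; returns non-zero if none is found.
find_dataflint_jar() {
  # $1: jars directory; defaults to the path used in this article
  dir="${1:-/opt/cloudera/parcels/SPARK3/lib/spark3/jars}"
  for jar in "$dir"/*.jar; do
    [ -f "$jar" ] || continue
    if jar_has_dataflint_provider "$jar"; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  return 1
}
```

Running `find_dataflint_jar` with no arguments checks the default parcel path. A non-zero exit status suggests the jar is missing from that host or is not the expected build.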

If you are seeing:

java.lang.ClassNotFoundException: org.apache.spark.deploy.history.FsDataflintHistoryProvider