Cloudera Employee

What is DataFlint

DataFlint is an open-source D-APM (Data-Application Performance Monitoring) tool for Apache Spark, built for big data engineers. For more information, see dataflint.gitbook.io/dataflint-for-spark or the Git repository itself.

How to Integrate DataFlint in CDP

DataFlint supports Spark 3.2+, so you will need a Spark 3 parcel.

For spark-submit, see the DataFlint docs:

spark-submit \
--packages io.dataflint:spark_2.12:0.1.4 \
--conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
...
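
When the same flags are reused across several jobs, it can help to keep the DataFlint version in one place. The following is a minimal sketch, not something shipped with DataFlint: the function name `dataflint_submit_flags` and the `DATAFLINT_VERSION` variable are hypothetical, while the Maven coordinate and plugin class are the ones from the command above.

```shell
# Hypothetical helper: prints the DataFlint-related spark-submit flags,
# one per line. The coordinate and plugin class come from the command above;
# the function name and DATAFLINT_VERSION variable are assumptions.
DATAFLINT_VERSION="${DATAFLINT_VERSION:-0.1.4}"

dataflint_submit_flags() {
  printf '%s\n' \
    "--packages" "io.dataflint:spark_2.12:${DATAFLINT_VERSION}" \
    "--conf" "spark.plugins=io.dataflint.spark.SparkDataflintPlugin"
}

# Usage (left commented out; the remaining arguments are your own job's):
# spark-submit $(dataflint_submit_flags) ...
```

The command substitution is deliberately left unquoted in the usage line so that the shell splits the output into four separate arguments.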

To install on the Spark History Server:

  1. Download the DataFlint jar for Scala 2.12.
  2. Copy it to the Spark 3 parcel jars directory on every host that runs a Spark History Server role instance:
    /opt/cloudera/parcels/SPARK3/lib/spark3/jars
    Beware: /opt/cloudera/parcels/SPARK3 is a symlink to the currently active SPARK3 parcel and is controlled by Cloudera Manager, so the jar will likely need to be copied again after a different parcel version is activated.
  3. Go to CM > Cluster > Spark 3 > Configuration > spark3-conf/spark-history-server.conf_role_safety_valve
    (shown as `History Server Advanced Configuration Snippet (Safety Valve) for spark3-conf/spark-history-server.conf`).
  4. Add:
    spark.history.provider=org.apache.spark.deploy.history.FsDataflintHistoryProvider
  5. Restart the Spark History Server.
  6. If you see:
    java.lang.ClassNotFoundException: org.apache.spark.deploy.history.FsDataflintHistoryProvider
    then the jar is misplaced. Check that it is in the directory from step 2 on that host.
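
If the ClassNotFoundException appears, one way to check the jar placement is to scan the jars directory for the provider class. This is a rough sketch under stated assumptions: the function name `find_provider_jar` is hypothetical, `unzip` is assumed to be installed on the host, and the class path inside the jar is derived from the class name in the exception by the usual Java layout.

```shell
# Hypothetical check: does any jar in a directory contain the DataFlint
# history provider class? Assumes `unzip` is available on the host.
JARS_DIR="${JARS_DIR:-/opt/cloudera/parcels/SPARK3/lib/spark3/jars}"
PROVIDER_CLASS="org/apache/spark/deploy/history/FsDataflintHistoryProvider.class"

find_provider_jar() {
  # $1: directory to scan. Prints the first jar that contains the class
  # and returns 0; returns 1 if no jar in the directory contains it.
  for jar in "$1"/*.jar; do
    [ -f "$jar" ] || continue
    if unzip -l "$jar" 2>/dev/null | grep -qF "$PROVIDER_CLASS"; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  return 1
}

# Usage (left commented out; run it on each History Server host):
# find_provider_jar "$JARS_DIR" || echo "DataFlint jar not found in $JARS_DIR"
```

Running it on every host that has a History Server role instance should show quickly whether the jar was copied to the wrong directory or is missing after a parcel change.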

See the DataFlint docs for the download and more information.

That's it!
