Screenshots from this walkthrough:
- Apache Zeppelin running the same Scala job (you have to add the JAR to the Spark interpreter and restart)
- Grafana charts of the Apache NiFi run
- Log search helps you find errors
- Run code for your Spark class
- Setting up your ExecuteSparkInteractive processor
- Setting up your Spark service for Scala
- Tracking the job in the Livy UI
- Tracking the job in the Spark UI
My first thought was to pull code from Git, put it into a NiFi attribute, and run it directly. That works for short snippets, but bigger projects have many classes and dependencies that may require a full IDE and an SBT build cycle. Once I build a Scala JAR, I want to run against that.
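For reference, here is a minimal build.sbt sketch for packaging a job like the one below; the version numbers are assumptions, so match them to the Scala and Spark versions on your cluster:

name := "example"
organization := "com.dataflowdeveloper"
version := "1.0"
scalaVersion := "2.11.12"  // assumed; use your cluster's Scala version
// Spark is provided by the cluster at runtime, so mark it "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"

Running sbt package then produces the JAR you will deploy.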
Example Code
package com.dataflowdeveloper.example

import org.apache.spark.sql.SparkSession

class Example {
  def run(spark: SparkSession): Unit = {
    try {
      println("Started")
      // Read the raw smart plug JSON records from HDFS
      val shdf = spark.read.json("hdfs://princeton0.field.hortonworks.com:8020/smartPlugRaw")
      shdf.printSchema()
      shdf.createOrReplaceTempView("smartplug")
      val stuffdf = spark.sql("SELECT * FROM smartplug")
      stuffdf.count() // force the query to execute
      println("Complete.")
    } catch {
      case e: Exception => e.printStackTrace()
    }
  }
}
Run that with:
import com.dataflowdeveloper.example.Example
println("Before run")
val job = new Example()
job.run(spark)
println("After run")
After the run, Livy returns the output of the last statement:
{"text\/plain":"After run"}
Import Tip
You need to list your JAR in Session.jars on the Livy session controller and also put a copy at the same directory path on HDFS. So I placed the JAR at /opt/demo/example.jar on Linux and at /opt/demo/example.jar on HDFS (e.g., via hdfs dfs -put). Make sure Livy and NiFi have read permissions on both copies.
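If you prefer to script the upload instead of using the CLI, here is a minimal sketch using Hadoop's FileSystem API; the paths are the ones from this example, and it assumes it runs on a node whose Hadoop configuration points at your cluster:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Connect to the default file system defined in this node's Hadoop configuration
val fs = FileSystem.get(new Configuration())

// Copy the local JAR to the matching path on HDFS
fs.copyFromLocalFile(new Path("/opt/demo/example.jar"), new Path("/opt/demo/example.jar"))

// Confirm the upload; Livy and NiFi still need read permission on this file
println(fs.getFileStatus(new Path("/opt/demo/example.jar")))

When the session starts, the Livy log confirms that the JAR is not re-copied, because both references resolve to the same file system: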
18/03/14 11:54:54 INFO LineBufferedStream: stdout: 18/03/14 11:54:54 INFO Client: Source and destination file systems are the same. Not copying hdfs://server:8020/opt/demo/example.jar