Created on ‎01-31-2016 08:46 PM - edited ‎08-17-2019 01:20 PM
Introduction
Spark doesn't supply a mechanism to have data pushed to it; instead, it pulls data from other sources. In NiFi, data can be exposed so that a receiver can pull from it by adding an Output Port to the root process group. For Spark, we will use this same mechanism: the Site-to-Site protocol to pull data from NiFi's Output Ports.
Prerequisite
1) This assumes you already have the latest version of NiFi (0.4.1 / HDF 1.1.1) downloaded on your HW Sandbox. If not, run the commands below after establishing an SSH connection to the sandbox:
# cd /opt/
# wget http://public-repo-1.hortonworks.com/HDF/1.1.1.0/nifi-1.1.1.0-12-bin.tar.gz
# tar -xvf nifi-1.1.1.0-12-bin.tar.gz
2) Download the compatible versions (in our case 0.4.1) of "nifi-spark-receiver" and "nifi-site-to-site-client" to a specific location on the sandbox:
# mkdir /opt/spark-receiver
# cd /opt/spark-receiver
# wget http://central.maven.org/maven2/org/apache/nifi/nifi-site-to-site-client/0.4.1/nifi-site-to-site-cli...
# wget http://central.maven.org/maven2/org/apache/nifi/nifi-spark-receiver/0.4.1/nifi-spark-receiver-0.4.1....
Steps:
1) Configure Spark to load the required NiFi libraries: edit spark-defaults.conf to add the jars to the classpath, appending the lines below to the bottom of the file:
# vi /usr/hdp/current/spark-client/conf/spark-defaults.conf

spark.driver.extraClassPath /opt/spark-receiver/nifi-spark-receiver-0.4.1.jar:/opt/spark-receiver/nifi-site-to-site-client-0.4.1.jar:/opt/nifi-1.1.1.0-12/lib/nifi-api-1.1.1.0-12.jar:/opt/nifi-1.1.1.0-12/lib/bootstrap/nifi-utils-1.1.1.0-12.jar:/opt/nifi-1.1.1.0-12/work/nar/framework/nifi-framework-nar-1.1.1.0-12.nar-unpacked/META-INF/bundled-dependencies/nifi-client-dto-1.1.1.0-12.jar
spark.driver.allowMultipleContexts = true
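A missing jar on that classpath is a common source of "not a member of package" errors later, so it can be worth sanity-checking that every listed jar actually exists on disk. A minimal sketch of such a check (the two /opt/spark-receiver paths are the ones from the download step above; extend CP with the NiFi lib paths for your layout):

```shell
# Sketch: verify each entry of a colon-separated classpath exists on disk.
CP="/opt/spark-receiver/nifi-spark-receiver-0.4.1.jar:/opt/spark-receiver/nifi-site-to-site-client-0.4.1.jar"
echo "$CP" | tr ':' '\n' | while read -r jar; do
  if [ -f "$jar" ]; then echo "found: $jar"; else echo "MISSING: $jar"; fi
done
```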
2) Open nifi.properties to update the configuration:
# vi /opt/nifi-1.1.1.0-12/conf/nifi.properties
3) Change the NiFi HTTP port to 8090, as the default 8080 conflicts with the Ambari web UI:
# web properties
nifi.web.http.port=8090
4) Configure the NiFi instance for site-to-site by changing the configuration below: add a port (say 8055) and set "nifi.remote.input.secure" to "false":
# Site to Site properties
nifi.remote.input.socket.host=
nifi.remote.input.socket.port=8055
nifi.remote.input.secure=false
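After editing, a quick grep can confirm the properties took effect. The sketch below runs against a sample copy written to /tmp so it is self-contained; on a real sandbox, point the grep at /opt/nifi-1.1.1.0-12/conf/nifi.properties instead:

```shell
# Write a sample of the edited Site-to-Site section (stand-in for the real nifi.properties)
cat > /tmp/nifi.properties.sample <<'EOF'
nifi.remote.input.socket.host=
nifi.remote.input.socket.port=8055
nifi.remote.input.secure=false
EOF
# Confirm the port and secure flag are set as expected
grep -E '^nifi\.remote\.input\.(socket\.port|secure)=' /tmp/nifi.properties.sample
```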
5) Now start NiFi on your Sandbox (restart it if it is already running, so the configuration changes take effect):
# /opt/nifi-1.1.1.0-12/bin/nifi.sh start
6) Let us build a small flow on the NiFi canvas that reads the app log generated by NiFi itself and feeds it to Spark:
a) Open the URL below in your browser: http://<your_vm_ip>:8090/nifi/
b) Drop an "ExecuteProcess" processor onto the canvas [or you can use the TailFile processor] to read lines added to "nifi-app.log". Auto-terminate the Failure relationship. The processor configuration would look like below:
c) Drop an Output Port onto the canvas and name it 'spark'. Once added, connect "ExecuteProcess" to the port for the Success relationship. This simple flow will look like below:
7) Now let's go back to the VM command line and create the Scala application that pulls data from the NiFi output port we just created. Change directory to "/opt/spark-receiver" and create a shell script file "spark-data.sh":
# cd /opt/spark-receiver
# vi spark-data.sh
8) Add the lines below, required for the application to pull data from the NiFi output port, to the script file and save it:
// Import all the libraries required
import org.apache.nifi._
import java.nio.charset._
import org.apache.nifi.spark._
import org.apache.nifi.remote.client._
import org.apache.spark._
import org.apache.nifi.events._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.nifi.remote._
import org.apache.nifi.remote.protocol._
import org.apache.spark.storage._
import org.apache.spark.streaming.receiver._
import java.io._
import org.apache.spark.serializer._

object SparkNiFiAttribute {
  def main(args: Array[String]) {
    // Build a Site-to-Site client config with the NiFi web URL and output port name ["spark", created in step 6c]
    val conf = new SiteToSiteClient.Builder().url("http://localhost:8090/nifi").portName("spark").buildConfig()
    // Set an app name
    val config = new SparkConf().setAppName("Nifi_Spark_Data")
    // Create a StreamingContext with a 10-second batch interval
    val ssc = new StreamingContext(config, Seconds(10))
    // Create a DStream using a NiFi receiver so that we can pull data from the specified port
    val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))
    // Map the data from NiFi to text, ignoring the attributes
    val text = lines.map(dataPacket => new String(dataPacket.getContent, StandardCharsets.UTF_8))
    // Print the first ten elements of each RDD generated
    text.print()
    // Start the computation
    ssc.start()
  }
}
SparkNiFiAttribute.main(Array())
9) Let's go back to the NiFi web UI and start the flow we created. Make sure nothing is wrong, and you should see data flowing.
10) Now load the script into spark-shell with the command below and start streaming:
# spark-shell -i spark-data.sh
11) In the screenshot below, you can see the NiFi logs being pulled and printed on the console:
12) In the same way, we can pull data from NiFi and extract the associated attributes:
// Import all the libraries required
import org.apache.nifi._
import java.nio.charset._
import org.apache.nifi.spark._
import org.apache.nifi.remote.client._
import org.apache.spark._
import org.apache.nifi.events._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.nifi.remote._
import org.apache.nifi.remote.protocol._
import org.apache.spark.storage._
import org.apache.spark.streaming.receiver._
import java.io._
import org.apache.spark.serializer._

object SparkNiFiData {
  def main(args: Array[String]) {
    // Build a Site-to-Site client config with the NiFi web URL and output port name
    val conf = new SiteToSiteClient.Builder().url("http://localhost:8090/nifi").portName("spark").buildConfig()
    // Set an app name
    val config = new SparkConf().setAppName("Nifi_Spark_Attributes")
    // Create a StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(config, Seconds(5))
    // Create a DStream using a NiFi receiver so that we can pull data from the specified port
    val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))
    // Extract the 'uuid' attribute from each FlowFile
    val text = lines.map(dataPacket => dataPacket.getAttributes.get("uuid"))
    // Print the first ten elements of each RDD generated
    text.print()
    // Start the computation
    ssc.start()
    // Wait for the streaming computation to terminate
    ssc.awaitTermination()
  }
}
SparkNiFiData.main(Array())
13) In the screenshot below, you can see the FlowFile attribute "uuid" being extracted and printed on the console:
14) You can create multiple output ports to transmit data to different Spark applications from the same NiFi instance at the same time.
Thanks,
Jobin George
Created on ‎01-25-2017 02:39 AM - edited ‎08-17-2019 01:19 PM
Hello, I got an error after step 10 when running spark-data.sh: "nifi is not a member of package org.apache". How can I fix this problem? Thank you.
Created on ‎01-25-2017 06:49 PM
@Yumin Dong Which version of HDF/NiFi are you using? If the latest one, I hope you downloaded the latest versions of the dependencies. Let me know.
Jobin
Created on ‎02-14-2017 12:45 PM
Hi George, thanks for the wonderful tutorial.
I am trying to connect NiFi and Spark in an HDInsight (Azure) cluster and am ending up with a lot of errors.
Does this code work for HDF only?
Please have a look at the screenshot attached. untitled.png
Created on ‎02-16-2017 03:54 AM
Hi Harsh, I can't make out much from the screenshot as the main cause of the error is not visible. At first glance, it looks more like an environment issue.
Created on ‎06-08-2017 03:35 PM
Hi George,
I am not using the sandbox, but rather have a standalone installation of Spark and NiFi on my PC.
I am using Apache NiFi 1.2.0 and have followed the entire tutorial. I get an error on
import org.apache.nifi.events._
<console>:38: error: object events is not a member of package org.apache.nifi import org.apache.nifi.events._
I have included all the relevant jars that you have mentioned.
- nifi-site-to-site-client-1.2.0.jar
- nifi-spark-receiver-1.2.0.jar
- nifi-api-1.2.0.jar
- nifi-utils-1.2.0.jar
- nifi-client-dto-1.2.0.jar
I opened all the jars and, sure enough, there is no directory org.apache.nifi.events in any of them.
How can I find this missing import?
Also, when I try to run the code in IntelliJ I don't get any errors, but I do get the following warning:
17/06/08 18:16:14 INFO ReceiverSupervisorImpl: Stopping receiver with message: Registered unsuccessfully because Driver refused to start receiver 0
I copied the following code into IntelliJ and commented out the last line:
// Import all the libraries required
import org.apache.nifi._
import java.nio.charset._
import org.apache.nifi.spark._
import org.apache.nifi.remote.client._
import org.apache.spark._
import org.apache.nifi.events._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.nifi.remote._
import org.apache.nifi.remote.protocol._
import org.apache.spark.storage._
import org.apache.spark.streaming.receiver._
import java.io._
import org.apache.spark.serializer._

object SparkNiFiAttribute {
  def main(args: Array[String]) {
    /*
    import java.util
    val additionalJars = new util.ArrayList[String]
    additionalJars.add("/home/arsalan/NiFiSparkJars/nifi-site-to-site-1.2.0.jar")
    */
    val config = new SparkConf().setAppName("Nifi_Spark_Data")
      // .set("spark.driver.extraClassPath","/home/arsalan/NiFiSparkJars/nifi-site-to-site-client-1.2.0.jar:/home/arsalan/NiFiSparkJars/nifi-spark-receiver-1.2.0.jar:/home/arsalan/nifi-1.2.0/lib/nifi-api-1.2.0.jar:/home/arsalan/nifi-1.2.0/lib/bootstrap/nifi-utils-1.2.0.jar:/home/arsalan/nifi-1.2.0/work/nar/framework/nifi-framework-nar-1.2.0.nar-unpacked/META-INF/bundled-dependencies/nifi-client-dto-1.2.0.jar")
      .set("spark.driver.allowMultipleContexts", "true")
      .setMaster("local[*]")
    // Build a Site-to-site client config with NiFi web url and output port name [spark created in step 6c]
    val conf = new SiteToSiteClient.Builder().url("http://localhost:8080/nifi").portName("Data_to_Spark").buildConfig()
    // Create a StreamingContext
    val ssc = new StreamingContext(config, Seconds(1))
    ssc.sparkContext.getConf.getAll.foreach(println)
    // Create a DStream using a NiFi receiver so that we can pull data from specified Port
    val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))
    // Map the data from NiFi to text, ignoring the attributes
    val text = lines.map(dataPacket => new String(dataPacket.getContent, StandardCharsets.UTF_8))
    // Print the first ten elements of each RDD generated
    text.print()
    // Start the computation
    ssc.start()
  }
}
//SparkNiFiAttribute.main(Array())
Created on ‎06-09-2017 09:00 AM
To run the code in IntelliJ, the above code is fine! You only need to add ssc.awaitTermination() after ssc.start(). To run it in the shell, I needed to create a fat jar (uber jar / standalone jar). The missing import org.apache.nifi.events._ was available in nifi-framework-api-1.2.0.jar.
I used Maven to create the fat jar using the maven-assembly-plugin.
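For reference, a minimal pom.xml sketch of the maven-assembly-plugin setup described above (the jar-with-dependencies descriptor and the package-phase binding are standard plugin usage, not details from this thread):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <!-- Bundle the project classes together with all dependencies into one jar -->
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <!-- Build the fat jar as part of `mvn package` -->
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```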
Created on ‎06-09-2017 07:22 PM
Hello @Arsalan Siddiqi,
Here you can find Spark integration with HDF-2.x (still only nifi-1.1); you can figure out the dependencies from there [first step under the "Configuring and Restarting Spark" section].
Thanks
Created on ‎06-10-2017 11:39 PM
Awesome! Love this work, great job Jobin!
Created on ‎09-12-2017 09:54 AM
How can we deal with secured (HTTPS) connections?
NiFi runs on:
https://<ip>:8443
and I'm using SSL certificates for authentication
Created on ‎09-26-2017 12:27 PM
Hi @Jobin George,
I'm using open source NiFi version 1.3.0 and followed the steps you shared.
I'm not able to enable transmission on the output port. nifisparkreciever.png. Have I missed anything?