Community Articles

TimothySpann · ‎02-09-2018

I want to easily integrate Apache Spark jobs with my Apache NiFi flows. Fortunately, with the release of HDF 3.1, I can do that via Apache NiFi's ExecuteSparkInteractive processor.

First, I set up a CentOS 7 cluster with HDF 3.1 by following the well-written guide here.

With the magic of time-lapse photography, we instantly have a new cluster of goodness:

Note the new NiFi Registry, which provides version control for flows and more. We also get the new Kafka 1.0, an updated SAM, and the ever-important updated Schema Registry.

The star of the show today is Apache NiFi 1.5 here.

My first step is to add a Controller Service (LivySessionController).

Then we point the controller service at the Apache Livy server, whose host you can find in your Ambari UI. The default port is 8999. For my session, I am using Python, so I picked pyspark as the session kind. You can also pick pyspark3 for Python 3 code, spark for Scala, and sparkr for R.
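
For reference, the host, port, and session kind configured above correspond to what Livy's REST API expects when a session is created (a POST to /sessions with a "kind" field). The following is a minimal sketch of that request, not the controller service's actual implementation. The hostname and the helper name build_session_request are hypothetical, and the request is only built, never sent.

```python
import json
import urllib.request

# Session kinds named in the article: Python, Python 3, Scala, R.
VALID_KINDS = {"pyspark", "pyspark3", "spark", "sparkr"}


def build_session_request(livy_url, kind="pyspark"):
    """Build (but do not send) a Livy 'create session' request.

    livy_url is the base URL of the Livy server, e.g. http://host:8999.
    """
    if kind not in VALID_KINDS:
        raise ValueError(f"unsupported session kind: {kind}")
    body = json.dumps({"kind": kind}).encode("utf-8")
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host; 8999 is the default port mentioned in the article.
request = build_session_request("http://livy-host.example.com:8999", "pyspark")
print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

To actually create the session you would pass the request to urllib.request.urlopen, assuming the Livy server is reachable and does not require authentication.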

To execute a Python job, you can either pass the code in from a previous processor to the ExecuteSparkInteractive processor or put the code inline in the processor. I put the code inline.
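
To make the idea of "passing code" concrete: Livy's REST API runs code in an interactive session when it is posted as a statement to /sessions/{id}/statements. The sketch below shows the shape of such a request under stated assumptions. The hostname, the session id 0, the helper name build_statement_request, and the PySpark snippet are all illustrative, and this is not a description of how the processor works internally. The request is only built, never sent.

```python
import json
import urllib.request


def build_statement_request(livy_url, session_id, code):
    """Build (but do not send) a Livy 'run statement' request for a session."""
    body = json.dumps({"code": code}).encode("utf-8")
    return urllib.request.Request(
        url=f"{livy_url.rstrip('/')}/sessions/{session_id}/statements",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative PySpark code of the sort you might put inline in the processor.
# `sc` is the SparkContext that a Livy pyspark session provides.
PYSPARK_CODE = "rdd = sc.parallelize(range(100))\nprint(rdd.sum())"

# Hypothetical host and session id.
request = build_statement_request("http://livy-host.example.com:8999", 0, PYSPARK_CODE)
print(request.full_url)
print(request.data.decode("utf-8"))
```

The code travels as a plain string inside the JSON body, which is why it makes no difference to Livy whether the code came from an upstream flow file or was typed inline.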

Two new features of Schema Registry are worth mentioning. The first is version comparison:

You click the COMPARE VERSIONS link and now you have a nice comparison UI.

The second is the amazing new Swagger documentation, which lets you browse and test the Schema Registry APIs interactively.

Not only do you get all the input and output parameters, the full URL, and a curl example, you also get to run the calls live against your server.

I will be adding an article on how to use Apache NiFi to infer schemas from data using InferAvroSchema and publish these new schemas to the Schema Registry via its REST API automagically.

Part two of this article will focus on the details of using Apache Livy + Apache NiFi + Apache Spark with the new processor to call jobs.

Part 2 -> https://community.hortonworks.com/articles/171787/hdf-31-executing-apache-spark-via-executesparkinte...

References

https://community.hortonworks.com/articles/148730/integrating-apache-spark-2x-jobs-with-apache-nifi....

https://community.hortonworks.com/articles/73828/submitting-spark-jobs-from-apache-nifi-using-livy.h...

singhpraveen201 · ‎03-26-2018

Is a Kerberized server supported by LivySessionController?
I tried the same approach on a Kerberized Hadoop cluster but was not able to get the expected results.

TimothySpann · ‎04-03-2018

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_command-line-installation/content/config...

HDF 3.1: Executing Apache Spark via ExecuteSparkInteractive via Apache Livy (Apache NiFi 1.5+) Part 1 of 2
