I want to easily integrate Apache Spark jobs with my Apache NiFi flows. Fortunately with the release of HDF 3.1, I can do that via Apache NiFi's ExecuteSparkInteractive processor.

The first step is to set up a CentOS 7 cluster with HDF 3.1; follow the well-written guide here.

56659-installhdf31services.png

With the magic of time lapse photography, instantly we have a new cluster of goodness:

56657-hdf31ambariscreen.png

It is important to note the new NiFi Registry for version control and more. We also get the new Kafka 1.0, an updated SAM, and the ever-important updated Schema Registry.

56658-hdf31nifi15.png

The star of the show today is Apache NiFi 1.5, shown here.

My first step is to Add a Controller Service (LivySessionController).


56643-createlivysessioncontroller.png

Then we enter the Apache Livy server host, which you can find in your Ambari UI. The default port is 8999. For my session I am using Python, so I picked pyspark. You can also pick pyspark3 for Python 3 code, spark for Scala, and sparkr for R.

56666-livycontrollerproperties.png
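For readers curious what those controller settings correspond to on the Livy side, here is a minimal sketch of the REST request that opens an interactive session of a given kind. It only builds the request and does not send it. The host name `livy.example.com` and the helper name `build_session_request` are my own placeholders, not something from the NiFi controller service, and I am assuming an unsecured Livy endpoint on the default port.

```python
import json
import urllib.request

# Session kinds mentioned above: Scala, Python 2, Python 3, and R.
LIVY_KINDS = {"spark", "pyspark", "pyspark3", "sparkr"}


def build_session_request(host, kind="pyspark", port=8999):
    """Build (but do not send) a Livy 'create session' request.

    Hypothetical helper for illustration; host and port are assumptions.
    """
    if kind not in LIVY_KINDS:
        raise ValueError(f"unsupported Livy session kind: {kind!r}")
    body = json.dumps({"kind": kind}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # 'livy.example.com' is a placeholder host.
    req = build_session_request("livy.example.com")
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # To actually create the session you would call urllib.request.urlopen(req).
```

Keeping the request construction separate from the network call makes it easy to check the URL and payload before pointing it at a real cluster.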

To execute a Python job, you can either pass the code in from a previous processor to the ExecuteSparkInteractive processor or put the code inline. I put the code inline.

56644-executesparkinteractive.png
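As a rough illustration of what inline code might look like, and of how the same code could be submitted to an existing Livy session directly over REST, here is a small sketch. The PySpark snippet, the host name, the session id, and the helper `build_statement_request` are all hypothetical; the processor takes care of the actual submission for you, and I am not claiming this mirrors its internals.

```python
import json
import urllib.request

# Hypothetical inline PySpark code. In a Livy pyspark session the
# SparkContext is already available as `sc`.
INLINE_CODE = "print(sc.parallelize(range(10)).sum())"


def build_statement_request(host, session_id, code, port=8999):
    """Build (but do not send) a Livy 'run statement' request.

    Hypothetical helper; assumes an unsecured Livy endpoint.
    """
    body = json.dumps({"code": code}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/sessions/{session_id}/statements",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder host and session id.
    req = build_statement_request("livy.example.com", 0, INLINE_CODE)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```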

Two new features of Schema Registry deserve a mention. The first is version comparison:

56667-srcompareversions.png

You click the COMPARE VERSIONS link and now you have a nice comparison UI.

56656-compareschemaversions.png

The second is the amazing new Swagger documentation for interactive documentation and testing of the Schema Registry APIs.

56660-swaggersrgetschemalist.png

Not only do you get all the parameters for input and output, the full URL, and a curl example, you also get to run the code live against your server.

56661-swaggerschemaregistrycreateschema.png

56662-schemaswagger3.png

56663-schemaswagger.png

56664-schemaregistryswagger.png
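The same calls you try out in Swagger can be scripted. The sketch below builds a request for the schema list and pulls the schema names out of a response. The port (7788), the path `/api/v1/schemaregistry/schemas`, and the response shape (an `entities` list whose items carry `schemaMetadata.name`) are assumptions on my part; check them against the Swagger page on your own server before relying on them. The helper names are hypothetical.

```python
import json
import urllib.request


def build_list_schemas_request(host, port=7788):
    """Build (but do not send) a 'list schemas' request.

    Port and path are assumptions; confirm them in your Swagger UI.
    """
    return urllib.request.Request(
        url=f"http://{host}:{port}/api/v1/schemaregistry/schemas",
        headers={"Accept": "application/json"},
        method="GET",
    )


def schema_names(response_body):
    """Extract schema names from a JSON response body.

    Assumes the shape {"entities": [{"schemaMetadata": {"name": ...}}, ...]}.
    """
    payload = json.loads(response_body)
    return [e["schemaMetadata"]["name"] for e in payload.get("entities", [])]


if __name__ == "__main__":
    # 'registry.example.com' is a placeholder host.
    req = build_list_schemas_request("registry.example.com")
    print(req.get_method(), req.full_url)
    # A made-up response body, used only to exercise the parser.
    sample = '{"entities": [{"schemaMetadata": {"name": "weather"}}]}'
    print(schema_names(sample))
```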

I will be adding an article on how to use Apache NiFi to grab schemas from data using InferAvroSchema and publish these new schemas to the Schema Registry via the REST API automagically.

Part two of this article will focus on the details of using Apache Livy + Apache NiFi + Apache Spark with the new processor to call jobs.

Part 2 -> https://community.hortonworks.com/articles/171787/hdf-31-executing-apache-spark-via-executesparkinte...

References

https://community.hortonworks.com/articles/148730/integrating-apache-spark-2x-jobs-with-apache-nifi....

https://community.hortonworks.com/articles/73828/submitting-spark-jobs-from-apache-nifi-using-livy.h...

Comments

Is a Kerberized server supported by LivySessionController?
I tried the same approach on a Kerberized Hadoop cluster but was not able to get the expected results.