Can I use NiFi to launch Spark (or other YARN) jobs?

I'm aware of ExecuteProcess, which could invoke spark-submit, but I'm not running NiFi on an HDP node.

I receive lots of arbitrary CSV and JSON files that I don't have pre-existing tables for. Instead of trying to script DDL creation inside NiFi, it would be nice to invoke a Spark job that infers schema and creates tables from data already loaded to HDFS.
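For example, the kind of job I'd want to launch would look something like this (a rough sketch using the SparkSession API; the HDFS path and table name are made up):

    # Sketch: infer a schema from a raw CSV already on HDFS and register it as a table.
    # The path and table name are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("infer-and-register")
             .enableHiveSupport()
             .getOrCreate())

    # Let Spark infer column names and types from the file itself
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///landing/incoming_file.csv"))

    # Persist as a Hive table so the inferred schema becomes the DDL
    df.write.saveAsTable("staging.incoming_file")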

1 ACCEPTED SOLUTION

Guru

@Randy Gelhausen

You should be able to use the ExecuteProcess processor to run spark-submit, assuming the JAR with your job code is already available on that system. You would need the JAR and the Spark client on each of the NiFi cluster nodes, but spark-submit would just call out to the YARN ResourceManager, assuming you have a direct network path from NiFi to YARN (or perhaps you are running Spark standalone on the same cluster). I do agree with Simon that just using NiFi for most of this is probably the better solution.
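For example, the processor configuration might look something like this (the client path follows HDP's usual layout; the class and JAR names are hypothetical):

    ExecuteProcess properties (sketch):
      Command:           /usr/hdp/current/spark-client/bin/spark-submit
      Command Arguments: --master yarn --deploy-mode cluster
                         --class com.example.SchemaInferJob
                         /opt/jobs/schema-infer.jar hdfs:///landing/incoming_file.csv

Because the deploy mode is cluster, the driver runs inside YARN, so the NiFi node only needs the client and the JAR, not the resources to run the job itself.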

9 REPLIES

Expert Contributor

Can we use Spark's REST API to invoke the job when the flow file hits the InvokeHTTP processor?

http://arturmkrtchyan.com/apache-spark-hidden-rest-api
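A sketch of what that call could look like, assuming a standalone master with its REST gateway on the default port 6066 (the host, JAR path, and class name are made up):

    # Sketch: job submission via the standalone master's REST gateway (port 6066).
    # Host, jar path, and main class are hypothetical.
    import requests

    payload = {
        "action": "CreateSubmissionRequest",
        "appResource": "hdfs:///jobs/schema-infer.jar",
        "mainClass": "com.example.SchemaInferJob",
        "appArgs": ["hdfs:///landing/incoming_file.csv"],
        "clientSparkVersion": "1.6.1",
        "environmentVariables": {"SPARK_ENV_LOADED": "1"},
        "sparkProperties": {
            "spark.app.name": "schema-infer",
            "spark.master": "spark://spark-master:6066",
            "spark.submit.deployMode": "cluster",
            "spark.jars": "hdfs:///jobs/schema-infer.jar",
        },
    }
    resp = requests.post("http://spark-master:6066/v1/submissions/create", json=payload)
    print(resp.json())  # contains a submissionId you can poll for status

Note this gateway is a standalone-mode feature, so it may not fit a YARN setup.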

This looks interesting, but it would require an already-running Spark application and the ability to communicate with the correct Hadoop worker node, which doesn't seem straightforward. Your idea did make me think of the YARN RM's REST API, though, so have an upvote. I still want to see if there's a more straightforward suggestion, so I'll leave the question open.
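For what it's worth, the RM's REST submission is a two-step call, roughly like this (the RM host is hypothetical, and the real am-container-spec for a Spark job is considerably more involved than shown):

    # Sketch of the YARN ResourceManager REST submission flow (two steps).
    # The full submission context for Spark needs a complete am-container-spec;
    # this only shows the shape of the calls.
    import requests

    rm = "http://rm-host:8088/ws/v1/cluster"

    # Step 1: ask the ResourceManager to allocate a new application id
    app_id = requests.post(rm + "/apps/new-application").json()["application-id"]

    # Step 2: submit an application context under that id
    context = {
        "application-id": app_id,
        "application-name": "schema-infer",
        "application-type": "SPARK",
        "am-container-spec": {
            "commands": {"command": "spark-submit ..."},  # elided for brevity
        },
    }
    requests.post(rm + "/apps", json=context)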

Expert Contributor

https://github.com/spark-jobserver/spark-jobserver#ad-hoc-mode---single-unrelated-jobs-transient-con... describes how jobs can be started through the Spark Job Server, if one is present. I don't believe the Hortonworks stack includes it by default, but it could still be a good option if this is a requirement.
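For anyone curious, the REST flow is roughly this (host, JAR, and class names are hypothetical; the jobserver README has the real details):

    # Sketch of the spark-jobserver REST flow.
    # Host, jar, and class names are hypothetical.
    import requests

    sjs = "http://jobserver-host:8090"

    # Upload the application jar under an app name
    with open("schema-infer.jar", "rb") as jar:
        requests.post(sjs + "/jars/schema-infer", data=jar)

    # Start an ad-hoc (transient-context) job against that jar
    resp = requests.post(
        sjs + "/jobs",
        params={"appName": "schema-infer", "classPath": "com.example.SchemaInferJob"},
        data="input.path = hdfs:///landing/incoming_file.csv",  # Typesafe-config style job config
    )
    print(resp.json())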

Guru

While it is quite convenient from an API perspective, Spark is a very heavy solution for inferring schema from individual CSV and JSON files, unless they are very large.

A better solution would be to use NiFi itself to infer the schema. The latest version of HDF includes the InferAvroSchema processor, which takes CSV or JSON files and attaches an Avro schema to the flow file as an attribute. You can then use this with the Convert processors to get schema-based data into a database, or onto HDFS, for example.
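As a sketch, such a flow could be as simple as this (the source and destination processors are just one option each):

    GetFile / ListHDFS        -> pick up the raw CSV or JSON
    InferAvroSchema           -> attaches the inferred schema as a flow file attribute
    ConvertCSVToAvro          -> uses that schema to produce Avro records
    PutHDFS (or PutSQL, ...)  -> lands the now schema-carrying data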

Great solution to schema inference, @Simon Elliston Ball, but I still have the question about launching Spark and/or other YARN jobs from NiFi.

New Contributor

It's an old post, but the question remains: you get the schema, but you don't have the table. How do you create the table based on the schema? With Spark it's easy, but I can't see any solution in NiFi.

@BI_Gabor

Yes, this thread is older and was marked 'Solved' in April of 2016; you would have a better chance of receiving a resolution by starting a new thread. A new thread also gives you the opportunity to include details specific to your question, which could help others provide a more accurate answer.

Bill Brooks, Community Moderator

Master Guru

The Spark Livy API will be supported eventually.

In the meantime, you can also use site-to-site to trigger Spark Streaming, or use Kafka to trigger Spark Streaming.
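For reference, a Livy batch submission is just an HTTP POST, roughly like this (the Livy host and JAR path are hypothetical):

    # Sketch: submitting a batch job through Livy's REST API.
    # Livy host and jar path are hypothetical.
    import requests

    resp = requests.post(
        "http://livy-host:8998/batches",
        json={
            "file": "hdfs:///jobs/schema-infer.jar",
            "className": "com.example.SchemaInferJob",
            "args": ["hdfs:///landing/incoming_file.csv"],
        },
    )
    print(resp.json())  # returns a batch id you can poll at /batches/{id}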