Can I use NiFi to launch Spark (or other YARN) jobs?

I'm aware of ExecuteProcess, which could invoke spark-submit, but I'm not running NiFi on an HDP node.

I receive lots of arbitrary CSV and JSON files that I don't have pre-existing tables for. Instead of trying to script DDL creation inside NiFi, it would be nice to invoke a Spark job that infers schema and creates tables from data already loaded to HDFS.
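For example, the kind of job I'd want to launch would look something like this (a rough sketch using the SparkSession API; the HDFS path and table name are made up):

    # Sketch: infer a schema from a raw CSV already on HDFS and register it as a table.
    # The path and table name are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("infer-and-register")
             .enableHiveSupport()
             .getOrCreate())

    # Let Spark infer column names and types from the file itself
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///landing/incoming_file.csv"))

    # Persist as a Hive table so the inferred schema becomes the DDL
    df.write.saveAsTable("staging.incoming_file")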

1 ACCEPTED SOLUTION

Guru

@Randy Gelhausen

You should be able to use the ExecuteProcess processor to run spark-submit, assuming the JAR with your job code is already available on that system. You would need the JAR and the Spark client on each of the NiFi cluster nodes, but spark-submit would just call out to the YARN ResourceManager, assuming you have a direct network path from NiFi to YARN (or perhaps you are running Spark standalone on the same cluster). I do agree with Simon that just using NiFi for most of this is probably the better solution.
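For example, the processor configuration might look something like this (the client path follows HDP's usual layout; the class and JAR names are hypothetical):

    ExecuteProcess properties (sketch):
      Command:           /usr/hdp/current/spark-client/bin/spark-submit
      Command Arguments: --master yarn --deploy-mode cluster
                         --class com.example.SchemaInferJob
                         /opt/jobs/schema-infer.jar hdfs:///landing/incoming_file.csv

Because the deploy mode is cluster, the driver runs inside YARN, so the NiFi node only needs the client and the JAR, not the resources to run the job itself.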

9 REPLIES

Expert Contributor

Can we use Spark's REST API to invoke the job when the flow file hits the InvokeHTTP processor?

http://arturmkrtchyan.com/apache-spark-hidden-rest-api
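A sketch of what that call could look like, assuming a standalone master with its REST gateway on the default port 6066 (the host, JAR path, and class name are made up):

    # Sketch: job submission via the standalone master's REST gateway (port 6066).
    # Host, jar path, and main class are hypothetical.
    import requests

    payload = {
        "action": "CreateSubmissionRequest",
        "appResource": "hdfs:///jobs/schema-infer.jar",
        "mainClass": "com.example.SchemaInferJob",
        "appArgs": ["hdfs:///landing/incoming_file.csv"],
        "clientSparkVersion": "1.6.1",
        "environmentVariables": {"SPARK_ENV_LOADED": "1"},
        "sparkProperties": {
            "spark.app.name": "schema-infer",
            "spark.master": "spark://spark-master:6066",
            "spark.submit.deployMode": "cluster",
            "spark.jars": "hdfs:///jobs/schema-infer.jar",
        },
    }
    resp = requests.post("http://spark-master:6066/v1/submissions/create", json=payload)
    print(resp.json())  # contains a submissionId you can poll for status

Note this gateway is a standalone-mode feature, so it may not fit a YARN setup.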

This looks interesting, but it would require an already-running Spark application and the ability to communicate with the correct Hadoop worker node, which doesn't seem straightforward. Your idea did make me think of the YARN RM's REST API, though, so have an upvote. I still want to see if there's a more straightforward suggestion, so I'll leave the question open.
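For what it's worth, the RM's REST submission is a two-step call, roughly like this (the RM host is hypothetical, and the real am-container-spec for a Spark job is considerably more involved than shown):

    # Sketch of the YARN ResourceManager REST submission flow (two steps).
    # The full submission context for Spark needs a complete am-container-spec;
    # this only shows the shape of the calls.
    import requests

    rm = "http://rm-host:8088/ws/v1/cluster"

    # Step 1: ask the ResourceManager to allocate a new application id
    app_id = requests.post(rm + "/apps/new-application").json()["application-id"]

    # Step 2: submit an application context under that id
    context = {
        "application-id": app_id,
        "application-name": "schema-infer",
        "application-type": "SPARK",
        "am-container-spec": {
            "commands": {"command": "spark-submit ..."},  # elided for brevity
        },
    }
    requests.post(rm + "/apps", json=context)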

Expert Contributor

https://github.com/spark-jobserver/spark-jobserver#ad-hoc-mode---single-unrelated-jobs-transient-con... describes how jobs can be started through the Spark Job Server, if one is present. I don't believe the Hortonworks stack includes it by default, but it could still be a good option if this is a requirement.
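For anyone curious, the REST flow is roughly this (host, JAR, and class names are hypothetical; the jobserver README has the real details):

    # Sketch of the spark-jobserver REST flow.
    # Host, jar, and class names are hypothetical.
    import requests

    sjs = "http://jobserver-host:8090"

    # Upload the application jar under an app name
    with open("schema-infer.jar", "rb") as jar:
        requests.post(sjs + "/jars/schema-infer", data=jar)

    # Start an ad-hoc (transient-context) job against that jar
    resp = requests.post(
        sjs + "/jobs",
        params={"appName": "schema-infer", "classPath": "com.example.SchemaInferJob"},
        data="input.path = hdfs:///landing/incoming_file.csv",  # Typesafe-config style job config
    )
    print(resp.json())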

Guru

While it is quite convenient from an API perspective, Spark is a very heavy solution for inferring schema from individual CSV and JSON files, unless they are very large.

A better solution would be to use NiFi itself to infer the schema. The latest version of HDF includes the InferAvroSchema processor, which takes CSV or JSON files and attaches an Avro schema to the flow file as an attribute. You can then use this with the Convert processors to get schema-based data into a database, or onto HDFS, for example.
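As a sketch, such a flow could be as simple as this (the source and destination processors are just one option each):

    GetFile / ListHDFS        -> pick up the raw CSV or JSON
    InferAvroSchema           -> attaches the inferred schema as a flow file attribute
    ConvertCSVToAvro          -> uses that schema to produce Avro records
    PutHDFS (or PutSQL, ...)  -> lands the now schema-carrying data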

Great solution to schema inference, @Simon Elliston Ball, but I still have the question about launching Spark and/or other YARN jobs from NiFi.

New Contributor

It's an old post, but the question remains: you get the schema, but you don't have the table. How do you create the table based on the schema? With Spark it's easy, but I can't see any solution in NiFi.

@BI_Gabor

Yes, this thread is older and was marked 'Solved' in April of 2016; you would have a better chance of receiving a resolution by starting a new thread. A new thread also gives you the opportunity to include details specific to your question, which could help others provide a more accurate answer.

Bill Brooks, Community Moderator

Master Guru

The Spark Livy API will be supported eventually.

In the meantime, you can also use site-to-site to trigger Spark Streaming, or use Kafka to trigger Spark Streaming.
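For reference, a Livy batch submission is just an HTTP POST, roughly like this (the Livy host and JAR path are hypothetical):

    # Sketch: submitting a batch job through Livy's REST API.
    # Livy host and jar path are hypothetical.
    import requests

    resp = requests.post(
        "http://livy-host:8998/batches",
        json={
            "file": "hdfs:///jobs/schema-infer.jar",
            "className": "com.example.SchemaInferJob",
            "args": ["hdfs:///landing/incoming_file.csv"],
        },
    )
    print(resp.json())  # returns a batch id you can poll at /batches/{id}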