Created 04-14-2016 10:48 PM
I'm aware of ExecuteProcess, which could invoke spark-submit, but I'm not running NiFi on an HDP node.
I receive lots of arbitrary CSV and JSON files that I don't have pre-existing tables for. Instead of trying to script DDL creation inside NiFi, it would be nice to invoke a Spark job that infers schema and creates tables from data already loaded to HDFS.
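Roughly, I'm imagining a job like this minimal PySpark sketch (Spark 2.x API; the HDFS paths, the "raw" database, and the table names are illustrative assumptions, and the database is assumed to already exist):

    # Minimal PySpark sketch: infer schema from raw files already landed in
    # HDFS and register the results as Hive tables.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("infer-and-create-tables")
             .enableHiveSupport()
             .getOrCreate())

    # CSV: assume a header row; inferSchema samples the data to guess types
    csv_df = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv("hdfs:///landing/csv/orders/"))
    csv_df.write.mode("overwrite").saveAsTable("raw.orders")

    # JSON: Spark infers the schema from the documents automatically
    json_df = spark.read.json("hdfs:///landing/json/events/")
    json_df.write.mode("overwrite").saveAsTable("raw.events")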
Created 04-26-2016 01:05 PM
You should be able to use the ExecuteProcess processor to run spark-submit, assuming the jar with your job code is already available on that system. You would need both the job jar and the Spark client on each of the NiFi cluster nodes, but spark-submit would just call out to the YARN ResourceManager, assuming you have a direct network path from NiFi to YARN (or perhaps you are running Spark standalone on the same cluster). I do agree with Simon that just using NiFi for most of this is probably a better solution.
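For reference, a rough sketch of the call ExecuteProcess would effectively be making on each node, written here as a Python subprocess invocation; the spark-client path, jar location, main class, and argument are all assumptions:

    # Rough Python equivalent of what ExecuteProcess does here: shell out to
    # spark-submit on the local node and let YARN run the job.
    import subprocess

    subprocess.run([
        "/usr/hdp/current/spark-client/bin/spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--class", "com.example.InferSchemaJob",  # hypothetical job class
        "/opt/jobs/infer-schema-job.jar",         # jar must exist on this node
        "hdfs:///landing/csv/orders/",            # input path passed to the job
    ], check=True)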
Created 04-14-2016 10:58 PM
Can we use Spark's REST API to invoke the job when the flow file hits the InvokeHTTP processor?
Created 04-14-2016 11:03 PM
This looks interesting, but it would require an already-running Spark application and the ability to communicate with the correct Hadoop worker node, which doesn't seem straightforward. Your idea did make me think about the YARN RM's REST API, so have an upvote. I still want to see if there's a more straightforward suggestion, so I'll leave the question open.
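For anyone exploring that route, here's a hedged outline of the YARN RM REST submission flow (the ResourceManager's new-application and submit endpoints); everything concrete in it (host, command, resources) is an assumption:

    # Hedged outline of YARN's "Submit Application" REST flow. A real Spark
    # submission also needs local resources, environment, and credentials in
    # the am-container-spec, so treat this as the shape of the API, not a
    # complete recipe.
    import requests

    RM = "http://resourcemanager.example.com:8088"

    # Step 1: ask the ResourceManager for a fresh application id
    app = requests.post(RM + "/ws/v1/cluster/apps/new-application").json()

    # Step 2: submit an application with a container launch command
    submission = {
        "application-id": app["application-id"],
        "application-name": "infer-schema-job",
        "application-type": "SPARK",
        "am-container-spec": {
            "commands": {
                "command": "spark-submit --master yarn /opt/jobs/infer-schema-job.jar"
            }
        },
        "resource": {"memory": 1024, "vCores": 1},
    }
    requests.post(RM + "/ws/v1/cluster/apps", json=submission).raise_for_status()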
Created 04-14-2016 11:14 PM
https://github.com/spark-jobserver/spark-jobserver#ad-hoc-mode---single-unrelated-jobs-transient-con... details how jobs can be started from the Spark Job Server if one is present. I don't believe the Hortonworks stack includes it by default, but it could still be a good option if this is a requirement.
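If a Job Server is available, the ad-hoc flow from that link boils down to two HTTP calls; a sketch, with the host, jar path, app name, and class name as assumptions (see the jobserver README for the full option list):

    # Sketch of the spark-jobserver ad-hoc flow: upload a jar, then start a
    # transient job over HTTP.
    import requests

    JOBSERVER = "http://jobserver.example.com:8090"

    # Upload the job jar under an application name
    with open("/opt/jobs/infer-schema-job.jar", "rb") as jar:
        requests.post(JOBSERVER + "/jars/inferapp", data=jar).raise_for_status()

    # Start the job; input config goes in the POST body as Typesafe-config text
    resp = requests.post(
        JOBSERVER + "/jobs",
        params={"appName": "inferapp", "classPath": "com.example.InferSchemaJob"},
        data='input.path = "hdfs:///landing/csv/orders/"',
    )
    print(resp.json())  # returns a job id; poll /jobs/<id> for status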
Created 04-15-2016 05:17 PM
While it is quite convenient from an API perspective, Spark is a very heavy solution for inferring schema from individual CSV and JSON files, unless they are very large.
A better solution would be to use NiFi to infer the schema. The latest version of HDF includes the InferAvroSchema processor. This will take CSV or JSON files and attach an Avro schema to the flow file as an attribute. You can then use this with the Convert processors to get schema-based data into a database, or onto HDFS, for example.
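For illustration, the inferred schema lands on the flow file as an Avro schema document in the inferred.avro.schema attribute; a CSV with an id and a name column might produce something like this (made-up example):

    {
      "type": "record",
      "name": "orders",
      "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"}
      ]
    }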
Created 04-25-2016 03:54 PM
Great solution for schema inference, @Simon Elliston Ball, but I still have the question about launching Spark and/or other YARN jobs from NiFi.
Created 02-20-2020 07:33 AM
It's an old post, but the question remains: you get the schema, but you don't have the table. How do you create the table based on the schema? With Spark it's easy, but I can't see any solution in NiFi.
Created 02-20-2020 09:15 AM
Yes, this thread is older and was marked 'Solved' in April of 2016; you would have a better chance of receiving a resolution by starting a new thread. This will also give you the opportunity to include details specific to your question, which could help others provide a more accurate answer.
Created 09-22-2016 08:59 PM
The Spark Livy API will be supported eventually.
You can also use site-to-site to trigger Spark Streaming,
or use Kafka to trigger Spark Streaming.
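Once Livy is in place, triggering a job is a single HTTP call that InvokeHTTP could make; a sketch against Livy's POST /batches endpoint, with the host, jar path, and class name as assumptions:

    # Sketch of triggering a Spark job through Apache Livy's batch API
    # (POST /batches). From NiFi, InvokeHTTP could send the same request.
    import requests

    LIVY = "http://livy.example.com:8998"

    batch = {
        "file": "hdfs:///jobs/infer-schema-job.jar",
        "className": "com.example.InferSchemaJob",
        "args": ["hdfs:///landing/csv/orders/"],
    }
    resp = requests.post(LIVY + "/batches", json=batch)
    print(resp.json())  # includes an id; poll GET /batches/<id> for state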