
Running a Spark Job with NiFi using Execute Process

Expert Contributor

Hi, I know there are a number of threads about how to run a Spark job from NiFi, but most of them describe a setup on HDP.

I am using Windows, with Spark and NiFi installed locally.

Can anyone explain how I can configure the ExecuteProcess processor to run the following command (which works when I run it from the command line)?

spark-submit2.cmd --class "SimpleApp" --master local[4] file:///C:/Simple_Project/target/scala-2.10/simple-project_2.10-1.0.jar

1 ACCEPTED SOLUTION

Guru

@Arsalan Siddiqi

You should just be able to bring up the ExecuteProcess processor and configure the command you have there as the command to execute. Just make sure you give it the full path to the spark-submit2.cmd executable (e.g. /usr/bin/spark-submit on Linux). As long as the file and path you are referencing are on the same machine where NiFi is running (assuming it is a single node and not clustered), and the Spark client is present and configured correctly, the processor should just kick off the spark-submit. Make sure you change the scheduling to something more than 0 seconds; otherwise, you will quickly fill up the cluster where the job is being submitted with duplicate jobs. You can also set it to be CRON scheduled.
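For example, on Windows the processor configuration might look roughly like this (the C:\spark\bin path is just an assumption; point it at wherever spark-submit2.cmd actually lives on your machine):

Command:               C:\spark\bin\spark-submit2.cmd
Command Arguments:     --class "SimpleApp" --master local[4] file:///C:/Simple_Project/target/scala-2.10/simple-project_2.10-1.0.jar
Redirect Error Stream: true   (optional, so stderr ends up in the flowfile as well)
Run Schedule (Scheduling tab): 60 sec, or switch to CRON driven with something like 0 0 0 * * ? for once a day at midnight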


2 REPLIES


Super Collaborator

Hi @Arsalan Siddiqi,

As an alternative to the above response, you can use Livy, which means you don't need to worry about configuring the NiFi environment with Spark-specific configuration. Livy accepts REST requests, so this works with the same ExecuteProcess or ExecuteStreamCommand processor; you just need to issue a curl command. This is very handy when your NiFi and Spark are running on different servers.
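A minimal sketch of such a curl call, assuming a Livy server is reachable at localhost on its default port 8998 and the jar path is visible to the host where Livy/Spark runs (shown with shell-style line continuations; put it on one line when configuring the processor on Windows):

curl -X POST http://localhost:8998/batches \
  -H "Content-Type: application/json" \
  -d '{"file": "file:///C:/Simple_Project/target/scala-2.10/simple-project_2.10-1.0.jar", "className": "SimpleApp"}'

Livy responds with a JSON body containing the batch id, which you can then poll with GET /batches/<id> to check the state of the job.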

Please refer to the Livy documentation for details on that front.