Support Questions

Find answers, ask questions, and share your expertise

Trigger a Spark application from NiFi


My requirement: as soon as the source puts the files in place, my Spark job should be triggered to process them.

Currently I am thinking of doing it as below.

1. Source will push the files into the local directory /temp/abc

2. NiFi ListFile and FetchFile will take care of ingesting those files into HDFS.

3. On the success relationship of PutHDFS, I am thinking of setting up ExecuteStreamCommand.

Could you please suggest whether there is a better approach? What should the configuration for ExecuteStreamCommand look like?

Thanks in advance,

R

5 REPLIES

Master Guru
@RAUI

If you need to trigger the Spark application only after the files are ingested into HDFS, then the ExecuteStreamCommand processor is the correct approach, as this processor accepts incoming connections and can trigger Spark applications.

If you are storing more than one file into HDFS, it would be better to use a MergeContent processor after the PutHDFS processor.

Configure the MergeContent processor to wait for a minimum number of entries (or use the Max Bin Age property, etc.). If you connect the success relationship from PutHDFS directly to ExecuteStreamCommand, the application will be triggered from NiFi as soon as the first file is written to HDFS, without waiting for all the files to be stored in the HDFS directory.

ExecuteStreamCommand processor configuration for triggering a shell script:

(screenshot: 77575-estreamcommand.png — ExecuteStreamCommand configuration)
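To make the configuration concrete, here is a rough sketch of the kind of wrapper script ExecuteStreamCommand could invoke. The master, class name, jar path, and HDFS directory are placeholders (not taken from this thread), and the script only prints the spark-submit command instead of running it:

```shell
#!/bin/bash
# Hypothetical wrapper invoked by ExecuteStreamCommand after PutHDFS succeeds.
# The master, class name, jar, and HDFS path are placeholders -- adjust them
# for your cluster.
build_spark_submit() {
  local hdfs_dir="$1"   # directory just populated by PutHDFS
  echo "spark-submit --master yarn --deploy-mode cluster" \
       "--class com.example.FileProcessor" \
       "/opt/jobs/file-processor.jar ${hdfs_dir}"
}

# ExecuteStreamCommand would pass the directory via its Command Arguments
# property; here the command is echoed for illustration. In production,
# execute it directly instead of echoing.
build_spark_submit "/user/nifi/landing"
```

In the processor, Command Path would point at bash (or at the script itself), and Command Arguments would carry the script path and the HDFS directory.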


@Shu, thanks for your answer. I have one doubt: rather than passing the shell script as a command argument, shouldn't we put the shell script in the Command Path? That approach is tried and tested.

Master Guru

@RAUI

It depends on how your .bashrc file is configured: if the PATH variable in your .bashrc includes the directory that your script is in,

then you don't need to specify bash in the Command Path property.

(screenshot: 76593-estreamcommand.png — ExecuteStreamCommand configuration without an explicit bash Command Path)
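To illustrate the PATH mechanism, this self-contained sketch creates a throwaway scripts directory and adds it to PATH, which is what the export line in ~/.bashrc does. The directory and script name are invented for the demo:

```shell
#!/bin/bash
# Demonstrates why Command Path can be just the script name when the
# script's directory is on PATH (as exported from ~/.bashrc).
SCRIPT_DIR=$(mktemp -d)              # stand-in for your real scripts directory

# Create a dummy trigger script (placeholder for your spark-submit wrapper).
cat > "${SCRIPT_DIR}/trigger_spark.sh" <<'EOF'
#!/bin/bash
echo "spark job triggered"
EOF
chmod +x "${SCRIPT_DIR}/trigger_spark.sh"

# The equivalent of: export PATH=$PATH:/path/to/scripts in ~/.bashrc
export PATH="${SCRIPT_DIR}:${PATH}"

# Now the bare name resolves, so Command Path can simply be trigger_spark.sh
trigger_spark.sh
```

If the directory is not on the PATH seen by the NiFi service user, keep the full path (or bash) in Command Path instead.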

Master Guru

@RAUI

Was the answer helpful in resolving your issue?

Take a moment to log in and click the Accept button below to accept the answer. That would be a great help to community users in finding solutions quickly for these kinds of issues, and it would close this thread.


@RAUI Another option is to build a Spark Streaming application that pulls those files directly from HDFS and processes them.

https://spark.apache.org/docs/latest/streaming-programming-guide.html#file-streams

HTH