
Spark job execution in a NiFi cluster

Contributor

I have set up a three-node NiFi cluster and created a custom SparkJobExecutor processor.

My workflow is roughly this: one processor takes input parameters and passes them to the SparkJobExecutor processor, which then waits for the success/failure result returned by the Spark job. The rest of my flow is decided based on that result.

Since this runs in a cluster, suppose one node triggers a Spark job and waits for its response (which can take 3-4 hours), and that node goes down in the meantime. Can the other nodes in the cluster get the job's final status (success/failure)?
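(To make the setup concrete, here is a minimal sketch of what the onTrigger method of such a custom processor might look like. This is not the poster's actual code: the spark-submit arguments and the job.input attribute are hypothetical placeholders.)

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class SparkJobExecutor extends AbstractProcessor {

    // Outgoing relationships the rest of the flow branches on.
    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("Spark job exited with code 0").build();
    static final Relationship REL_FAILURE = new Relationship.Builder()
            .name("failure").description("Spark job exited with a non-zero code").build();

    @Override
    public Set<Relationship> getRelationships() {
        return new HashSet<>(Arrays.asList(REL_SUCCESS, REL_FAILURE));
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        try {
            // Launch spark-submit with parameters carried on the FlowFile
            // ("job.input" is a hypothetical attribute) and block until done.
            final Process proc = new ProcessBuilder(
                    "spark-submit", "--class", "com.example.MyJob",
                    "/opt/jobs/my-job.jar", flowFile.getAttribute("job.input"))
                    .start();
            final int exitCode = proc.waitFor(); // may block for 3-4 hours
            // Only now is the FlowFile committed to success or failure; if the
            // node dies before this point, the FlowFile is restored to the
            // incoming queue on restart and the job is triggered again.
            session.transfer(flowFile, exitCode == 0 ? REL_SUCCESS : REL_FAILURE);
        } catch (final Exception e) {
            getLogger().error("Failed to run Spark job", e);
            session.transfer(flowFile, REL_FAILURE);
        }
    }
}

The blocking waitFor() call is where the 3-4 hour window lives; the replies below address what happens if the node dies while blocked there.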


5 REPLIES

Super Mentor

@Gaurav Jain

In a cluster, the only behind-the-scenes communication that occurs is in the form of heartbeat messages sent from each node to the currently elected cluster coordinator. These heartbeat messages contain only health and status information. If the node running the Spark job goes down, not only would the health and status messages stop, but nothing in those messages would indicate the status of a currently executing Spark job. The FlowFile that was used to trigger the Spark job will be restored to the last queue it was in before the node went down. This means that when the node comes back online, it will trigger the same Spark job to run again.

Thank you,

Matt

Contributor

So there is no way to get the success/failure status of a Spark job if the node executing it goes down?

Super Mentor

@Gaurav Jain

If the NiFi node suddenly goes down, how is it going to notify the other nodes of this?

If the node goes down, the result of the job is neither a failure nor a success as NiFi defines them. The FlowFile that triggers your SparkJobExecutor should remain tied to the incoming connection until the job successfully completes or reports a failure; at that time the FlowFile is moved to the corresponding relationship. If the NiFi node goes down, FlowFiles are restored to their last known connection when it comes back up. This means this FlowFile will trigger your SparkJobExecutor to run again.

Are you looking only for a way to notify another node that the last Spark job did not complete, or also for a way for that other node to then run the job? The latter is even more difficult, since you must also tell the node that went down not to run the job again the next time it starts back up.

As far as the notification goes, you might be able to build a flow using the new Wait and Notify processors just released in Apache NiFi 1.2.0. You could send a copy of your FlowFile to another one of your nodes before executing the Spark job, and then send another FlowFile after the job completes. The other node would receive the first FlowFile and send it to a Wait processor. The Wait processor can be configured with a time limit; should that time expire, the FlowFile gets routed to the 'expired' relationship, which you can use to run the job again on that node or simply send out an email alert. If the job completes before the expiration time, a FlowFile is sent to the same node to notify successful completion, which causes the Wait processor to route the waiting FlowFile to the 'success' relationship, which you may choose to simply terminate. Here are the docs for these new processors, followed by a rough sketch of such a flow:

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.2.0/org.apache...

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.2.0/org.apache...
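One possible shape for that flow, sketched in outline (the ${job.id} correlation attribute, the 5-hour expiration, and the port names are illustrative assumptions; Wait and Notify also coordinate through a shared Distributed Map Cache controller service):

Node A (runs the Spark job):
    incoming FlowFile
      -> copy 1 -> Remote Process Group -> Node B input port "job-started"
      -> copy 2 -> SparkJobExecutor
                     -> success/failure -> Remote Process Group -> Node B input port "job-finished"

Node B (watchdog):
    Input Port "job-started" -> Wait
        Release Signal Identifier: ${job.id}
        Expiration Duration: 5 hours
        -> success (completion signal arrived in time): terminate or log
        -> expired (no signal within 5 hours): PutEmail alert, or re-trigger the job here
    Input Port "job-finished" -> Notify
        Release Signal Identifier: ${job.id}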

Bottom line: NiFi has no behind-the-scenes monitoring capability to accomplish what you are trying to do here, so a programmatic dataflow design must be used to meet this need.

Now if you are talking about a node simply becoming disconnected from the cluster, that is a different story. Just because a node disconnects does not mean it shuts down or stops running its dataflows. It will continue to run as normal and constantly attempt to reconnect.

Thanks,

Matt

Contributor

@Matt Clarke

Thanks!

Super Mentor

@Gaurav Jain

Was I able to successfully answer your question? If so, please mark the answer as accepted.

Thank you,

Matt