
[URGENT] Failed Files reprocessing in Apache NiFi

New Contributor

Hello Team,

I have a scenario where, whenever there is a failure in the ETL process with NiFi at the SOURCE layer, I would like to rerun the failed files. I understand that Apache NiFi does not directly support rerunning/reprocessing failed files, but some kind of workaround exists, and below is my approach:

  • Configure a path to move all the failed file(s) to for reprocessing
  • Clone the FlowFile (from the original/failed one) and use the UpdateAttribute processor
  • Trigger a new run for the failed ones

Now, will the above approach work if NiFi is deployed in distributed mode?

Kindly help me find a feasible solution for this problem statement.

6 REPLIES

Community Manager

@kk-nifi Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our NiFi experts @MattWho and @SAMSAL, who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator



New Contributor

Hello @MattWho and @SAMSAL, could you please help me with this problem ASAP?

Master Mentor

@kk-nifi 

Is there more to your use case and dataflow design that you can share?
Where in your dataflow is the failure happening?
With a successful operation, most processors route the FlowFile to a "success" relationship. A failure results in the FlowFile being routed to a "failure" or "retry" relationship in most processors.

In older versions of NiFi, users would typically build a retry dataflow off of the failure and/or retry relationships. In the latest versions of NiFi, "retry" has been built directly into the processor's relationship configuration. The older approach still works, but the built-in option gives you even more flexibility in how you want to handle retries.

Example:

[Screenshot: MattWho_0-1722263022912.png]

When you select retry on a relationship like "failure" or "retry" (never select retry on "success"), you are given the option to specify the number of retry attempts, a back off policy, and a max back off period.
Rather than the FlowFile being routed to the "failure" relationship when "retry" is selected, the FlowFile remains on the incoming queue to be tried again (in the above example, 10 attempts will be made before the FlowFile is finally routed to the "retry" relationship).
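
If you prefer to apply that same configuration programmatically instead of through the UI, a rough sketch against the NiFi REST API could look like the following. The base URL, processor id, and the retry-related field names (retriedRelationships, retryCount, backoffMechanism, maxBackoffPeriod) are assumptions based on recent NiFi 1.x versions, so verify them against your cluster's /nifi-api documentation; authentication is omitted for brevity.

```python
# Hypothetical sketch: enable built-in retry on a processor's "failure"
# relationship via the NiFi REST API. Field names are assumptions -- check
# your NiFi version's REST API docs before relying on them.
import requests

NIFI_API = "https://nifi-host:8443/nifi-api"            # assumed base URL
PROCESSOR_ID = "00000000-0000-0000-0000-000000000000"   # placeholder id

# Fetch the current processor entity (includes the revision we must echo back).
entity = requests.get(f"{NIFI_API}/processors/{PROCESSOR_ID}").json()

config = entity["component"]["config"]
config["retriedRelationships"] = ["failure"]      # relationships to retry
config["retryCount"] = 10                         # attempts before giving up
config["backoffMechanism"] = "PENALIZE_FLOWFILE"  # or "YIELD_PROCESSOR"
config["maxBackoffPeriod"] = "10 mins"            # cap on the doubling back-off

# Send the update back, keeping the revision NiFi gave us.
requests.put(
    f"{NIFI_API}/processors/{PROCESSOR_ID}",
    json={"revision": entity["revision"], "component": entity["component"]},
)
```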

The Retry Back Off Policy controls how you want to handle these retries:
- "Penalize" applies a penalty to the FlowFile on the inbound connection. NiFi ignores penalized FlowFiles and continues to execute on other non-penalized FlowFiles until the penalty expires.
- "Yield" triggers the processor to yield for a duration of time and then retry the current FlowFile. This method ensures ordered processing, as no other FlowFile will be processed until this one is either successfully retried or all retry attempts have been exhausted and the FlowFile has finally been routed to the "failure" relationship.

The Retry Max Back Off Period controls the maximum time the FlowFile will be penalized, or the maximum time the processor will yield, between retry attempts. The initial penalty and yield times are controlled by the "Penalty Duration" and "Yield Duration" configured on the processor's "Settings" tab. The duration is doubled with each retry until the max back off period is reached.
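
To make the doubling behavior concrete, here is a small sketch (plain Python, not NiFi code) of the wait applied before each retry attempt, assuming a 30 second Penalty/Yield Duration, a 10 minute Retry Max Back Off Period, and 10 retry attempts; the numbers are illustrative only.

```python
# Sketch of the retry back-off schedule described above. The initial duration
# stands in for the processor's Penalty/Yield Duration; it doubles on every
# retry and is capped at the Retry Max Back Off Period.

def backoff_schedule(initial_seconds=30, max_backoff_seconds=600, attempts=10):
    """Return (attempt, wait_seconds) pairs for each retry attempt."""
    wait = initial_seconds
    schedule = []
    for attempt in range(1, attempts + 1):
        schedule.append((attempt, wait))
        wait = min(wait * 2, max_backoff_seconds)
    return schedule

if __name__ == "__main__":
    for attempt, wait in backoff_schedule():
        print(f"attempt {attempt}: wait {wait} s before retrying")
```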

--------------
Now, when it comes to NiFi in a "distributed approach", I assume you mean that you have set up a multi-node NiFi cluster?

In a multi-node NiFi cluster, each node loads its own copy of the dataflow and processes the FlowFiles that are on that same node. Node 1 is completely unaware of the specifics related to any FlowFile that exists on Nodes 2, 3, 4, etc.

So you need to account for this NiFi architecture in your dataflow designs. If there is a required order of execution for some batch of FlowFiles, you'll want to keep that batch on the same NiFi node and make sure you are configuring proper "Prioritizers" on all the connections between processor components.

There is very little detail on your failure handling. Why are you cloning? What is the purpose of the added UpdateAttribute processor?

Hopefully the newer "retry" options available on all relationships will help you with your use case.

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt






New Contributor

what about "replay" option in the NiFi?? Can that be a better option compared to retry? @MattWho 

Master Mentor

@kk-nifi 

Replay is not the proper way to handle failures.  Failures should be handled in real-time through dataflow design. 

The "replay" option is only possible if NiFI still holds the content of the FlowFile you want to replay in its content repository.   The replay ability is really built with the intention to be used in dataflow development testing ( Replaying a FlowFile )
Replay also required numerous manual steps making it difficult to automate retry.
- First you need to execute a Provenance query.
- From list of provenance events select the event(s) you want to replay one by one.
- If content is still available you will have option to "replay" that FlowFile.

There is also an option to "Replay last event", but again it only works if the last FlowFile's content still exists in the NiFi node's content repository. In your case, you mention multiple failed FlowFiles, so this will not work to replay them all.
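
For completeness, a script driving the NiFi REST API could walk through those same manual steps, which also illustrates why this is awkward to automate. The sketch below is hypothetical: the endpoint paths and the request/response field names are assumptions based on recent NiFi versions and should be checked against your cluster's /nifi-api documentation, and authentication is left out.

```python
# Hypothetical sketch of the manual replay steps (provenance query -> pick
# events -> replay). Endpoints and field names are assumptions -- verify
# against your NiFi version before using.
import time
import requests

NIFI_API = "https://nifi-host:8443/nifi-api"  # assumed base URL

def replay_events_for(component_id):
    # 1. Submit a provenance query (assumed request shape).
    submitted = requests.post(
        f"{NIFI_API}/provenance",
        json={"provenance": {"request": {"maxResults": 1000}}},
    ).json()
    query_id = submitted["provenance"]["id"]

    # 2. Poll until the query finishes, then collect its events.
    while True:
        result = requests.get(f"{NIFI_API}/provenance/{query_id}").json()
        if result["provenance"]["finished"]:
            events = result["provenance"]["results"]["provenanceEvents"]
            break
        time.sleep(1)
    requests.delete(f"{NIFI_API}/provenance/{query_id}")  # drop the query

    # 3. Replay matching events one by one, if their content is still available.
    for event in events:
        if event.get("componentId") == component_id and event.get("replayAvailable"):
            requests.post(
                f"{NIFI_API}/provenance-events/replays",
                json={"eventId": event["eventId"],
                      "clusterNodeId": event.get("clusterNodeId")},
            )
```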

NiFi stores content in content claims within the NiFi content_repository. A content claim can hold the content for one to many FlowFiles. Once all FlowFiles referencing a content claim have reached a point of auto-termination, the claimant count drops to zero. At that point the content claim will either be moved to archive or deleted, depending on the archive configuration. Even if archived, it is only retained for a limited amount of time.
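
If you want to see how your own nodes are configured, the relevant archive settings live in nifi.properties. A quick sketch for printing them is below; the file path is an assumption for a typical install, so adjust it for your environment.

```python
# Rough sketch: print the content-repository archive settings that determine
# whether and how long archived content survives. The path is an assumption.
from pathlib import Path

NIFI_PROPERTIES = Path("/opt/nifi/conf/nifi.properties")  # assumed install path

ARCHIVE_KEYS = (
    "nifi.content.repository.archive.enabled",
    "nifi.content.repository.archive.max.retention.period",
    "nifi.content.repository.archive.max.usage.percentage",
)

for line in NIFI_PROPERTIES.read_text().splitlines():
    key, _, value = line.partition("=")
    if key.strip() in ARCHIVE_KEYS:
        print(f"{key.strip()} = {value.strip()}")
```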

Also keep in mind that replay is NOT taking the original FlowFile and replaying it.  Replay generates a new FlowFile with all the same FlowFile attributes and same content as the original FlowFile.

Programmatic handling in the dataflow is better. On failure, configure auto-retry as I described, or route the failure to some other processor(s) (optionally for tagging, updating, etc.) and then route back to the failed processor. Or, better yet, configure X number of auto-retries so that the FlowFile is only routed to the "failure" relationship if all retry attempts end up failing.

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt

Community Manager

@kk-nifi Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks.


Regards,

Diana Torres,
Community Moderator

