Apache NiFi DuplicateFlowFile processor usage question

Contributor

Hi,

I have a question about our NiFi flow design, specifically about the usage of the DuplicateFlowFile processor.

The latest documentation says that the processor is intended for load testing.
Does that mean it cannot be used in regular production flows? Please confirm.


https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/nifi-2.0.0-M4/or...

"Intended for load testing, this processor will create the configured number of copies of each incoming FlowFile."

 

Please review the current and proposed designs below and advise whether the new approach, which uses the DuplicateFlowFile processor to clone flowfiles, has any implications.

Current design:

shiva239_1-1725481829248.png

New design (with the DuplicateFlowFile processor):

shiva239_2-1725481952414.png

1 ACCEPTED SOLUTION

Super Guru

Hi @shiva239 ,

I don't think there is a problem with using the DuplicateFlowFile processor in a production environment. If it were really intended only for test environments, it would not be provided as a standard option to begin with. In fact, when I searched the community for questions about flowfile replication, I found this post that mentions this processor as part of the solution, and I don't see any comments advising against it by citing test vs. prod, regardless of whether it helped in that case or not.
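To make the cloning behavior concrete, here is a minimal Python sketch that models it outside NiFi. It is not NiFi code. It assumes, based on my reading of the processor documentation, that the original and all copies are sent on together and that each one carries a copy.index attribute (0 for the original, incrementing for the copies). The helper names, the target.db and target.table attributes, and the target list are hypothetical and only there for illustration.

```python
# Simplified model of DuplicateFlowFile; this is not NiFi code.
# Assumption: every output carries a "copy.index" attribute, 0 for the original.

# Hypothetical target databases and tables, one per output flowfile.
TARGETS = [("db_a", "orders"), ("db_b", "orders"), ("db_c", "orders")]


def duplicate_flowfile(attributes, number_of_copies):
    """Return the original plus `number_of_copies` clones, each tagged with copy.index."""
    return [
        {**attributes, "copy.index": str(index)}
        for index in range(number_of_copies + 1)
    ]


def assign_target(flowfile):
    """Pick the target DB and table from copy.index (hypothetical attribute names)."""
    db, table = TARGETS[int(flowfile["copy.index"])]
    return {**flowfile, "target.db": db, "target.table": table}


# One original plus two copies gives three flowfiles, one per target.
outputs = [
    assign_target(flowfile)
    for flowfile in duplicate_flowfile({"filename": "input.csv"}, len(TARGETS) - 1)
]

for flowfile in outputs:
    print(flowfile["copy.index"], flowfile["target.db"], flowfile["target.table"])
```

In a real flow the lookup step would be done by whatever processor you use to set attributes after the duplication; the point is only that one index per copy is enough to pick the target.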

However, if you are not comfortable using the Duplicate processor, there are of course other ways of doing the same thing in NiFi. Let's assume you wanted to write code to address this problem instead of using NiFi: how would you do it without repeating the same code for each target DB? One of the first things that comes to mind is a loop. You can create a loop in NiFi in multiple ways; the easiest I can think of is to use the RetryFlowFile processor. Although it's intended more for error handling, you can still use it to build a loop. After getting the file, set the maximum number of retries to the number of times (or steps) you want to execute, then connect the retry relationship both back to RetryFlowFile itself and to the next step (assign DB & Table based on index). Once the number of retries is exceeded, you can handle that via the retries_exceeded relationship, or simply terminate that relationship so the flowfile is completed.

The RetryFlowFile will set a counter on each retried flowfile which you can use to assign the DB and Table accordingly.
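The loop can be pictured with a small Python sketch. This is a simplified model of the flow, not NiFi code: it assumes the counter starts at 1 on the first pass and that the flowfile goes to retries_exceeded once the counter would pass the configured maximum, which matches the five events in the example that follows. The attribute name flowfile.retries is the one used in this example; the function names and the target table are hypothetical.

```python
from collections import deque

MAX_RETRIES = 5
RETRY_ATTRIBUTE = "flowfile.retries"

# Hypothetical mapping from loop index to (database, table).
TARGETS = {
    1: ("db_a", "orders"),
    2: ("db_b", "orders"),
    3: ("db_c", "orders"),
    4: ("db_d", "orders"),
    5: ("db_e", "orders"),
}


def retry_flowfile(attributes, max_retries=MAX_RETRIES):
    """Model one pass through the retry step: return (relationship, attributes)."""
    count = int(attributes.get(RETRY_ATTRIBUTE, 0)) + 1
    if count > max_retries:
        return "retries_exceeded", dict(attributes)
    return "retry", {**attributes, RETRY_ATTRIBUTE: str(count)}


def run_loop(attributes, max_retries=MAX_RETRIES):
    """Send 'retry' both back into the loop and on to the next step."""
    downstream = []
    queue = deque([dict(attributes)])
    while queue:
        relationship, flowfile = retry_flowfile(queue.popleft(), max_retries)
        if relationship == "retry":
            queue.append(flowfile)       # connection back to the retry step
            downstream.append(flowfile)  # connection to the next step
        # retries_exceeded is terminated, which ends the loop
    return downstream


copies = run_loop({"filename": "input.csv"})
indexes = [int(copy[RETRY_ATTRIBUTE]) for copy in copies]
assigned = [TARGETS[index] for index in indexes]

for index, (db, table) in zip(indexes, assigned):
    print(index, db, table)
```

Each flowfile that reaches the next step carries a different counter value, so the step that assigns the DB and table only needs a lookup on that one attribute.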

Here is an example of a simple flow that loops 5 times over a single flowfile produced by GenerateFlowFile:

SAMSAL_0-1725695086134.png

This is how RetryFlowFile is configured:

SAMSAL_1-1725695175006.png

SAMSAL_2-1725695220713.png

The Retry Attribute property names the attribute that stores the retry count, here "flowfile.retries".

If you run GenerateFlowFile once and look at the LogMessage data provenance, you will see that it was executed 5 times against the same content:

SAMSAL_3-1725695422079.png

If you check the attributes of the flowfile on each event, you will see that flowfile.retries holds the number of the pass (1 through 5) on which the flowfile was retried. Keep in mind that data provenance lists the most recent event at the top, so the flowfile in the first row has the value 5:

SAMSAL_4-1725695664151.png

Hope that helps. If it does, please accept the solution.

Thanks

3 REPLIES

Contributor

Appreciate any help with my above question about DuplicateFlowFile usage. 

@SAMSAL    @MattWho   

Contributor

Thanks @SAMSAL for the clarification and the alternatives for looping in NiFi. We will consider using the DuplicateFlowFile processor. Do you think the NiFi documentation should be updated to explicitly mention that it is not only for load tests but can also be used in production flows when there is a need to clone flowfiles?