Created 09-13-2020 05:22 AM
I am using MiNiFi (0.5.0) to pick up and transfer files from a Linux machine to NiFi (1.9.1). In a few cases I can see duplicate files being transferred to NiFi.
The flow is set up as below:
GetFile -> LogAttribute -> PutFile (archive) -> RemoteProcessGroup
Log 1:
minifi-app.log:2020-09-13 06:09:00,760 INFO [Timer-Driven Process Thread-4] o.a.nifi.remote.StandardRemoteGroupPort RemoteGroupPort[name=Input_Port_GES,targets=https://nifi1.myorg.com:7071/nifi] Successfully sent [StandardFlowFileRecord[uuid=345d9b6d-e9f7-4dd8-ad9a-a9d66fdfd902,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1599937413340-1, container=default, section=1], offset=1073154, length=237],offset=0,name=RandomFile1154.txt,size=237]] (237 bytes) to nifi://nifi1.myorg.com:7074 in 48 milliseconds at a rate of 4.74 KB/sec
Log 2:
minifi-app.log:2020-09-13 06:09:01,910 INFO [Timer-Driven Process Thread-5] o.a.nifi.remote.StandardRemoteGroupPort RemoteGroupPort[name=Input_Port_GES,targets=https://nifi1.myorg.com:7071/nifi] Successfully sent [StandardFlowFileRecord[uuid=f74eb941-a233-4f9e-86ff-07723940f012,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1599937413340-1, container=default, section=1], offset=1109014, length=237],offset=0,name=RandomFile1154.txt,size=237],
StandardFlowFileRecord[uuid=522a4350-4cab-476c-a087-a3793101412e,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1599937413340-1, container=default, section=1], offset=1074338, length=235],offset=0,name=RandomFile1346.txt,size=235]] (472 bytes) to nifi://nifi1.myorg.com:7074 in 30 milliseconds at a rate of 14.9 KB/sec
In the above logs it can be seen that RandomFile1154.txt is transferred once at 2020-09-13 06:09:00,760 and then again at 2020-09-13 06:09:01,910, along with RandomFile1346.txt.
I went through the StandardRemoteGroupPort code and can see that once the transfer is successful, the session is committed and the FlowFile should not be available for the next transfer.
I added logging to check whether GetFile picked the file up twice, but that is not the case; the log printed only once.
Please share your thoughts on this.
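For reference, a quick scan over minifi-app.log along these lines (the file path and the name= pattern are illustrative, based on the log format above) can confirm which filenames were sent more than once:

import re
from collections import Counter

# Count filenames in "Successfully sent" entries. Matching ",name=" rather
# than "name=" skips the port's [name=...] attribute and only catches the
# FlowFile filenames inside each StandardFlowFileRecord.
sent = Counter()
with open("minifi-app.log") as log:
    for line in log:
        if "Successfully sent" in line:
            sent.update(re.findall(r",name=([^,\]]+)", line))

for name, count in sent.items():
    if count > 1:
        print(f"{name} sent {count} times")

Against the two log lines above, this prints "RandomFile1154.txt sent 2 times".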
Created 09-14-2020 09:20 AM
@Umakanth
From your shared log lines we can see two things:
1. "LOG 1" shows "StandardFlowFileRecord[uuid=345d9b6d-e9f7-4dd8-ad9a-a9d66fdfd902" and "LOG 2" shows "Successfully sent [StandardFlowFileRecord[uuid=f74eb941-a233-4f9e-86ff-07723940f012". This tells us these "RandomFile1154.txt" are two different FlowFiles. So does not look like RPG sent the same FlowFile twice, but rather sent two FlowFiles with each referencing the same content. I am not sure how you have your LogAttribute processor configured, but you should look for the log output produced by these two uuids to learn more about these two FlowFiles. I suspect from your comments you will only find one of these passed through your LogAttribute processor.
2. We can also see from both logs where these two FlowFiles' content lives in the content_repository: both sit in the same resource claim (id=1599937413340-1, container=default, section=1), though at different offsets (the sketch after this list compares the two records field by field):
"LOG 1" --> claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1599937413340-1, container=default, section=1], offset=1073154, length=237],offset=0,name=RandomFile1154.txt,size=237]
"LOG 2" --> claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1599937413340-1, container=default, section=1], offset=1109014, length=237],offset=0,name=RandomFile1154.txt,size=237]
Duplication like this typically happens when a FlowFile is cloned somewhere in your dataflow, for example when the same relationship from a processor is routed to two connections.
Since you saw that GetFile only ingested the file once, that rules out GetFile as the source of this duplication; had it been GetFile, you would not have seen matching claim information. LogAttribute only has a single "success" relationship, so if you had drawn two connections with "success" defined on both, you would have seen duplicates of every piece of ingested content; that seems unlikely as well. That leaves your PutFile processor, which has both "success" and "failure" relationships. I suspect the "success" relationship is assigned to the connection going to your Remote Process Group and the "failure" relationship to a connection that loops back on PutFile itself(?). If you had accidentally drawn that "failure" connection twice (one may be stacked on top of the other), any FlowFile that failed in PutFile would be routed to one failure connection and cloned to the other. When both were later processed successfully by PutFile, you would end up with the original and the clone being sent to your RPG.
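One way to double check for a duplicated connection without squinting at the canvas is to scan the MiNiFi config.yml for a source/relationship pair that feeds more than one connection, since that is exactly where NiFi/MiNiFi clones FlowFiles. A rough sketch (the key names assume the standard MiNiFi config.yml connection schema - "source name", "source relationship names", "destination name" - so verify them against your own file):

from collections import Counter
import yaml  # PyYAML

# Count connections per (source, relationship) pair; more than one
# connection fed by the same relationship means every FlowFile routed
# to that relationship is cloned.
with open("config.yml") as f:
    flow = yaml.safe_load(f)

pairs = Counter()
for conn in flow.get("Connections", []):
    for rel in conn.get("source relationship names", []):
        pairs[(conn.get("source name"), rel)] += 1

for (source, rel), count in sorted(pairs.items()):
    if count > 1:
        print(f'{source} routes "{rel}" to {count} connections -> FlowFiles cloned')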
Hope this helps,
Matt