
DetectDuplicate failing - sends on every file, even if several have the same filename


I have a NiFi flow that goes like this:

ListFile ->
FetchFile ->
HashContent ->
DetectDuplicate ->
UpdateAttribute (several of them) ->
PutS3Object

I'm ingesting files daily, and I'd like the flow to send only new files through the pipeline; hence my duplicate problem. ListFile config:

[Screenshot: ListFile configuration]
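To spell out the behavior I'm expecting, here's a quick Python sketch (the set is just a stand-in for DetectDuplicate's cache, not actual NiFi internals):

```python
# Sketch of the behavior I want: each daily run, only filenames
# we haven't seen before should continue down the flow.
seen = set()  # stand-in for DetectDuplicate's distributed cache

def process_batch(filenames):
    new_files = []
    for name in filenames:
        if name in seen:
            continue            # duplicate: should be filtered out
        seen.add(name)
        new_files.append(name)  # new file: continues on to PutS3Object
    return new_files

# Day 1: everything is new
assert process_batch(["MBA0001.txt", "MBA0023.txt"]) == ["MBA0001.txt", "MBA0023.txt"]
# Day 2: a repeated file should be dropped, not re-sent
assert process_batch(["MBA0001.txt", "MBA0099.txt"]) == ["MBA0099.txt"]
```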

Multiple duplicates are getting pulled in by the file-fetching processors. For example, several copies of MBA0001.txt or MBA0023.txt come through, and those are also the filename attributes of the flow files.

I've set DetectDuplicate to detect duplicates based on ${filename}. But the processor doesn't filter anything out; it sends the same number of files on to the next stage. DetectDuplicate config:

[Screenshot: DetectDuplicate configuration]

So if 50 files go into DetectDuplicate and, say, 25 of them are duplicates, 50 still come out. I don't get it. Any idea why? The documentation hasn't been helpful.
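For what it's worth, here's my mental model of what DetectDuplicate should do with ${filename} as the key, again just a Python sketch with a plain dict standing in for the distributed cache service:

```python
# My understanding of DetectDuplicate keyed on ${filename}
# (pseudo-sketch; the dict is a placeholder for the cache service):
cache = {}

def detect_duplicate(flowfile_attributes):
    key = flowfile_attributes["filename"]  # evaluated ${filename}
    if key in cache:
        return "duplicate"      # should NOT continue to PutS3Object
    cache[key] = True
    return "non-duplicate"      # continues down the flow

# With 50 files where 25 keys repeat, I'd expect only 25 on the
# non-duplicate relationship -- but I'm seeing all 50 go through.
```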

With FetchFile, when I clear the not.found connection, the processor doesn't send duplicates on. But if I don't manually empty that connection, it sends everything on, duplicates and all. This is scheduled to run once a day, and I've set FetchFile to delete files once they're pulled.

Here's that `FetchFile` config:

[Screenshot: FetchFile configuration]