Filter duplicate files in NiFi


I have a GetSFTP processor in a 3-node cluster. It was previously scheduled to run on the primary node only. Because my node was not working properly, I had to schedule it to run on "all nodes", and as a result I am receiving duplicate files.

Could you please let me know how to filter these so that only one of the two files (both are identical) is loaded into HDFS? In other words, I need to put only one of the two duplicates into the data lake.

Thank you


@PVVK  @Kezia 





You can use the DetectDuplicate processor and route only its non-duplicate relationship to HDFS.
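To make the idea concrete, here is a minimal Python sketch of the routing logic, not NiFi code. It assumes duplicates are identified by a hash of the file content; the function name `route_files` and the in-memory `seen` set are hypothetical stand-ins. In NiFi itself, the key comes from the processor's Cache Entry Identifier property, and the set of seen keys lives in the configured Distributed Cache Service, which needs to be shared by all nodes for this to work across the cluster.

```python
import hashlib


def route_files(files, seen=None):
    """Split (name, content) pairs into non-duplicates and duplicates.

    `seen` stands in for the shared cache of identifiers already observed.
    The first file with a given content hash is routed to non-duplicate;
    any later file with the same hash is routed to duplicate.
    """
    seen = set() if seen is None else seen
    non_duplicate, duplicate = [], []
    for name, content in files:
        # Assumption: identical content means "same file".
        key = hashlib.sha256(content).hexdigest()
        if key in seen:
            duplicate.append(name)
        else:
            seen.add(key)
            non_duplicate.append(name)
    return non_duplicate, duplicate


# Two nodes fetched the same file; only the first copy goes on to HDFS.
to_hdfs, dropped = route_files([
    ("a.csv", b"id,value\n1,10\n"),
    ("a.csv", b"id,value\n1,10\n"),
    ("b.csv", b"id,value\n2,20\n"),
])
```

Only the files in `to_hdfs` would continue to the HDFS step; the ones in `dropped` can be discarded or logged.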


That said, GetSFTP should work fine when it is configured to run on the primary node only. What errors were you seeing with that setup?


Thank you for the reply, @Kezia. I was able to filter out the duplicates using the DetectDuplicate processor.

This is the error I was getting when the GetSFTP processor was scheduled on the primary node:

GetFTP[id=xxxx] Unable to fetch listing from remote server due to Connection timed out (Connection timed out): Connection timed out (Connection timed out)
