Filter duplicate files in NiFi

Contributor

I have a GetSFTP processor running on a 3-node cluster. It was previously scheduled to run on the primary node only. Because my node was not working properly, I had to schedule it to run on "all nodes", and as a result I am receiving duplicate files.

Could you please let me know how to filter these so that only one of the 2 files (both are the same file) is loaded into HDFS? In other words, I need to put just one of the two duplicates into the data lake.

Thank you

@PVVK  @Kezia 

2 REPLIES

Contributor

Hello,

You can use the DetectDuplicate processor and route only the non-duplicate relationship to HDFS.
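
To illustrate the idea outside NiFi, here is a minimal Python sketch of duplicate filtering by content hash. It is only an approximation of the concept, not how DetectDuplicate is implemented: the function name `route_files` and the sample file names are hypothetical, and the in-memory `seen` set stands in for the distributed cache service that DetectDuplicate is configured with so that all nodes share the same view.

```python
import hashlib


def route_files(files):
    """Split (name, content) pairs into non-duplicates and duplicates.

    Hypothetical helper: the first file seen with a given content hash is
    treated as the original, any later file with the same hash as a duplicate.
    """
    seen = set()  # stands in for a cache shared by all nodes
    non_duplicate, duplicate = [], []
    for name, content in files:
        key = hashlib.sha256(content).hexdigest()
        if key in seen:
            duplicate.append(name)
        else:
            seen.add(key)
            non_duplicate.append(name)
    return non_duplicate, duplicate


# Two nodes fetched the same remote file (sample data, assumed for illustration).
fetched = [
    ("node1/data.csv", b"a,b\n1,2\n"),
    ("node2/data.csv", b"a,b\n1,2\n"),
]
unique, dupes = route_files(fetched)
```

Only the names in `unique` would go on to HDFS. In the NiFi flow the equivalent is connecting just the non-duplicate relationship to the HDFS processor.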

 

That said, the GetSFTP processor should work fine when it is configured to run on the primary node only. What errors were you facing back then?

Contributor

Thank you for the reply, @Kezia. I was able to filter the duplicates using the DetectDuplicate processor.

This is the error I was getting when the GetSFTP processor was scheduled on the primary node:

GetFTP[id=xxxx] Unable to fetch listing from remote server due to java.net.ConnectException: Connection timed out (Connection timed out): Connection timed out (Connection timed out)