Filter duplicate files in NiFi
- Labels: Apache NiFi, HDFS
I have a GetSFTP processor that runs on 3 nodes. GetSFTP was previously scheduled on the primary node only, but because that node was not working properly I had to change the schedule to run on "all nodes", and as a result I am receiving duplicate files.
Could you please let me know how to filter these two files (both are the same file) so that only one is loaded into HDFS? In other words, I need to put just one file out of each duplicate pair into the data lake.
Thank you
Created 11-13-2020 08:49 AM
Hello,
You can use the DetectDuplicate processor and route only the non-duplicate relationship to HDFS.
That said, GetSFTP should work fine when configured to run on the primary node only. What errors were you seeing back then?
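For reference, here is a minimal Python sketch of the technique DetectDuplicate implements: derive a stable key for each incoming file and check it against a shared cache, emitting each file to either the duplicate or non-duplicate path. In NiFi the cache is a distributed map cache controller service shared by all nodes and the key comes from the Cache Entry Identifier property; the names below (`dedupe_key`, `seen_cache`, `AGE_OFF_SECONDS`) are illustrative only, not NiFi APIs.

```python
import hashlib
import time

# In NiFi this role is played by a DistributedMapCacheServer shared by all
# nodes; a plain dict like this only works within a single process.
seen_cache: dict[str, float] = {}
AGE_OFF_SECONDS = 24 * 60 * 60  # analogous to DetectDuplicate's Age Off Duration

def dedupe_key(filename: str, content: bytes) -> str:
    """Derive a cache key for a file. DetectDuplicate builds its key from the
    Cache Entry Identifier property (often ${filename} or a content hash)."""
    return hashlib.sha256(filename.encode() + content).hexdigest()

def is_duplicate(filename: str, content: bytes) -> bool:
    """Return True for the 'duplicate' path, False for 'non-duplicate'."""
    now = time.time()
    # Expire old entries so re-delivered files eventually pass through again.
    for key, seen_at in list(seen_cache.items()):
        if now - seen_at > AGE_OFF_SECONDS:
            del seen_cache[key]
    key = dedupe_key(filename, content)
    if key in seen_cache:
        return True
    seen_cache[key] = now
    return False
```

In the flow itself you would typically place DetectDuplicate between GetSFTP and the HDFS put processor and connect only the non-duplicate relationship onward, so each file lands in the data lake exactly once.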
Created 11-18-2020 02:44 AM
Thank you for the reply @Kezia. I was able to filter the duplicates using the DetectDuplicate processor.
This is the error I was getting when the GetSFTP processor was scheduled on the primary node:
GetFTP[id=xxxx] Unable to fetch listing from remote server due to java.net.ConnectException: Connection timed out (Connection timed out): Connection timed out (Connection timed out)
