NiFi - fetchHDFS without ListHDFS
- Labels: Apache NiFi
Created on ‎10-15-2018 08:56 AM - edited ‎08-17-2019 08:55 PM
Hello, I need to use the FetchHDFS processor in the middle of a flow. I use UpdateAttribute before FetchHDFS to create the path and filename variables, but my problem is that I need to fetch all the files in a folder and I don't know how to do it.
Created on ‎10-15-2018 12:16 PM - edited ‎08-17-2019 08:55 PM
Use the GetHDFSFileInfo processor and set its Full path property to `<directory>`. The processor is stateless, so every run lists all the files in that directory.
GetHDFSFileInfo configs:
Here we list all files in the /tmp directory recursively, with Destination set to Attributes, so each flowfile carries everything the processor writes as flowfile attributes.
This processor was added in NiFi 1.7. If you are using an earlier version of NiFi, you need to run a script that lists all the files in the directory, then extract the path and use the extracted attribute in the FetchHDFS processor.
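For the pre-1.7 script route, here is a minimal sketch of the "list, then extract the path" step. It assumes the listing comes from `hdfs dfs -ls -R <directory>` and that its output has the usual eight whitespace-separated columns. The function name `parse_hdfs_ls` and the sample listing are hypothetical, for illustration only.

```python
import posixpath


def parse_hdfs_ls(output):
    """Turn `hdfs dfs -ls -R` text into path/filename pairs for regular files."""
    files = []
    for line in output.splitlines():
        # Columns: permissions, replication, owner, group, size, date, time, path.
        # maxsplit=7 keeps a path that contains spaces in one piece.
        parts = line.split(None, 7)
        # Skip summary lines ("Found N items") and directories ("d...").
        if len(parts) < 8 or not parts[0].startswith("-"):
            continue
        full_path = parts[7]
        files.append({
            "path": posixpath.dirname(full_path),
            "filename": posixpath.basename(full_path),
        })
    return files


# Hypothetical listing, shaped like typical `hdfs dfs -ls -R /tmp` output.
SAMPLE = """\
drwxr-xr-x   - hdfs supergroup          0 2018-10-15 08:56 /tmp/in
-rw-r--r--   3 hdfs supergroup       1366 2018-10-15 08:56 /tmp/in/a.csv
-rw-r--r--   3 hdfs supergroup        912 2018-10-15 08:57 /tmp/in/b.csv
"""

for entry in parse_hdfs_ls(SAMPLE):
    print(entry["path"], entry["filename"])
```

Each pair could then be set as the `path` and `filename` flowfile attributes, which is what the default HDFS Filename value of FetchHDFS, `${path}/${filename}`, reads.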
Created ‎10-15-2018 01:00 PM
Please check my updated answer. You don't need to run any command in the processor: it is designed so that you only configure the directory, and the processor itself runs all the commands.
Created ‎10-16-2018 12:48 PM
Could you check the scheduling of the GetHDFSFileInfo processor? By default its Run Schedule is 0 sec (always running), and I think that is what is causing the 10,000 flowfiles.
GetHDFSFileInfo doesn't store state, so it lists all the files in the directory on every run.
Change the Run Schedule to something like 1 hr. The processor will then run once per hour, and each run will produce only as many flowfiles as there are files in the directory.
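To see why a stateless listing on a 0 sec schedule inflates the count, here is a small simulation. It is not NiFi code. The function `stateless_listing` and the file names are made up, and the 46-file folder is taken from the question.

```python
def stateless_listing(files, runs):
    """Emit every file on every run, as a listing with no stored state would."""
    emitted = []
    for _ in range(runs):
        emitted.extend(files)
    return emitted


# Hypothetical folder with 46 files, as in the question.
files = [f"/tmp/file_{i}.csv" for i in range(46)]

print(len(stateless_listing(files, 1)))    # a single scheduled run
print(len(stateless_listing(files, 200)))  # 200 back-to-back runs
```

A single run yields 46 entries. Two hundred back-to-back runs already yield 9,200, so a longer Run Schedule keeps the number of flowfiles in line with the number of files.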
Created ‎10-15-2018 12:39 PM
I don't know how to use this command in the GetHDFSFileInfo processor, or what to do afterwards in the FetchHDFS processor. Sorry.
Created on ‎10-16-2018 07:04 AM - edited ‎08-17-2019 08:55 PM
The problem now is that when I use this processor, the number of flowfiles grows to 10,000, but the folder contains only 46 files. I don't know whether this is a problem or not.
How should I configure after the FetchHDFS processor?
Thank you
Created ‎02-09-2023 08:16 PM
If the processor generates more flowfiles than there are files, change the processor configuration: set Group Results from None to All.
Created ‎02-13-2023 06:45 AM
@jricogar
Why not use the ListHDFS processor?
It retains state so that the same HDFS files do not get listed multiple times.
Just trying to understand your use case for using FetchHDFS without ListHDFS processor.
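As a rough illustration of what "retains state" means here, the sketch below remembers the newest modification time it has emitted and skips anything not newer. This is a deliberate simplification and not the actual ListHDFS logic, which handles more cases. The class `StatefulLister` and the sample entries are hypothetical.

```python
class StatefulLister:
    """Toy listing that remembers what it has already emitted."""

    def __init__(self):
        # Newest modification time emitted so far (assumed state, kept in memory).
        self.last_seen = -1

    def list_new(self, entries):
        """entries: iterable of (path, modification_time) tuples."""
        fresh = [(p, t) for p, t in entries if t > self.last_seen]
        if fresh:
            self.last_seen = max(t for _, t in fresh)
        return [p for p, _ in fresh]


lister = StatefulLister()
folder = [("/tmp/a.csv", 100), ("/tmp/b.csv", 200)]

print(lister.list_new(folder))  # first run: both files
print(lister.list_new(folder))  # second run: nothing new
folder.append(("/tmp/c.csv", 300))
print(lister.list_new(folder))  # third run: only the new file
```

With state kept between runs, repeated scheduling does not re-emit the same files, which is the behavior a stateless listing lacks.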
Thanks,
Matt
