NiFi - fetchHDFS without ListHDFS

Explorer

Hello, I need to use the FetchHDFS processor in the middle of a flow. I first use UpdateAttribute before FetchHDFS to set the path and filename attributes, but my problem is that I need to fetch all the files in a folder and I don't know how to do it.

92827-updateattribute.jpg

7 REPLIES

Master Guru

@Pepelu Rico

Use the GetHDFSFileInfo processor and set its Full path property to `<directory>`. This processor is stateless, so every run lists all the files in the directory.

GetHDFSFileInfo Configs:

92839-info.png

Here we list all files in the /tmp directory recursively, with Destination set to Attributes, so each flowfile carries all of the processor's write attributes as flowfile attributes.

92840-wa.png
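To connect the listing to FetchHDFS, the idea is to build the file name from the attributes written by GetHDFSFileInfo. The sketch below only models that logic in Python; it is not NiFi code. It assumes the attribute names `hdfs.path`, `hdfs.objectName` and `hdfs.type` from the processor documentation, so please check them against the write attributes shown in your own version. In the flow itself this would roughly correspond to setting the HDFS Filename property of FetchHDFS to `${hdfs.path}/${hdfs.objectName}` and routing only flowfiles whose `hdfs.type` is `file`.

```python
def fetch_targets(listed):
    """Build full HDFS paths for FetchHDFS from listing attributes.

    `listed` is an iterable of dicts, one per flowfile, holding the
    attributes assumed to be written by GetHDFSFileInfo.
    """
    targets = []
    for attrs in listed:
        # A recursive listing also reports directories; only files can be fetched.
        if attrs.get("hdfs.type") != "file":
            continue
        parent = attrs["hdfs.path"].rstrip("/")
        targets.append(f"{parent}/{attrs['hdfs.objectName']}")
    return targets


# Hypothetical attribute values, for illustration only.
example = [
    {"hdfs.type": "directory", "hdfs.path": "/", "hdfs.objectName": "tmp"},
    {"hdfs.type": "file", "hdfs.path": "/tmp", "hdfs.objectName": "a.csv"},
]
print(fetch_targets(example))  # ['/tmp/a.csv']
```

The directory entry is skipped, which is why a filter on the object type before FetchHDFS is probably worth adding.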

This processor was added in NiFi 1.7. If you are using an earlier version of NiFi, you need to run a script that lists all the files in the directory, then extract the path and use the extracted attribute in the FetchHDFS processor.
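For the pre-1.7 route, a minimal sketch of such a listing script could look like the following. It assumes the `hdfs` command-line client is on the PATH of the NiFi host and parses the output of `hdfs dfs -ls -R`; the function names are made up for this example. How you call it from NiFi (for instance ExecuteStreamCommand followed by SplitText and ExtractText) is also an assumption about your flow, not something this script depends on.

```python
import subprocess
import sys


def parse_ls_output(text):
    """Return the full paths of regular files from `hdfs dfs -ls -R` output."""
    paths = []
    for line in text.splitlines():
        # Expected columns: permissions, replication, owner, group, size, date, time, path.
        # maxsplit=7 keeps paths that contain spaces in one piece.
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # skips headers such as "Found N items" and blank lines
        permissions, path = fields[0], fields[7]
        if permissions.startswith("-"):  # "-" marks a regular file, "d" a directory
            paths.append(path)
    return paths


def list_hdfs_files(directory):
    """Run the HDFS shell and return every file path under `directory`."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", directory],
        capture_output=True, text=True, check=True,
    )
    return parse_ls_output(result.stdout)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print one path per line so the flow can split the output into flowfiles.
    for file_path in list_hdfs_files(sys.argv[1]):
        print(file_path)
```

Each printed line is a full path, so after splitting the output into one flowfile per line you can extract it into an attribute and reference that attribute in FetchHDFS.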

Master Guru

@Pepelu Rico

Please check my updated answer. You don't need to run any command in the processor: it is designed so that you only configure the directory, and the processor itself runs the listing.

Master Guru
@Pepelu Rico

Could you check the scheduling of the GetHDFSFileInfo processor? By default this processor is scheduled to run every 0 sec (always running), and I think that is what is causing the 10,000 flowfiles.

The GetHDFSFileInfo processor doesn't store state, so it lists all the files in the directory every time it runs.

Change the Run Schedule to something like 1 hr. The processor will then run once per hour, and each run will produce only as many flowfiles as there are files in the directory.
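To make the effect of the missing state concrete, here is a small Python model. It is not how either processor is implemented; the class names are invented, and the stateful variant simply remembers file names, which is a simplification of what a state-keeping lister does. It only illustrates why repeated runs of a stateless listing multiply the flowfile count.

```python
class StatelessLister:
    """Emits every file on every run, like a listing that keeps no state."""

    def run(self, directory_files):
        return list(directory_files)


class StatefulLister:
    """Emits each file once by remembering what it has already listed."""

    def __init__(self):
        self._seen = set()

    def run(self, directory_files):
        new_files = [f for f in directory_files if f not in self._seen]
        self._seen.update(new_files)
        return new_files


def queued_after(lister, directory_files, runs):
    """Total number of flowfiles emitted after `runs` scheduled executions."""
    queued = 0
    for _ in range(runs):
        queued += len(lister.run(directory_files))
    return queued


files = [f"/tmp/file_{i}.csv" for i in range(46)]
print(queued_after(StatelessLister(), files, 100))  # 4600
print(queued_after(StatefulLister(), files, 100))   # 46
```

With 46 files, 100 runs of the stateless listing already produce 4,600 flowfiles, so a processor that runs continuously reaches a very large count quickly.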

Explorer

I don't know how to use this command in the GetHDFSFileInfo processor, and then in the FetchHDFS processor after it. Sorry.

Explorer

@Shu

The problem now is that when I try to use this processor, the number of flowfiles increases up to 10,000, but there are only 46 files in the folder. I don't know whether this is a problem or not.

How should I configure the FetchHDFS processor after it?

92880-gethdfsfileinfo.jpg

Thank you

Contributor

If the processor generates more flowfiles than there are files in the directory, change the processor configuration: set Group Results from None to All.

Master Mentor

@jricogar 

Why not use the ListHDFS processor?
It retains state, so the same HDFS files do not get listed multiple times.

Just trying to understand your use case for using FetchHDFS without ListHDFS processor.

Thanks,
Matt