Created on 05-29-2018 03:57 PM - edited 08-17-2019 08:46 PM
Hi everyone,
For the last few days I have been facing a problem with a NiFi flow that uses the HDFS List and Fetch processors.
The queue between them shows more than one million flow files and a total size of 0 MB.
This is very confusing. When I list the queue I can see the individual flow files, and clicking the info button on one of them shows the file size, but the file seems to be empty. Back pressure is set to 100K, so I cannot understand how the queue can hold that many files.
I tried restarting NiFi and dropping the files, but the problem comes back.
I have attached a screenshot of part of the flow. Any ideas would be appreciated.
Best regards,
Paul
Created 05-29-2018 04:09 PM
ListHDFS emits empty (0-byte) flow files that have attributes (such as filename and path, see the doc for details) set on them. In this case FetchHDFS is running much more slowly than ListHDFS (it takes longer to retrieve a file than to list that it's there), which is why the queue backs up. Also, setting a maximum size (Back Pressure Data Size Threshold) as the backpressure trigger won't work here, since these are 0-byte files. Try setting a maximum number of objects (Back Pressure Object Threshold) for backpressure instead.
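To illustrate why a size-based trigger never fires here, below is a rough Python sketch of a queue of listing-style flow files. It is a toy model, not NiFi code: the `FlowFile` and `Connection` classes, the `/data/in` path, the file names and the threshold values are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    # Hypothetical stand-in for a flow file: attributes plus content.
    attributes: dict
    content: bytes = b""  # listing entries carry attributes only, no payload


@dataclass
class Connection:
    # Hypothetical stand-in for the queue between the List and Fetch processors.
    object_threshold: int      # plays the role of "Back Pressure Object Threshold"
    size_threshold_bytes: int  # plays the role of "Back Pressure Data Size Threshold"
    queue: list = field(default_factory=list)

    def queued_bytes(self) -> int:
        # Size counts content only; attributes do not contribute.
        return sum(len(ff.content) for ff in self.queue)

    def size_backpressure(self) -> bool:
        return self.queued_bytes() >= self.size_threshold_bytes

    def count_backpressure(self) -> bool:
        return len(self.queue) >= self.object_threshold


def list_files(n: int) -> list:
    # Emit one empty flow file per listed file, with path and filename attributes.
    return [
        FlowFile({"path": "/data/in", "filename": f"part-{i:05d}"})
        for i in range(n)
    ]


conn = Connection(object_threshold=10_000, size_threshold_bytes=1024**3)
conn.queue.extend(list_files(50_000))

print("queued flow files:", len(conn.queue))
print("queued bytes:", conn.queued_bytes())
print("size-based backpressure:", conn.size_backpressure())
print("count-based backpressure:", conn.count_backpressure())
```

However many empty flow files are queued, the total content size stays at zero, so only the object-count check can ever report backpressure in this model.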
Created 05-29-2018 04:33 PM
Just to add to the above correct response...
The backpressure threshold settings for both size and number of FlowFiles are soft limits. When a processor is eligible to execute/run, it will run that thread to completion. The ListHDFS processor, for example, will list all files newer than the state recorded by its last execution/run. Even if "Back Pressure Object Threshold" is set to 10,000, that will not stop the ListHDFS processor from listing 1,000,000 FlowFiles in a single execution. Once those 1,000,000 FlowFiles are placed on the connection, back pressure starts being applied. The ListHDFS processor will not be eligible to execute/run again until the queue drops back below the threshold setting of 10,000.
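The soft-limit behaviour described above can be sketched in a few lines of Python. This is only a simplified model under my own assumptions, not how NiFi is implemented: `run_list_processor` is a hypothetical function, the queue is a plain list, and the threshold is checked only before a run starts.

```python
def run_list_processor(queue: list, listing: list, object_threshold: int) -> int:
    """One scheduling attempt of a hypothetical listing processor.

    Backpressure is only checked before the run starts. Once the run
    starts, the whole listing is emitted, even if it exceeds the threshold.
    Returns the number of flow files emitted.
    """
    if len(queue) >= object_threshold:
        return 0  # backpressure applied: the processor is not eligible to run
    queue.extend(listing)  # the run goes to completion
    return len(listing)


threshold = 10_000
queue = []

# First run: the queue is empty, so the processor runs and emits everything.
first = run_list_processor(queue, [f"file-{i}" for i in range(1_000_000)], threshold)
after_first = len(queue)

# Second attempt: the queue is far above the threshold, so nothing is emitted.
second = run_list_processor(queue, ["new-file"], threshold)

# The downstream processor drains the queue to just below the threshold.
del queue[: len(queue) - (threshold - 1)]

# Third attempt: the processor is eligible again.
third = run_list_processor(queue, ["new-file"], threshold)

print(first, after_first, second, third, len(queue))
```

In this model the first run overshoots the threshold by a wide margin, and the processor only becomes eligible again after the queue has been drained below 10,000.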
-
"Back Pressure Data Size Threshold" works in a similar manner. Size in NiFi is always a measure of the size of the content associated with a FlowFile, not the actual size of the FlowFile itself.
-
Thanks,
Matt
Created 05-29-2018 04:18 PM
But ListHDFS keeps state and is only supposed to pull the changed files, right?
@Paul Hernandez what are the properties of your ListHDFS?
Created 05-29-2018 09:09 PM
Hi guys,
thanks so much for the fast support and thanks to the Matts Team @Matt Burgess and @Matt Clarke
I finally understood how the processor works. It emits a flow file with no payload, and the file details, such as path and filename, are carried in the attributes. FetchHDFS then uses those attributes to fetch the corresponding files.
Kind regards,
Paul