07-18-2017 06:53 PM
@ismail patel The FlowFiles are queued because the processor is only running on the primary node. GetFile is not the best way to bring data into your flow in a cluster; it is better to use a ListFile processor followed by a FetchFile processor. Configure that flow to run on the primary node only, or have two flow paths: one that runs on the primary node only and a second that runs on all nodes.
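The list-then-fetch pattern above can be sketched conceptually. This is not the NiFi API; all names here (`list_step`, `distribute`, `fetch_step`, `NODES`) are illustrative. The point is that the primary node emits only lightweight listings, and a load-balanced connection spreads them so every node fetches its share of the actual content:

```python
NODES = ["node1", "node2", "node3"]  # hypothetical 3-node cluster

def list_step(paths):
    # Primary node only: emit zero-byte "listing" FlowFiles (just paths),
    # never the file contents.
    return [{"path": p, "content": None} for p in sorted(paths)]

def distribute(flowfiles, nodes):
    # Load-balanced connection: round-robin the listings across all nodes.
    assignments = {n: [] for n in nodes}
    for i, ff in enumerate(flowfiles):
        assignments[nodes[i % len(nodes)]].append(ff)
    return assignments

def fetch_step(ff):
    # Runs on every node: only now is the file content actually read.
    ff["content"] = f"<bytes of {ff['path']}>"
    return ff

listings = list_step(["c.csv", "a.csv", "b.csv", "d.csv"])
work = distribute(listings, NODES)
```

With GetFile on the primary node alone, both the listing and the full content reads would happen on one node; here the heavy fetch work is shared.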
05-08-2017 12:07 PM
@ismail patel Back pressure thresholds are soft limits, and some processors do batch processing. The ListHDFS processor produces a listing of files from HDFS and creates a single zero-byte FlowFile for each file in that listing. It then commits all of those FlowFiles to its success relationship at once. So even if the back pressure threshold on the connection were set to 5, ListHDFS would still dump the entire listing onto it (even if the listing consisted of thousands of files). At that point back pressure would be applied and prevent ListHDFS from running again until the queue dropped back below 5, but that is not the behavior you need here.

The RouteOnAttribute processor is one of the processors that works on one FlowFile at a time, which lets it adhere much more strictly to the back pressure setting of 5 on its unmatched relationship. The fact that I used RouteOnAttribute is not important; any processor that works on FlowFiles one at a time would do. I picked RouteOnAttribute because it operates on FlowFile attributes, which live in heap memory, making processing here very fast.

Thanks, Matt
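A toy simulation (not NiFi internals) of the difference described above: a batch-committing processor transfers its whole result set in one session, so the queue can overshoot a soft threshold, while a one-FlowFile-at-a-time processor checks back pressure before each transfer and stops at the limit. Both function names are hypothetical.

```python
THRESHOLD = 5  # back pressure object threshold on the connection

def batch_processor(queue, listing):
    # Back pressure is only checked before the processor is scheduled;
    # once running, the entire listing is committed in one session.
    if len(queue) < THRESHOLD:
        queue.extend(listing)

def per_flowfile_processor(queue, inbound):
    # Transfers one FlowFile per invocation, so it honors the threshold.
    while inbound and len(queue) < THRESHOLD:
        queue.append(inbound.pop(0))

listed = [f"file-{i}" for i in range(1000)]
q_batch, q_single = [], []
batch_processor(q_batch, listed)                  # overshoots the soft limit
per_flowfile_processor(q_single, listed.copy())   # stops at the threshold
```

Running this, `q_batch` ends up holding all 1000 FlowFiles despite the threshold of 5, while `q_single` holds exactly 5, mirroring ListHDFS versus RouteOnAttribute.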