Member since: 07-30-2019
Posts: 3248
Kudos Received: 1589
Solutions: 951
09-21-2016
03:39 PM
If the node that goes down happens to be the "primary node", the cluster will automatically elect a new "primary node" from the remaining available nodes and start those "primary node only" processors.
09-21-2016
11:51 AM
1 Kudo
@Gerd Koenig After a closer look at the jaas file you posted above, I believe your issue is caused by a missing " in the following line: principal="nifi@REALM; This line should actually be: principal="nifi@REALM";
Try making the above change and restarting your NiFi. Thanks, Matt
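For reference, a complete KafkaClient entry typically looks something like the sketch below (the keytab path here is a placeholder, not taken from the original post). Note that every quoted value is closed and the final option ends with a semicolon:

KafkaClient {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    storeKey=true
    keyTab="/etc/security/keytabs/nifi.keytab"
    principal="nifi@REALM";
};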
09-20-2016
01:45 PM
@Gerd Koenig Can you try changing the value you have for "Message Delimiter" from the literal string "\n" to an actual new line in your PutKafka processor? You can add a new line in the property value by holding the Shift key while pressing Enter. Thanks, Matt
09-20-2016
12:35 PM
2 Kudos
@Gerd Koenig The question here is: are you running Apache NiFi 0.6 or HDF 1.2? I believe you are using Apache NiFi 0.6, which does not understand PLAINTEXTSASL as the security protocol.

The Kafka 0.8 in HDP 2.3.2 and the Kafka 0.9 in HDP 2.3.4 use a custom Hortonworks Kafka client library. Kafka 0.8 in HDP 2.3.2 introduced support for Kerberos before it was supported in the community, and that support introduced the PLAINTEXTSASL security protocol. Later, when Apache Kafka 0.9 added Kerberos support, it used a different security protocol (SASL_PLAINTEXT). In order for HDF 1.2 to work with HDP 2.3.2, the GetKafka processor was modified from the Apache GetKafka to use that modified client library. Hortonworks again modified the client library in HDP 2.3.4 for Kafka 0.9 so that it remained backwards compatible and still supported the PLAINTEXTSASL security protocol.

So the bottom line here is that HDF 1.2 NiFi can talk Kerberos to both HDP 2.3.2 (Kafka 0.8) and HDP 2.3.4 (Kafka 0.9), but Apache NiFi cannot. The consume and publish Kafka processors available in NiFi 0.7, NiFi 1.0, and HDF 2.0 do not use the Hortonworks custom Kafka client library and can be used with Kafka 0.9, but not Kafka 0.8. You will need to use the SASL_PLAINTEXT security protocol with these new processors. Thanks, Matt
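As a rough sketch of how the newer processors are configured (the broker address and topic below are placeholders, and the java.arg number in bootstrap.conf is arbitrary; pick one that is unused):

ConsumeKafka processor properties:
  Kafka Brokers            broker1.example.com:6667
  Security Protocol        SASL_PLAINTEXT
  Kerberos Service Name    kafka
  Topic Name(s)            my-topic

conf/bootstrap.conf (points the NiFi JVM at your JAAS login config):
  java.arg.15=-Djava.security.auth.login.config=/path/to/kafka-client-jaas.conf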
09-16-2016
07:53 PM
2 Kudos
@David Morris The NiFi Expression Language can be used to route your data based on file extensions as you have described. When NiFi ingests data, a NiFi FlowFile is created. That FlowFile is a combination of the original content and metadata about that content. Upon ingest, some metadata is created for every FlowFile. One of those attributes is named "filename" and contains the original filename of the ingested file.
The RouteOnAttribute processor can use the NiFi Expression Language to evaluate the FlowFile's "filename" attribute for routing purposes:
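The original answer included a screenshot; as an illustrative sketch, the dynamic properties added to RouteOnAttribute might look like this (the property names and extensions are examples, using the Expression Language endsWith function):

csv    ${filename:endsWith('.csv')}
txt    ${filename:endsWith('.txt')}
xml    ${filename:endsWith('.xml')}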
In the RouteOnAttribute processor you would need to add a new property for each file extension type you want to look for. Each one of those newly added properties becomes a new relationship for that processor, which can then be routed to follow-on processors. Thanks, Matt
09-15-2016
02:28 PM
@Saikrishna Tarapareddy The purpose of using a RAID is to protect against the loss of a disk. If the intent here is to protect against a complete catastrophic loss of the system, there are some things you can do. Keeping a backup of the conf directory will allow you to quickly restore the state of your NiFi's dataflow. Restoring the state of your dataflow does not restore any data that may have been active in the system at the time of failure. The NiFi repos contain the following information:

Database repository --> Contains change history to the graph (keeps a record of all changes made on the canvas). If NiFi is secured, this repo also contains the users db. Loss of either of these has little impact. Loss of configuration history will not impact your dataflow or data, and the users db is rebuilt from the authorized-users.xml file (located in the conf dir by default) upon NiFi start.

Provenance repository(s) --> Contains NiFi FlowFile lineage history. Loss of this repo will not affect your dataflow or data. You will simply be unable to perform queries against data that traversed the system prior to the loss.

FlowFile repository --> Loss of this repo will result in loss of data. The FlowFile repo keeps all attributes about content currently in the dataflow, including where to find the actual content in the content repository(s). The information in this repo changes rapidly, so backing it up is not really feasible. RAID offers your best protection here.

Content repository(s) --> Loss of this repo will also result in loss of data and archived data (if configured to archive). The content repository(s) contain the actual content of the data NiFi processes. The data in this repo also changes rapidly as files are processed through the NiFi dataflow(s), so backing up this repo(s) is also not feasible. RAID offers your best protection here as well.

As you can see, recovery from disk failure is possible with RAID; however, a catastrophic loss of the entire system will result in loss of the data that was in mid-processing by any of the dataflows. Your repos could be external attached storage. (There is likely to be some performance impact because of this; however, in the event of catastrophic server loss, a new server could be stood up using the backed-up conf dir and attached to the same external storage. This would help prevent data loss and allow processing to pick up where it left off.) Thanks, Matt
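For reference, all of these repository locations are set in nifi.properties. A minimal sketch using the default relative paths (in practice you would point these at your RAID-backed mounts):

nifi.database.directory=./database_repository
nifi.flowfile.repository.directory=./flowfile_repository
nifi.content.repository.directory.default=./content_repository
nifi.provenance.repository.directory.default=./provenance_repository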
09-14-2016
09:41 PM
If you later decide to add new disks, you can simply copy your content repositories to those new disks and update the nifi.properties file repo config lines to point at the new locations.
09-14-2016
08:46 PM
1 Kudo
@Saikrishna Tarapareddy The section you are referring to is an example setup for a single server:

CPU: 24 - 48 cores
Memory: 64 - 128 GB
Hard drive configuration: 1 hardware RAID 1 array and 2 or more hardware RAID 10 arrays

** What falls between each "--------------------" line below is on a single mounted RAID/disk. A RAID can be broken up into multiple logical volumes if desired; if it is, each / here represents a different logical volume. By creating logical volumes you can control how much disk space is reserved for each, which is recommended. For example, you would not want excessive logging to eat up space you want reserved for your flowfile-repo. Logical volumes let you control that by splitting a single RAID into multiple logical volumes of a defined size.

--------------------
RAID 1 array (this could also be a RAID 10) containing all of the following directories/logical volumes:
- /
- /boot
- /home
- /var
- /var/log/nifi-logs   <-- point all your NiFi logs (logback.xml) here
- /opt                 <-- install NiFi here under a sub-directory
- /database-repo       <-- point the NiFi database repository here
- /flowfile-repo       <-- point the NiFi flowfile repository here
--------------------
1st RAID 10 array, logical volume mounted as /cont-repo1:
- /cont-repo1          <-- point the NiFi content repository here
--------------------
2nd RAID 10 array, logical volume mounted as /prov-repo1:
- /prov-repo1          <-- point the NiFi provenance repository here
--------------------
3rd RAID 10 array (recommended), logical volume mounted as /cont-repo2:
- /cont-repo2          <-- point a 2nd NiFi content repository here
--------------------

In order to set up the above example you would need 14 hard disks: 2 for the RAID 1 array and 4 for each of the three RAID 10 arrays. * You would only need 10 disks if you decided to have only one RAID 10 content repo (but it would need to be 2 TB). You could also take a large RAID 10, like the one holding /prov-repo1, and split it into multiple logical volumes, giving part of that RAID's disk space to a content repo.

Not sure what you mean by "load 2TB of data for future project"? Are you saying you want NiFi to be able to handle a queue backlog of 2TB of data? If that is the case, each of your cont-repo RAID 10s would need to be at least 1TB in size.

*** While the nifi.properties file ships with a single line for the content and provenance repo paths, multiple repos can be added with additional lines as follows:

nifi.content.repository.directory.default=/cont-repo1/content_repository
nifi.content.repository.directory.cont-repo2=/cont-repo2/content_repository
nifi.content.repository.directory.cont-repo3=/cont-repo3/content_repository
etc...
nifi.provenance.repository.directory.default=./provenance_repository
nifi.provenance.repository.directory.prov-repo1=/prov-repo1/provenance_repository
nifi.provenance.repository.directory.prov-repo2=/prov-repo2/provenance_repository
etc....

When more than one repo is defined in the nifi.properties file, NiFi will perform file-based striping across them. This allows NiFi to spread the I/O across multiple disks, helping improve overall performance. Thanks, Matt
09-14-2016
12:55 PM
5 Kudos
@Pravin Battula As a third option you could also build a flow to create the delay you are looking for. This can be done using the UpdateAttribute and RouteOnAttribute processors. Here is an example that causes a 5 minute delay to all FlowFiles that pass through these two processors (see the sketch below). The value returned by the now() function is the current epoch time in milliseconds. To add 5 minutes, we add 300,000 milliseconds to the current time and store that as a new attribute on the FlowFile. We then check that new attribute against the current time in the RouteOnAttribute processor. If the current time is not greater than the delayed time, the FlowFile is routed to unmatched. So here the FlowFile would be stuck in this loop for ~5 minutes. You can adjust the run schedule on the RouteOnAttribute processor to the desired interval at which you want to re-check the file(s). 0 sec is the default, but I would recommend changing that to at least 1 sec. Thanks, Matt
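Since the original screenshot is not reproduced here, an illustrative sketch of the two processors' configuration (the attribute name "delayUntil" and the property name "delayElapsed" are examples):

UpdateAttribute - add one dynamic property:
  delayUntil      ${now():toNumber():plus(300000)}

RouteOnAttribute - set Routing Strategy to "Route to Property name" and add one dynamic property:
  delayElapsed    ${now():toNumber():gt(${delayUntil})}

Loop the unmatched relationship back to the RouteOnAttribute processor itself, and route the delayElapsed relationship on to the rest of the flow.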