About MattWho

MattWho · ‎04-17-2018

@Rahoul A Unfortunately, you can only have one client writing/appending to the same file in HDFS at a time. The nature of this append capability in HDFS does not mesh well with the NiFi architecture of concurrent parallel operations across multiple nodes. NiFi nodes each run their own copy of the dataflows and work on their own unique set of FlowFiles. While NiFi nodes do communicate health and status heartbeats to the elected cluster coordinator, dataflow-specific information, such as which node is currently appending to a specific filename in the target HDFS cluster, is not shared. From a performance design standpoint, it makes sense not to share it.

So, aside from the above work-around, which reduces the likelihood of conflict, you can also:
1. After whatever preprocessing you perform on the data in NiFi before pushing to HDFS, route all data to a dedicated node in your cluster (with a failover node; think PostHTTP with failure feeding another PostHTTP) for the final step of appending to your target HDFS.
2. Install an edge standalone instance of NiFi that simply receives the processed data from your NiFi cluster and writes/appends it to HDFS.

Thanks,
Matt

MattWho · ‎04-17-2018

@sri chaturvedi While the above doc is intended to set you on the right path in terms of deploying a well-implemented NiFi setup, it will not help with your dataflow design implementation or hardware limitations. You have to monitor your systems while your dataflow is running for things like:
1. CPU utilization (If CPU utilization is always low, consider increasing the "Max Timer Driven Thread" pool allocated for your NiFi dataflow components. You might also add an extra concurrent task here and there in your flow where there are bottlenecks, or experiment with processor run duration.)
2. Disk performance (specifically the crucial NiFi repo and log disks)
3. Memory performance (Monitor garbage collection. Are there lots of occurrences resulting in considerable stop-the-world impact on your dataflow? If so, you may need to look at your dataflow design for ways to reduce heap usage.)
4. Network performance

Thanks,
Matt

MattWho · ‎04-12-2018

Short Description: A NiFi connection is where FlowFiles are temporarily held between two connected NiFi Processor components. Each connection that contains queued NiFi FlowFiles will have a footprint in the JVM heap. This article will break down a connection to show how NiFi manages the FlowFiles that are queued in that connection and how that affects heap and performance.

Article: First let me share the 10,000 foot view, and then I will discuss each aspect of the following image:

***

NiFi FlowFiles consist of FlowFile Content and FlowFile Attributes/metadata. FlowFile content is never held in a connection's heap space. Only the FlowFile Attributes/metadata are placed in heap by a connection.

The "Connection Queue": The connection queue is where all FlowFiles queued in the connection are held. To understand how these queued FlowFiles affect performance and heap usage, let's start by focusing on the "Connection Queue" dissection at the bottom of the above image. The overall size of a connection is controlled by the "Back Pressure Object Threshold" and "Back Pressure Data Size Threshold" settings the user defines per connection.

Back Pressure Object Threshold and Back Pressure Data Size Threshold: The "Back Pressure Object Threshold" defaults to 10,000. The "Back Pressure Data Size Threshold" defaults to 1 GB. Both of these settings are soft limits, which means they can be exceeded. As an example, let's assume the default settings above and a connection that already contains 9,500 FlowFiles. Since the connection has not yet reached or exceeded the object threshold, the processor feeding that connection will be allowed to run. If that feeding processor produces 2,000 FlowFiles when it executes, the connection will grow to 11,500 queued FlowFiles. The feeding processor will then not be allowed to execute again until the queue drops below the configured threshold. The same holds true for the Data Size threshold. Data Size is based on the cumulative reported size of the content associated with each queued FlowFile.

Now that we know how the overall size of the "connection queue" is controlled, let's break it down into its parts:

1. ACTIVE queue: FlowFiles entering a connection are initially placed in the active queue. FlowFiles continue to be placed in this queue until it reaches the globally configured NiFi swap threshold. All FlowFiles in the active queue are held in heap memory. The processor consuming FlowFiles from this connection will always pull FlowFiles from the active queue. The size of the active queue per connection is controlled by the following property in the nifi.properties file: nifi.queue.swap.threshold=20000 Increasing the swap threshold increases the potential heap footprint of every single connection in your dataflow(s).

2. SWAP queue: Based on the above default setting, once a connection reaches 20,000 FlowFiles, new FlowFiles entering the connection are placed in the swap queue. The swap queue is also held in heap and is hard coded to a maximum of 10,000 FlowFiles. If space is freed in the active queue and no swap files exist, FlowFiles in the swap queue are moved directly to the active queue.

3. SWAP FILES: Each time the swap queue reaches 10,000 FlowFiles, a swap file containing those FlowFiles is written to disk. At that point new FlowFiles are again written to the swap queue. Many swap files can be created. Using the image above, where the connection contains 80,000 FlowFiles, there would be 30,000 FlowFiles in heap and 5 swap files. Each time the active queue frees 10,000 FlowFiles, the oldest swap file is moved to the active queue, until all swap files are gone. Because swap files must be written to and read from disk, having a lot of swap files produced across your dataflow will reduce the throughput of your dataflow(s).

4. IN-FLIGHT queue: Unlike the above three, the in-flight queue only exists when the processor consuming from this connection is running. The consuming processor pulls FlowFiles only from the active queue and places them in the in-flight queue until processing has successfully completed and those FlowFiles have been committed to an outbound connection from the consuming processor. This in-flight queue is also held in heap. Some processors work on one FlowFile at a time, others work on batches of FlowFiles, and some have the potential of working on every single FlowFile in an incoming connection queue. In the last case, this could mean high heap usage while those FlowFiles are being processed. The example above is one of those potential cases, using the MergeContent processor. The MergeContent processor places FlowFiles from the active queue in virtual bins. How many bins there are and what makes a bin eligible for merge is governed by the processor configuration. What is important to understand is that it is possible for every FlowFile in the connection to make its way into the "in-flight queue". In the image example, if the MergeContent processor were running, all 80,000 queued FlowFiles would likely be pulled into heap via the in-flight queue.

----

Take away from this article:

1. Control heap usage by limiting the size of the connection queue when possible. (Of course, if your intention is to merge 40,000 FlowFiles, there must be 40,000 FlowFiles in the incoming connection. However, you could have two MergeContent processors in series, each merging smaller bundles, with the same end result and less overall heap usage.)

2. With the default back pressure object threshold settings, no swap files will be produced on most connections (remember, these are soft limits), which results in better throughput.

3. The default configured swap threshold of 20,000 is a good balance of active queue size and performance in most cases. For smaller flows you may be able to push this higher, and for extremely large flows you may want to set it lower. Just understand that it is a trade-off of heap usage for performance. But if you run out of heap, there will be zero performance.

Thank you,
Matt
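The queue breakdown described in this article can be sketched as a small calculation. This is only an illustrative model, not NiFi code: the names `QueueBreakdown`, `breakdown`, and `feeder_may_run` are hypothetical, 1 GB is assumed to mean 1024**3 bytes, and the model assumes a full swap queue is written to a swap file only once another FlowFile arrives, an assumption chosen so that the 80,000 FlowFile example above yields 30,000 FlowFiles in heap and 5 swap files.

```python
from dataclasses import dataclass

SWAP_THRESHOLD = 20_000  # nifi.queue.swap.threshold default quoted in the article
SWAP_BATCH = 10_000      # swap queue size / FlowFiles per swap file quoted in the article


@dataclass
class QueueBreakdown:
    active: int      # FlowFiles in the active queue (heap)
    swap_queue: int  # FlowFiles in the swap queue (heap)
    swap_files: int  # swap files on disk, SWAP_BATCH FlowFiles each

    @property
    def in_heap(self) -> int:
        return self.active + self.swap_queue


def breakdown(queued: int,
              swap_threshold: int = SWAP_THRESHOLD,
              swap_batch: int = SWAP_BATCH) -> QueueBreakdown:
    """Estimate where the FlowFiles of a steadily filling connection are held."""
    active = min(queued, swap_threshold)
    overflow = queued - active
    # Assumption: a full swap queue stays in heap until one more FlowFile arrives.
    swap_files = (overflow - 1) // swap_batch if overflow > 0 else 0
    swap_queue = overflow - swap_files * swap_batch
    return QueueBreakdown(active, swap_queue, swap_files)


def feeder_may_run(queued_count: int, queued_bytes: int,
                   object_threshold: int = 10_000,
                   size_threshold_bytes: int = 1024 ** 3) -> bool:
    """Soft limit: the feeding processor runs while neither threshold is reached."""
    return queued_count < object_threshold and queued_bytes < size_threshold_bytes
```

With these assumptions, `breakdown(80_000)` gives 20,000 active, 10,000 in the swap queue, and 5 swap files, while `breakdown(11_500)` (the soft-limit example) stays entirely in the active queue with no swap files.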

MattWho · ‎04-10-2018

@sri chaturvedi Thank you for your feedback. Unfortunately, the "Disable" and "Enable" buttons are not available when multiple components are selected. I filed a Jira for such an improvement (https://issues.apache.org/jira/browse/NIFI-5066).

For now, when dealing with a flow with such a large number of stopped components, it may be easier to simply edit the NiFi flow.xml.gz file manually. Look for all entries containing the following string:
<scheduledState>STOPPED</scheduledState>
and replace it with:
<scheduledState>DISABLED</scheduledState>

My suggestion would be to:
1. Make a copy of the flow.xml.gz file.
2. Edit the copy as described above.
3. Stop your NiFi instance/cluster.
4. Switch out the original flow.xml.gz with the modified copy on all NiFi instances.
5. Make sure file ownership is correct and restart NiFi.

Thank you,
Matt
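The edit described above could be scripted roughly as follows. This is a minimal sketch, assuming the flow.xml.gz is a plain gzip-compressed XML file small enough to read into memory; the function name `disable_stopped_components` and the output filename used below are hypothetical. It only writes a modified copy and never touches the original, so the stop, swap, and restart steps still have to be done by hand.

```python
import gzip
from pathlib import Path

STOPPED = b"<scheduledState>STOPPED</scheduledState>"
DISABLED = b"<scheduledState>DISABLED</scheduledState>"


def disable_stopped_components(flow_path: Path, out_path: Path) -> int:
    """Write a copy of flow.xml.gz with every STOPPED entry set to DISABLED.

    Returns the number of entries that were changed.
    """
    # Read the decompressed XML from the original file (left untouched).
    with gzip.open(flow_path, "rb") as src:
        xml = src.read()

    changed = xml.count(STOPPED)

    # Write the modified XML to a new gzip file.
    with gzip.open(out_path, "wb") as dst:
        dst.write(xml.replace(STOPPED, DISABLED))

    return changed
```

For example, `disable_stopped_components(Path("flow.xml.gz"), Path("flow.modified.xml.gz"))` would produce the modified copy next to the original and report how many components were switched.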

MattWho · ‎04-09-2018

Short Description: This article covers how to improve the performance of the NiFi UI.

Article: Over time, NiFi users have been building very large dataflows consisting of many thousands of components (processors, reporting tasks, controller services, etc.). While NiFi does not limit the number of components that can be added to the canvas, the more components a user adds, the less responsive the UI becomes. This processor explosion not only affects the responsiveness of the UI, but can also lead to unexpected node disconnections.

---

What are the various states a component can have?

NiFi components have multiple states: stopped, started, enabled, and disabled. Beyond these states, a component has one of two statuses:

Valid: Component configuration was successfully validated. This means that all required properties have been configured and, in the case of processors, all required connections have been accounted for (connected to another component or terminated) and any referenced controller services have been enabled.

Invalid: Component configuration is not valid. This means that one or more required properties have not been configured and/or, in the case of processors, one or more connections have not been accounted for (connected to another component or terminated) and/or a referenced controller service has not been enabled.

---

Why does a processor's state affect UI performance?

All processor components are added to the canvas in the "stopped" state. A user can then either start or disable that component manually. All controller services and reporting tasks added by a user are disabled by default. The user can then enable these components as needed. NiFi must regularly validate these components to see whether they are valid or invalid. While the validation of a few hundred to a thousand components adds up to very little time, the same does not hold true for NiFi instances consisting of thousands upon thousands of components. Users may have noticed a swirling icon on the right-hand side of the NiFi status bar that never seems to go away. In a NiFi cluster, NiFi must retrieve the flow status from every node. It is possible for a component to be valid on one node but not another (for example, a processor that depends on a local file that does not exist on all nodes). If this validation takes too long, a node may be disconnected because the request took too long. In addition, the UI does not update until these validations have completed.

---

What has NiFi done to make improvements here?

The bad news: Prior to NiFi 1.1.0, nothing can be done to improve performance here other than reducing the number of components you are using. This is because, in versions of NiFi prior to 1.1.0, components were validated in all four states.

The good news: In NiFi 1.1.0, a change was made so that this validation only occurs on processors that are in the "stopped" state and on controller services or reporting tasks that are disabled. It is safe to assume that if a processor is running, it must be valid. It is also safe to assume that a controller service or a reporting task must be valid if it is enabled. Now that "started" processors and "enabled" controller services or reporting tasks are no longer being validated, UI performance will be much better.
https://issues.apache.org/jira/browse/NIFI-2996

---

What is important to understand here?

It has also been observed that users add lots of components to the canvas that are never started or are only started for short periods of time. If the number of "stopped" processors is very high, validation is still going to take a considerable amount of time, even in NiFi 1.1.0 or newer versions. A quick look at the NiFi status bar above your canvas will show how many stopped components you have on your canvas.

To make sure UI performance remains solid, it is important that users disable processors that are not in use on the canvas. You can use the "NiFi Summary" UI to find stopped/invalid processors and disable them. Select the "PROCESSORS" tab and sort on the "Run Status" column. Clicking the icon on the right-hand side of a row will take you directly to that processor. Once a component is selected on the canvas, it can be disabled or enabled via the "Operate" panel, or by right-clicking the processor and selecting Disable or Enable from the displayed context menu.

MattWho · ‎03-14-2018

@Eric Lloyd At 0 secs the processor is trying to run as fast as possible, so basically no break in processing. Just setting it to 2 or 3 seconds may help.

MattWho · ‎03-07-2018

@Eric Lloyd Not sure that would make a difference. You are well beyond that "initial start position" already. Each execution is working from the recorded state location. Duplication may still occur. I suggest adjusting your run schedule to something other than 0 secs. This not only helps to reduce resource consumption, it also introduces a small delay between each consumption of lines from the log files. This may help when primary node changes occur.

MattWho · ‎03-07-2018

@Eric Lloyd Avoiding duplication during restart, as described in the documentation, is a different scenario. During NiFi shutdown, processors are given a graceful shutdown period to complete their running tasks (20 seconds by default). If a thread still has not completed by then, it is killed. In the case where a thread is killed, no FlowFiles have been committed to the TailFile success connection and no update has been made to state. So on restart, no matter which node becomes the primary node, TailFile starts correctly from the last successfully recorded state position. Primary node changes do not result in the killing of any actively running tasks. A primary node change simply puts the processor in a stopping state so it will not execute another task once the current task completes.

Matt

MattWho · ‎03-07-2018

@Eric Lloyd I am assuming the file being tailed is mounted across all your NiFi nodes in the cluster? This would need to be the case so that no matter which node becomes the primary node, it could tail the exact same file. Assuming the above is true, I am also assuming the processor has been configured with "State Location" set to "Remote".

When TailFile executes, it begins tailing the target file. At the completion of each thread, state is recorded as to where that tail left off so the next thread can pick up where the previous one ended. If you are only storing state as "Local", then when the primary node switches, the new primary node will start tailing from the beginning of the file again.

That being said, there is still a chance for some duplication even when state is stored in the cluster ("Remote"). When the primary node changes, the original primary node is informed that it is no longer the primary node and a new node is elected as the primary node. The original node will complete its currently executing task but will not schedule any new tasks to run. The new primary node will start the "primary node" only processors. If the new primary node executes before the same processor on the old primary node updates cluster state, the new primary node will start tailing from the last recorded cluster state for that processor, resulting in some duplication.

NiFi favors duplication over data loss. We cannot assume that the original primary node did not simply die, so we have to accept the risk that the original primary node's processors may never update state.

Hope this confirms how your processor is set up and why NiFi works the way it does in this scenario.

Thanks,
Matt
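The race described above can be illustrated with a toy model. This is not NiFi code and uses no NiFi API: the function `simulate_primary_change` and its parameters are hypothetical, the "log" is just a list of lines, and the recorded cluster state is modeled as a single line offset.

```python
def simulate_primary_change(log, recorded_offset, old_node_updates_state_first):
    """Return every line emitted across a primary node change.

    log: list of lines in the tailed file
    recorded_offset: offset last stored in (cluster) state
    old_node_updates_state_first: whether the old primary node records its
        new offset before the new primary node runs its first task
    """
    emitted = []
    state = recorded_offset

    # The old primary node completes its currently executing task.
    emitted.extend(log[state:])
    if old_node_updates_state_first:
        state = len(log)

    # The new primary node starts from whatever offset is in state.
    emitted.extend(log[state:])
    return emitted


log = ["line1", "line2", "line3", "line4"]

# State updated in time: each new line is emitted once.
no_dupes = simulate_primary_change(log, 2, old_node_updates_state_first=True)

# New primary runs before state is updated: line3 and line4 are emitted twice.
dupes = simulate_primary_change(log, 2, old_node_updates_state_first=False)
```

In the second case nothing is lost, but the lines read by the old primary node's final task are read again by the new primary node, which is the duplication-over-data-loss trade-off described above.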

MattWho · ‎02-20-2018

@Richard Corfield The queue stats on a connection reflect the number of FlowFiles and the cumulative size of those queued FlowFiles' content. They do not include any archived content. I am not aware of your throughput rates or the variations in FlowFile content sizes in your dataflow that may explain what you are seeing. A NiFi content claim cannot be archived until there are no active FlowFiles anywhere in your dataflow pointing to that claim. Perhaps some screenshots will help make sure we are talking about the same thing when you say "queue".

Thank you,
Matt

Last Visited	‎07-09-2026 03:17 AM

Member Since	‎07-30-2019 10:41 AM
Posts	3,472
Kudos received	1638

Cloudera Community

Re: ListenNetFlow processor does not decode Cisco ...

Re: Can we detect who did a particular operation i...

Re: How to invoke a url in nifi which is protected...

Re: Retry impacts scheduler

Re: 503 error while copying/versioning big process...

Re: How to append HDFS file using putHDFS where Ni...

Re: HDF/NIFI Best practices for setting up a high ...

Dissecting the NiFi "connection"... Heap usage and...

Re: HDF/NiFi Improving the performance of your UI

HDF/NiFi Improving the performance of your UI

Re: Avoiding Duplicate data with Nifi TileFile pro...

Re: Avoiding Duplicate data with Nifi TileFile pro...

Re: Avoiding Duplicate data with Nifi TileFile pro...

Re: Avoiding Duplicate data with Nifi TileFile pro...

Re: Why is the total size of a NiFi queue increasi...