Member since: 07-30-2019
Posts: 3421
Kudos Received: 1624
Solutions: 1010

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 65 | 01-13-2026 11:14 AM |
|  | 201 | 01-09-2026 06:58 AM |
|  | 522 | 12-17-2025 05:55 AM |
|  | 583 | 12-15-2025 01:29 PM |
|  | 563 | 12-15-2025 06:50 AM |
01-17-2017
02:38 PM
2 Kudos
@Sebastian Carroll Hard to say what any custom code is doing as far as heap usage, but some existing processors can use considerable heap space. I would say that FlowFile attributes consume the majority of the heap space in most cases. A FlowFile consists of two parts: the FlowFile content (which lives in the NiFi content repository) and the FlowFile attributes (metadata about the FlowFile, which lives in heap [1]). While the amount of heap that FlowFile attributes consume is generally small, users can build flows that have the exact opposite effect. If a dataflow uses processors to read large amounts of content and write it into FlowFile attributes, heap usage will climb rapidly. If users allow large connection queues to build within the dataflow, heap usage will also go up.
- Evaluate available system memory and the configured heap size for your NiFi. The heap defaults for NiFi are relatively small. They are set in bootstrap.conf and default to only 512 MB min and max, which is generally too small for any significant dataflow. I recommend setting both min and max to the same value. Adjust these values according to the free memory available on your system without going overboard; try 4096 MB and see how that performs first (see the snippet after this list). Adjusting the heap settings requires a NiFi restart to take effect.
- Evaluate your dataflow for areas where large connection queues exist. Setting backpressure throughout your dataflow is one way to keep queues from growing too large.
- Evaluate your flow for anywhere you may be extracting content from your FlowFiles into FlowFile attributes. Is it necessary, or can the amount of content extracted be reduced?
- Processors like MergeContent, SplitContent, SplitText, etc. can use a lot of heap depending on the incoming FlowFile(s) and configuration. For example, a MergeContent configured to merge 100,000 FlowFiles is going to use a lot of heap binning that many FlowFiles. A better approach is to use two MergeContent processors in a row, with the first merging bundles of 10,000 and the second merging 10 of those bundles to create the desired 100,000 end result. The same goes for SplitText: if your source FlowFile results in excess of 10,000 splits, try using two SplitText processors (the first splitting every 10,000 lines and the second splitting those by every line). With either approach you reduce the number of FlowFiles held in heap memory at any given time.

Notes: [1] -- NiFi uses FlowFile swapping to help reduce heap usage. FlowFile attributes live in heap memory for faster processing. If a connection exceeds the configured swap threshold (default 10,000, set in nifi.properties), NiFi begins swapping FlowFile attributes out to disk. Remember that this swapping is per connection, and it is triggered by object-count thresholds rather than actual heap usage, so the value may need to be adjusted based on your average FlowFile attribute size.
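As a concrete reference, here is a minimal sketch of the settings discussed above; the 4096 MB figure is only the suggested starting point, and the swap threshold shown mirrors the 10,000 default mentioned in the note:

```
# bootstrap.conf -- set min and max heap to the same value
java.arg.2=-Xms4096m
java.arg.3=-Xmx4096m

# nifi.properties -- per-connection FlowFile swap threshold (object count, not heap size)
nifi.queue.swap.threshold=10000
```

After editing bootstrap.conf, restart NiFi for the new heap settings to take effect.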
01-12-2017
11:17 PM
@Adda Fuentes no problem
01-12-2017
11:02 PM
1 Kudo
@Raj B You can think of "Max Bin Age" as the trump card: regardless of any other min criteria being met, the bin will be merged once it reaches this max age. So your assumption is completely correct. That aside, you need to take heap usage into consideration with this dataflow design. FlowFile attributes (metadata) live in heap memory space for performance reasons, so as you bin these FlowFiles throughout the day, your JVM heap usage is going to grow and grow. How many FlowFiles per day are you talking about here? If you are talking in excess of 10,000 FlowFiles, you may need to adjust your dataflow some. For example, use two MergeContent processors back to back: the first merges at, let's say, a max bin age of 5 minutes, and the second merges those bundles into a large 24-hour bundle. So one new FlowFile is created every 5 minutes, and then those 288 merged FlowFiles are merged into a larger FlowFile in the second MergeContent (see the sketch below). Doing it this way greatly reduces heap usage. Of course, depending on volumes, you may need to merge even more often than every 5 minutes to achieve optimal heap usage. Just some food for thought.... Matt
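A rough sketch of the two-stage configuration described above; the property names are MergeContent's, but the entry counts are illustrative and would need tuning for your actual daily volume:

```
# MergeContent #1 -- roll up a small bundle every 5 minutes
Merge Strategy             : Bin-Packing Algorithm
Minimum Number of Entries  : 1
Maximum Number of Entries  : 10000
Max Bin Age                : 5 min

# MergeContent #2 -- combine the 5-minute bundles into one daily FlowFile
Merge Strategy             : Bin-Packing Algorithm
Minimum Number of Entries  : 1
Maximum Number of Entries  : 1000
Max Bin Age                : 24 hours
```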
01-12-2017
10:51 PM
1 Kudo
@Adda Fuentes NiFi authentication always defaults to TLS certificates. If the user does not present a user certificate, NiFi falls back to the alternate configured login identity provider (either LDAP or Kerberos). NiFi does not support specifying more than one of these alternate login identity providers (ldap-provider or kerberos-provider) at a time. Current versions of NiFi have also added SPNEGO support for user authentication. When configured in the nifi.properties file, this authentication falls between user certificates and any providers configured in the login-identity-providers.xml file. Setting up SPNEGO will require configuration changes to your browser to support logging in without needing a username and password as you would with the kerberos-provider. See below for more details on setting up SPNEGO for user authentication: http://bryanbende.com/development/2016/08/31/apache-nifi-1.0.0-kerberos-authentication The identity mapping patterns allow you to take the DN returned by LDAP or from the user's certificate and map it to a different value. This makes it easier to set up user authorizations, since you only need to provide that mapped value as the user name for the authorization instead of the full DN. The Kerberos pattern mapping has a similar intent; for example, you might use pattern mapping to remove the @domain portion of the principal. Matt
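For illustration, a minimal sketch of those identity mapping properties in nifi.properties; the DN pattern shown is just a common example and would need to match the actual DNs your LDAP server or certificates produce:

```
# nifi.properties -- map a full certificate/LDAP DN down to just the CN
nifi.security.identity.mapping.pattern.dn=^CN=(.*?), OU=(.*?)$
nifi.security.identity.mapping.value.dn=$1

# strip the @REALM portion from a Kerberos principal
nifi.security.identity.mapping.pattern.kerb=^(.*?)@(.*?)$
nifi.security.identity.mapping.value.kerb=$1
```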
01-12-2017
10:44 PM
1 Kudo
@Raj B If you have FlowFiles arriving via multiple input ports and then passing through some common set of components downstream, there is no way to tell by looking at a FlowFile in a given queue which input port it originated from. Input and output ports within a process group do not create provenance events either, since they do not modify the FlowFiles in any way. The only way an input or output port generates a provenance event is if it sits at the root canvas level: input ports there generate a "create" event and output ports a "drop" event. Provenance will show a lineage for a FlowFile, which includes any processor that routed or modified the FlowFile in some way. So by looking at the details of the various events in the provenance lineage graph, you can see where the FlowFile traversed your flow. However, as I stated, not all processors create provenance events. When you query provenance, you can access the lineage for any of the query results by clicking the show lineage icon. A lineage graph for the specific FlowFile will then be created and displayed. The red dot shows the event the lineage was calculated from, and every circle is another event in that particular FlowFile's life. You can right-click on any of the events to view its details, including which specific processor in your flow produced that event. Thanks, Matt
01-11-2017
02:29 PM
@Raj B A good rule of thumb is to attempt a retry any time you are dealing with an external system, where failures can result from things outside of NiFi's control (network outages, destination out of disk space, destination already has files of the same name, etc.). That is not to say there are no cases where an internal situation would warrant a retry as well. Using MergeContent as an example: the processor bins incoming FlowFiles until the criteria for a merge are met. If at merge time there is not enough disk space left in the content repository for the new merged FlowFile, it gets routed to failure. Other processing of FlowFiles may free up space, so a retry might then be successful; this is a rarer case and less likely than the external-system examples above. I think your approach is a valid one, and keep in mind, as Joe mentioned on this thread, the NiFi community is working toward better error handling in the future.
01-11-2017
01:29 PM
@Joshua Adeleke Another option might be to use the ReplaceText processor to find the first two lines and replace them with nothing. Glad to hear you got things working for you.
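A minimal sketch of what that ReplaceText configuration could look like; the regex assumes Unix-style line endings and is only one way to express "the first two lines":

```
# ReplaceText -- delete the first two lines of the content
Replacement Strategy : Regex Replace
Evaluation Mode      : Entire text
Search Value         : ^(?:.*\n){2}
Replacement Value    : (empty string)
```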
01-10-2017
09:43 PM
6 Kudos
@Raj B Not all errors are equal. I would avoid lumping all failure relationships into the same error-handling strategy. Some errors are no surprise, can be expected to occur on occasion, and may be a one-time thing that resolves itself. Let's use your example above....
The PutHDFS processor is likely to experience some failures over time due to events outside of NiFi's control. For example, say a file is in the middle of transferring to HDFS when the network connection is lost. NiFi would in turn route that FlowFile to failure. If that failure relationship had been routed back to the PutHDFS, the transfer would likely have succeeded on the subsequent attempt. A better error-handling strategy in this case may be to build a simple error-handling flow that can be used whenever the type of failure might resolve itself.

Here, failed FlowFiles enter at "data" and are checked for a failure counter; if one does not exist, it is created and set to 1, and if it exists, it is incremented by 1. The retry-count check will continue to pass the file to "retry" until the same file has been seen x number of times, and "retry" is routed back to the source processor of the failure. After x attempts the counter is reset, an email is sent, and the file is placed in some local error directory for manual intervention (a sketch of the counter logic follows below). https://cwiki.apache.org/confluence/download/attachments/57904847/Retry_Count_Loop.xml?version=1&modificationDate=1433271239000&api=v2

The other scenario is where the type of failure is not likely to ever correct itself. Your MergeContent processor is a good example here: if the processor failed to merge some FlowFiles, it is extremely likely to fail again, so there is little benefit in looping this failure relationship back on the processor as we did above. In this case you may want to route the processor's failure to a PutEmail processor to notify the end user of the failure and where it occurred in the dataflow. The success of the PutEmail processor may just feed another processor, such as an UpdateAttribute in a stopped/disabled state. This will hold the data in the dataflow until manual intervention can identify the issue and either reroute the data back into the flow once corrected or discard it. If there is concern over available space in your NiFi content repository, I would use a processor such as PutFile, PutHDFS, or PutSFTP to write it out to a different error-file location. Hope this helps, Matt
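A hedged sketch of that counter logic using NiFi Expression Language; the attribute name retry.count and the limit of 3 are illustrative choices, not values taken from the template linked above:

```
# UpdateAttribute -- create the counter on first failure, increment on each pass
retry.count = ${retry.count:replaceNull(0):plus(1)}

# RouteOnAttribute -- each dynamic property becomes a routing relationship
retry  = ${retry.count:le(3)}
giveup = ${retry.count:gt(3)}
```

The "retry" relationship loops back to the failing processor, while "giveup" feeds the PutEmail/PutFile error-handling branch.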
01-10-2017
08:00 PM
3 Kudos
@Raj B Process groups can be nested inside process groups, and with the granular access controls NiFi provides, it may not be desirable for every user with access to the NiFi UI to be able to access all processors or the specific data those processors are using. So in addition to your valid example above, you may want to create stovepipe dataflows based off different input ports, where only specific users are allowed to view and modify the stovepipe dataflow they are responsible for. While you can of course have FlowFiles from multiple upstream sources feed into a single input port and then use a routing-type processor to split them back out into different dataflows (see the sketch below), it can be easier just to have multiple input ports to achieve the same effect with less configuration. Matt
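If you did go the single-port route, here is a minimal sketch of that routing step using RouteOnAttribute; the source.system attribute is hypothetical and would have to be set by each upstream flow before sending:

```
# RouteOnAttribute -- each dynamic property becomes a named relationship
Routing Strategy : Route to Property name
flowA = ${source.system:equals('systemA')}
flowB = ${source.system:equals('systemB')}
```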
01-10-2017
07:03 PM
@Joshua Adeleke If you found this information helpful in guiding your dataflow design, please accept the answer.