About MattWho

MattWho · ‎02-09-2017

@mliem Would you mind sharing your MergeContent processor's configuration? How large is the volume of tar files coming in to your flow? How many concurrent tasks do you have on your UnpackContent? I ask because all of these may factor into the behavior you reported.

My first thought is that you have too few bins configured in your MergeContent processor. MergeContent places FlowFiles from the incoming queues into bins based on the configured "Correlation Attribute Name" (in your case this should be "fragment.identifier"). If MergeContent runs out of available bins, the oldest bin is merged. In your case, since that oldest bin is incomplete (it does not contain all fragments), it is routed to failure.

For example, say you have "Maximum number of bins" set to 10 and your incoming queue contains FlowFiles that were produced from more than 10 original tar files. The MergeContent processor may need to create an 11th bin before all the FlowFiles that correlate to any of the other bins have been processed.

There are a few things you could try here (listed from most recommended to the last thing I would try):

1. Increase the "Maximum number of bins" property in MergeContent.

2. Add the "OldestFlowFileFirstPrioritizer" to the "Selected Prioritizers" list on the queue feeding your MergeContent. This will have a small impact on throughput. When UnpackContent splits your tar files, all the split files will have similar FlowFile creation timestamps. With this prioritizer set, FlowFiles will be placed in bins in timestamp order. If using this strategy, you would still need the number of bins set to the number of concurrent tasks assigned to your UnpackContent processor plus one.

3. Decrease the "Back Pressure Object Threshold" on the incoming queue to the MergeContent processor. This is a soft limit. Say you have it set to 1000 and your UnpackContent untar resulted in 2000 FlowFiles: the queue would jump to 2000, and UnpackContent would then stop until the queue dropped back below 1000. This leaves fewer FlowFiles for your MergeContent processor to bin (meaning fewer bins are needed).

Thanks, Matt
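
To make the bin exhaustion scenario concrete, here is a minimal Python sketch of the behavior described above. It is a simplified model, not NiFi's actual implementation: the function name simulate_binning and the (fragment id, fragment count) tuples are hypothetical, and the only rule modeled is "when no bin is free, the oldest bin is forced out, and an incomplete bin fails".

```python
from collections import OrderedDict


def simulate_binning(queue, max_bins):
    """Simplified model of defragment-style binning (not NiFi's real code).

    queue: list of (fragment_id, fragment_count) tuples in arrival order.
    Returns (merged, failed): fragment ids that merged completely and
    fragment ids whose incomplete bin was forced out.
    """
    bins = OrderedDict()  # fragment_id -> fragments seen; insertion order = bin age
    merged, failed = [], []
    for fragment_id, fragment_count in queue:
        if fragment_id not in bins and len(bins) >= max_bins:
            # No free bin: the oldest bin is forced out while still incomplete.
            oldest_id, _ = bins.popitem(last=False)
            if oldest_id not in failed:
                failed.append(oldest_id)
        bins[fragment_id] = bins.get(fragment_id, 0) + 1
        if bins[fragment_id] == fragment_count:
            # Every fragment has arrived: the bin merges successfully.
            del bins[fragment_id]
            merged.append(fragment_id)
    return merged, failed


# Fragments from three tar files arrive interleaved, two fragments each.
interleaved = [("a", 2), ("b", 2), ("c", 2), ("a", 2), ("b", 2), ("c", 2)]
print(simulate_binning(interleaved, max_bins=2))
print(simulate_binning(interleaved, max_bins=3))
```

With two bins every original file ends up in the failed list, while three bins let all of them merge, which is why increasing the bin count is the first thing to try.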

MattWho · ‎02-08-2017

What is Content Repository Archiving?

There are three properties in the nifi.properties file that deal with the archiving of content in the NiFi Content Repository. The default NiFi values for these are shown below:

nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=true

The purpose of content archiving is to let users view and/or replay, via the provenance UI, content that is no longer in their dataflow(s). The configured values do not have any impact on the amount of provenance history that is retained. If content associated with a particular provenance event no longer exists in the content archive, provenance will simply report to the user that the content is not available.

The content archive is kept within the same directory or directories where you have configured your content repository(s) to exist. When a "content claim" is archived, that claim is moved into an archive subdirectory within the same disk partition where it originally existed. This keeps archiving from affecting NiFi's content repository performance with the unnecessary writes that would come from moving archived files to a new disk/partition, for example.

The configured max retention period tells NiFi how long to keep an archived "content claim" before purging it from the content archive directory.

The configured max usage percentage tells NiFi at what point it should start purging archived content claims to keep the overall disk usage at or below the configured percentage. This is a soft limit. Let's say the content repository is at 49% usage. A 4GB content claim then becomes eligible for archiving. Once this content claim is archived, the usage may exceed the configured 50% threshold. At the next checkpoint, NiFi will remove the oldest archived content claim(s) to bring the overall disk usage back to or below 50%. So this value should never be set to 100%.
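
As a rough illustration of how the retention period and usage percentage interact, here is a small Python sketch. It is only a model of the rules described above: the ArchivedClaim class, the purge_archive function, and the sizes used are hypothetical, and NiFi's real checkpoint logic may differ in its details.

```python
from dataclasses import dataclass


@dataclass
class ArchivedClaim:
    name: str
    size_gb: float
    age_hours: float


def purge_archive(archived, active_gb, disk_gb,
                  max_age_hours=12.0, max_usage_pct=50.0):
    """Return (kept, purged) after applying both archive limits."""
    # Limit 1: anything older than the retention period is purged.
    kept = [c for c in archived if c.age_hours <= max_age_hours]
    purged = [c for c in archived if c.age_hours > max_age_hours]

    # Limit 2: purge the oldest archived claims until usage is at or below
    # the threshold. Active content is never touched by this step.
    kept.sort(key=lambda c: c.age_hours, reverse=True)  # oldest first

    def usage_pct():
        return 100.0 * (active_gb + sum(c.size_gb for c in kept)) / disk_gb

    while kept and usage_pct() > max_usage_pct:
        purged.append(kept.pop(0))
    return kept, purged


archive = [
    ArchivedClaim("A", 10, 13),  # older than 12 hours
    ArchivedClaim("B", 10, 5),
    ArchivedClaim("C", 8, 1),
]
kept, purged = purge_archive(archive, active_gb=35, disk_gb=100)
print([c.name for c in kept], [c.name for c in purged])
```

In this run claim A goes because of its age and claim B goes because usage would otherwise sit at 53%. If active content alone is above the threshold, the archive is simply emptied and usage stays above 50%.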
The above two properties are enforced using an "or" policy. Whichever max occurs first will trigger the purging of archived content claims.

Let's look at a couple of examples:

Example 1: Here you can see that our Content Repository has 35% of its disk consumed by Content Claims that are tied to FlowFiles still active somewhere in one or more dataflows on the NiFi canvas. This leaves 15% of the disk space to be used for archived content claims.

Example 2: Here you can see that the amount of Content Claims still active somewhere within your NiFi flow has exceeded 50% disk usage in the content repository. As such, you can see there are no archived content claims. The content repository archive settings have no bearing on how much of the content repository disk will be used by active FlowFiles in your dataflow(s). As such, it is possible for your content repository to still fill to 100% disk usage.

*** This is the exact reason why, as a best practice, you should avoid co-locating your content repository with any of the other NiFi repositories. It should be isolated to a disk(s) that will not affect other applications or the OS should it fill to 100%.

What is a Content Claim?

I have mentioned "Content Claim" throughout this article. Understanding what a content claim is will help you understand your disk usage. NiFi stores content in the content repository inside claims. A single claim can contain the content from 1 to many FlowFiles. The property that governs how a content claim is built is found in the nifi.properties file. The default configuration value is shown below:

nifi.content.claim.max.appendable.size=50 KB

The purpose of content claims is to make the most efficient use of disk storage. This is especially true when dealing with many very small files.

The configured max appendable size tells NiFi at what point it should stop appending additional content to an existing content claim before starting a new claim.
It does not mean all content ingested by NiFi must be smaller than 50 KB. It also does not mean that every content claim will be at least 50 KB in size.

Example 1: Here you can see we have a single content claim that contains both large and small pieces of content. The overall size has exceeded the configured max appendable size (10 MB in this example) because, at the time NiFi started streaming that final piece of content into this claim, the claim was still below 10 MB.

Example 2: Here we can see we have a content claim that contains only one piece of content. This is because once the content was written to this claim, the claim exceeded the configured max appendable size. If your dataflow(s) deal with nothing but files larger than the configured max appendable size, all your content claims will contain only one piece of content.

So when is a "Content Claim" moved to archive?

A content claim cannot be moved into the content repository archive until none of the pieces of content in that claim are tied to a FlowFile that is active anywhere within any dataflow on the NiFi canvas. What this means is that the reported cumulative size of all the FlowFiles in your dataflows will likely never match the actual disk usage in your content repository. This cumulative size is not the size of the content claims in which the queued FlowFiles reside, but rather just the reported cumulative size of the individual pieces of content. It is for this reason that it is possible for a NiFi content repository to hit 100% disk usage even if the NiFi UI reports a total cumulative queued data size of less than that.

Take Example 1 from above. Assuming the last piece of content written to that claim was 100 GB in size, all it would take is for one of those very small pieces of content in that same claim to still exist queued in a dataflow to prevent this claim from being archived. As long as a FlowFile still points at a content claim, that entire content claim cannot be purged.
When fine tuning your NiFi default configurations, you must always take into consideration your intended data. If you are working with nothing but very small or very large data, leave the default values alone. If you are working with data that ranges greatly from very small to very large, you may want to decrease the max appendable size and/or max flow file settings. By doing so you decrease the number of FlowFiles that make it into a single claim. This in turn reduces the likelihood of a single piece of data keeping large amounts of data still active in your content repository.
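
The gap between reported queue size and real disk usage can be sketched in a few lines of Python. This is a toy model under stated assumptions, not NiFi's repository code: ContentClaim, ContentRepo, and the method names are hypothetical, and the only rules modeled are "append while the claim is below the max appendable size when the write starts" and "a claim is held as long as any of its pieces belongs to an active FlowFile".

```python
MB = 1024 * 1024


class ContentClaim:
    def __init__(self):
        self.pieces = {}  # flowfile id -> content size in bytes

    @property
    def size(self):
        return sum(self.pieces.values())


class ContentRepo:
    def __init__(self, max_appendable_size):
        self.max_appendable_size = max_appendable_size
        self.claims = []
        self.active = set()  # ids of FlowFiles still queued in a dataflow

    def write(self, flowfile_id, size):
        # The limit is checked when the write starts, so a claim can end up
        # much larger than the max appendable size.
        if not self.claims or self.claims[-1].size >= self.max_appendable_size:
            self.claims.append(ContentClaim())
        self.claims[-1].pieces[flowfile_id] = size
        self.active.add(flowfile_id)

    def drop(self, flowfile_id):
        self.active.discard(flowfile_id)

    def queued_size(self):
        # What the UI would report: only the active pieces of content.
        return sum(size for claim in self.claims
                   for fid, size in claim.pieces.items() if fid in self.active)

    def held_on_disk(self):
        # Claims that cannot be archived because some piece is still active.
        return sum(claim.size for claim in self.claims
                   if any(fid in self.active for fid in claim.pieces))


repo = ContentRepo(max_appendable_size=10 * MB)
repo.write("small", 1024)      # starts the first claim
repo.write("big", 100 * MB)    # same claim: it was below the limit at start
repo.write("next", 5)          # first claim is now over the limit: new claim
repo.drop("big")               # the large FlowFile leaves the dataflow
print(repo.queued_size(), repo.held_on_disk())
```

After the large FlowFile is dropped, the queued size is about 1 KB, yet more than 100 MB stays held on disk because the small piece in the same claim is still active.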

MattWho · ‎02-07-2017

@Raj B Unfortunately not, but I think that being able to customize the login screen with some user defined text is a cool idea. I suggest you create an Apache Jira for that enhancement idea. https://issues.apache.org/jira/secure/Dashboard.jspa The only other option is to create a unique label on the canvas of each of your environments. The drawback there is that the label is only visible within the process group where it was created, and if you template your entire flow, that template would be carried from cluster to cluster and the label would thus need to be updated. Thanks, Matt

MattWho · ‎02-07-2017

@Raj B NiFi has an optional property in the nifi.properties file that allows you to place a banner at the top of your canvas: nifi.ui.banner.text= This banner remains visible no matter which process group the user is in. You could configure a unique banner for each of your environments. Thanks, Matt

MattWho · ‎02-07-2017

@Naresh Kumar Korvi The "Conditions" specified for your rule must result in a boolean "true" before the associated "Actions" will be applied against the incoming FlowFile. The condition in your screenshot will always resolve to true.

Looking at your "dirname" attribute, it is not going to return your desired directory path of:

period1-year/p1-week1/date

and your "filename" attribute will be missing the .json extension you are looking for as well:

date.json

I believe what you are trying to do is better accomplished using the below "Condition" and "Action" configurations:

Condition: ${now():format('MM'):le(2):and(${now():format('dd'):le(25)})}
dirname: period1-${now():format('yyyy')}/p1-${now():format('ww')}/${now():format('MM-dd-yyyy')}
filename: ${now():format('MM-dd-yyyy')}.json

Thanks, Matt
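
For readers who want to check what the rule above evaluates to for a given date, here is a rough Python equivalent. It is only an approximation for illustration: the function name apply_rule is hypothetical, and the ISO week number is used as a stand-in for the Expression Language 'ww' format, whose week numbering depends on locale and may differ.

```python
from datetime import datetime


def apply_rule(now):
    """Return (dirname, filename) if the condition holds, else None."""
    # Condition: month <= 2 and day of month <= 25.
    if not (now.month <= 2 and now.day <= 25):
        return None
    date = now.strftime("%m-%d-%Y")
    # Assumption: ISO week number approximates format('ww').
    week = "%02d" % now.isocalendar()[1]
    dirname = "period1-%d/p1-%s/%s" % (now.year, week, date)
    filename = date + ".json"
    return dirname, filename


print(apply_rule(datetime(2017, 2, 7)))
print(apply_rule(datetime(2017, 2, 26)))
```

The first call falls inside the condition and yields a directory and filename; the second falls outside it (day 26) and yields None, meaning the actions would not be applied.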

MattWho · ‎02-06-2017

@Naresh Kumar Korvi You will want to stick with the "Bin-Packing Algorithm" merge strategy in your case. The reason you are ending up with single files is because of the way the MergeContent processor is designed to work. There are several factors in play here:

The MergeContent processor will start the content of each new FlowFile on a new line. However, at times the incoming content of each FlowFile may itself be multiple lines. So it may be desirable to put a user defined "Demarcator" between the content of each FlowFile should you need to differentiate the content of each merge at a later time. If that is the case, the MergeContent processor provides a "Demarcator" property to accomplish this.

An UpdateAttribute processor can be used following the MergeContent processor to set a new "filename" on the resulting merged FlowFile. I am not sure of the exact filename format you want to use, but here is an example config that produces a filename like "2017-02-06":

Thanks, Matt

MattWho · ‎02-06-2017

The "defragment" merge strategy can only be used to Merge files that have very specific attributes assigned to them. That strategy is typically used to reassemble a FlowFile that was previously split apart by NiFi.

MattWho · ‎02-02-2017

How to access your secured NiFi instance or cluster:

So you have secured your NiFi instance (meaning you have configured it for https access) and now you are trying to access the https web UI. Once NiFi is secured, any entity interacting with the UI will need to successfully authenticate and then be authorized to access the particular NiFi resource(s).

As of HDF 2.1.1 or Apache NiFi 1.1.0, NiFi supports authentication via user certificates (default - always enabled), Kerberos/Spnego, or username and password based authentication via LDAP, LDAPS, or Kerberos. The intent of this article is not to cover the authentication process, but rather to cover the initial admin authorization process. We assume for this article that authentication is successful. How do you know? A quick look in the nifi-user.log will tell you if your user's authentication was successful.

Following successful authentication comes NiFi authorization. NiFi authorization can be handled by NiFi's default built-in file based authorizer or handled externally via Ranger. This article will cover the default built-in file based authorizer.

NiFi's built-in file based authorization:

There are four files in NiFi that contain properties used by NiFi's file based authorizer:

nifi.properties
authorizers.xml
users.xml
authorizations.xml

We will start by showing what role each of these files plays in NiFi user/server authorization.

nifi.properties file (Pattern Mapping):

The nifi.properties file contains many key/value pairs that are used by NiFi's core. This file happens to be where users can define identity mapping patterns. These properties allow normalizing user identities such that identities coming from different identity providers (certificates, LDAP, Kerberos) can be treated the same internally in NiFi. It is the resulting value from a matching pattern that is passed to the configured authorizer (NiFi's file based or Ranger).
NiFi includes two examples that are commented out in the nifi.properties file; however, you can add as many unique identity mapping patterns as you need.

nifi.security.identity.mapping.pattern.dn=^CN=(.*?),OU=(.*?),O=(.*?),L=(.*?),ST=(.*?),C=(.*?)$
nifi.security.identity.mapping.value.dn=$1@$2
nifi.security.identity.mapping.pattern.kerb=^(.*?)/instance@(.*?)$
nifi.security.identity.mapping.value.kerb=$1@$2

All mapping patterns use Java regular expressions. They are case sensitive, and white space between elements matters. For example,

^CN=(.*?),OU=(.*?),O=(.*?),L=(.*?),ST=(.*?),C=(.*?)$

would match on:

CN=John Doe,OU=SME,O=SME,L=Bmore,ST=MD,C=US

but would not match on:

cn=John Doe, ou=SME, o=SME, l=Bmore, st=MD, c=US

(Note the lowercase and white spaces.)

Assuming a DN of CN=John Doe,OU=SME,O=SME,L=Bmore,ST=MD,C=US, the associated mapping value would return John Doe@SME.

Additional mapping patterns can be added simply by adding additional properties to the nifi.properties file similar to the above examples, except each must have a unique value following nifi.security.identity.mapping.pattern. or nifi.security.identity.mapping.value. For example:

nifi.security.identity.mapping.pattern.dn2=^CN=(.*?), OU=(.*?)$
nifi.security.identity.mapping.value.dn2=$1

While you can create as many mapping patterns as you like, it is important to make sure that you do not have more than one pattern that can match your incoming user/server identity. Those user identities are run against every configured pattern and only the last pattern that matches will be applied.

authorizers.xml (Default configuration supports file-provider)

This file is where you will set up your NiFi file based authorizer. It is in this file that you will find the "Initial Admin Identity" property. It is very important that you correctly define an "Initial Admin Identity" before starting your secured https NiFi for the first time.
(No worries if you have not; I will discuss how to fix issues when you did not or had a typo.) If you are securing a NiFi cluster, you will also need to configure a "Node Identity x" for each node in your cluster (where "x" is a sequential number).

*** Don't forget to remove the comment lines "<!--" and "-->" from around these properties.

So, what values should I be providing to these properties? That depends on a few factors:

Which authentication method did I use?

User/server/node certificates (default) - User certificates will have a DN in the certificate for that user. This full DN is evaluated by any configured identity mapping patterns and the result is passed to the authorizer. NiFi nodes can only use server certificates to authenticate. Each server is issued server certificates, and the full DNs from those certificates are evaluated by any configured identity mapping patterns and the result is passed to the authorizer.

Kerberos/Spnego - The user's principal is evaluated by any configured identity mapping patterns and the result is passed to the authorizer.

LDAP/LDAPS - Users are presented with a login screen. NiFi's LDAP configuration can be set up to pass either the DN returned by LDAP for the user (default) or the username (supplied at the login screen). This return is evaluated by any configured identity mapping patterns and the result is passed to the authorizer.

Kerberos - Users are presented with a login screen. The user's principal is evaluated by any configured identity mapping patterns and the result is passed to the authorizer.

Did I set up identity pattern mappings?

If no identity mapping patterns were defined, the full return from the configured authentication is passed to the authorizer. If the user/server identity fails to match on any of the defined identity mapping patterns, the full return from the configured authentication is passed to the authorizer.
Whatever the final resulting value is, that is what needs to be entered in the "Initial Admin Identity" and "Node Identity x" properties.

Let's assume the following user/server DNs and that multiple identity mappings were set up in the nifi.properties file:

Sample entity DN: cn=JohnDoe,ou=SME,dc=work
Configured Identity Mapping Pattern: ^cn=(.*?),ou=(.*?),dc=(.*?),dc=(.*?)$
Configured Identity Mapping Value: $1
Resulting value: JohnDoe

Sample entity DN: CN=nifi-server1, OU=NIFI
Configured Identity Mapping Pattern: ^CN=(.*?), OU=(.*?)$
Configured Identity Mapping Value: $1
Resulting value: nifi-server1

Sample entity DN: CN=nifi-server2, OU=NIFI
Configured Identity Mapping Pattern: ^CN=(.*?), OU=(.*?)$
Configured Identity Mapping Value: $1
Resulting value: nifi-server2

Your authorizers.xml file would then look like this:

The values configured here will be used to seed the users.xml and authorizations.xml files.

users.xml

The users.xml file is produced the first time and only the first time NiFi is started securely (https). This file will contain your "Initial Admin Identity" and all your "Node Identity x" configured values:

authorizations.xml

The authorizations.xml file is produced the first time and only the first time NiFi is started securely (https). NiFi will assign the access policies needed by your "Initial Admin Identity" and "Node Identity x" users/servers:

As you can see, your "Initial Admin Identity" user was granted the following resources/access policies:

Resource: /flow (R)
NiFi UI Access Policy: view the UI
Details: All users including admin must have this access policy in order to access and view the NiFi UI.

Resource: /restricted-components (W)
NiFi UI Access Policy: access restricted components
Details: This access policy allows granted users the ability to add/configure NiFi components tagged as restricted on the canvas.

Resource: /tenants (R and W)
NiFi UI Access Policy: access users/user groups (view and modify)
Details: This access policy allows granted users the ability to add/remove/modify new users and user groups to NiFi for authorization.

Resource: /policies (R and W)
NiFi UI Access Policy: access all policies (view and modify)
Details: This access policy allows granted users the ability to add/remove various access policies for any users and user groups.
Resource: /controller (R and W)
NiFi UI Access Policy: access the controller (view and modify)
Details: This access policy allows granted users the ability to view/modify the controller including Reporting Tasks, Controller Services, and Nodes in the Cluster.

You may notice a few additional access policies were granted to your admin user. This will only happen if the NiFi you have secured already had an existing flow.xml.gz file. In this case the "Initial Admin Identity" is also granted access to view and modify the dataflow at the NiFi root canvas level. By default all sub NiFi process groups inherit their access policies from the parent process group. This effectively gives the admin user full access to the dataflow.

The "Node Identity x" servers are granted the following access policies:

Resource: /proxy (R and W)
NiFi UI Access Policy: proxy user requests (view and modify)
Details: Allows proxy machines to send requests on the behalf of others. All nodes in a NiFi cluster must be granted this access policy so users can make changes to the cluster while logged in to any of the NiFi Cluster's nodes.

What do I do if I messed up my "Initial Admin Identity" or "Node Identity x" values when setting up my authorizers.xml file?

It is common for users to incorrectly configure the value for either the "Initial Admin Identity" or "Node Identity x" properties. Common mistakes include bad mapping patterns, case sensitivity issues (LDAP DNs always have the cn, ou, etc. values in lowercase), and white space issues between DN sections (cn=JohnDoe, ou=sme versus cn=JohnDoe,ou=sme). You can use the nifi-user.log to identify the actual value being passed to the authorizer and then follow these steps:

1. Correct your authorizers.xml configuration.
2. Delete or rename the current users.xml and authorizations.xml files on all of your NiFi nodes.
3. Restart all your NiFi nodes.

NiFi will generate new users.xml and authorizations.xml files from the corrected authorizers.xml file.
You should only follow this procedure to correct issues when first setting up a secured NiFi. If an admin was previously able to access your NiFi's canvas, add new users, and grant access policies to those users, all those users and access policies will be lost if you delete the users.xml and authorizations.xml files. Thanks, Matt
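
Since most "Initial Admin Identity" problems come down to what a mapping pattern actually produces, it can help to try a pattern against a sample identity before starting NiFi. The Python sketch below does that under stated assumptions: map_identity is a hypothetical helper, Python's re module is used as a stand-in for Java regular expressions (the simple patterns here behave the same in both), the patterns are assumed not to overlap so evaluation order does not matter, and the Kerberos principal is made up for illustration.

```python
import re


def map_identity(identity, mappings):
    """Apply identity mapping patterns to an identity.

    mappings: list of (pattern, value) pairs, where value refers to capture
    groups as $1, $2, and so on. Patterns are assumed not to overlap.
    """
    for pattern, value in mappings:
        match = re.match(pattern, identity)
        if match:
            # Substitute each $N in the value with capture group N.
            return re.sub(r"\$(\d+)",
                          lambda ref: match.group(int(ref.group(1))), value)
    # No pattern matched: the full identity is passed to the authorizer.
    return identity


mappings = [
    (r"^CN=(.*?), OU=(.*?)$", "$1"),
    (r"^(.*?)/instance@(.*?)$", "$1@$2"),
]

print(map_identity("CN=nifi-server1, OU=NIFI", mappings))
print(map_identity("nifi/instance@EXAMPLE.COM", mappings))  # hypothetical principal
print(map_identity("cn=nifi-server1,ou=NIFI", mappings))    # case and spacing differ
```

The third call shows the common failure mode: because case and white space do not line up with the pattern, nothing matches and the full DN is what the authorizer receives.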

MattWho · ‎02-01-2017

@Narasimma varman Make sure the user the NiFi process is running as on your server has the necessary permissions to access that directory path and remove files from it. Matt

MattWho · ‎01-31-2017

@Raj B Not all NiFi processors will write attributes to FlowFiles about failures or errors. The documentation for each processor should include what attributes are written by that processor and what information those attributes will contain. There is no global enforcement by the NiFi controller on what attributes a processor must create. This is completely in the control of the developer who wrote each processor. That being said, it is good practice that any processor that has a "Failure" relationship should output an "Error" level log message that describes the nature of the failure. This ERROR log message would identify the specific processor that produced it, the specific FlowFile that was routed to failure, and the nature of the failure. It is possible to build a dataflow that monitors NiFi's nifi-app.log (TailFile processor) for ERROR log messages, parses out the relevant information, and passes that along to some monitoring system. Thanks, Matt
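
The parsing step of such a monitoring flow can be sketched outside NiFi in Python. This is only an illustration under assumptions: the log line layout (timestamp, level, thread in brackets, logger, message) is assumed rather than taken from a real nifi-app.log, the sample lines are invented, and extract_errors is a hypothetical helper.

```python
import re

# Assumed layout: "<date> <time> <LEVEL> [<thread>] <logger> <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) "
    r"\[(?P<thread>[^\]]+)\] (?P<logger>\S+) (?P<message>.*)$"
)


def extract_errors(lines):
    """Return one dict per ERROR line with the fields a monitor might need."""
    errors = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and match.group("level") == "ERROR":
            errors.append({
                "timestamp": match.group("timestamp"),
                "logger": match.group("logger"),
                "message": match.group("message"),
            })
    return errors


# Invented sample lines for illustration only.
sample = [
    "2017-01-31 10:15:02,123 INFO [Timer-Driven Process Thread-3] "
    "o.a.n.p.standard.GetFile GetFile[id=abc] picked up file",
    "2017-01-31 10:15:03,456 ERROR [Timer-Driven Process Thread-5] "
    "o.a.n.p.standard.PutFile PutFile[id=def] failed to write FlowFile; "
    "routing to failure",
]
for error in extract_errors(sample):
    print(error["timestamp"], error["logger"], error["message"])
```

Inside NiFi the same idea would be built from processors rather than a script; the sketch only shows which fields can be pulled out of a line once its layout is known.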

Member Since	‎07-30-2019 10:41 AM
Last Visited	‎07-08-2026 02:42 PM
Posts	3,472
Kudos received	1638

Cloudera Community
