Member since: 07-30-2019
Posts: 3133
Kudos Received: 1564
Solutions: 909
02-16-2017
01:22 PM
1 Kudo
@Timothy Spann Is NiFi itself not coming up, or is it only the NiFi UI that can't be accessed? Is NiFi actually shutting back down after you start it? I suggest setting the following property to false in the nifi.properties file (a minimal sketch is shown below) and restarting your NiFi to see if you can then access your UI and remove the problematic components: nifi.flowcontroller.autoResumeState=false If you use Ambari to manage your NiFi, uncheck the corresponding box in the NiFi configs instead. Thanks, Matt
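A minimal sketch of that change, assuming a default install layout where nifi.properties lives under the conf directory:

# conf/nifi.properties
# With autoResumeState=false, all components are left stopped after a restart,
# so the problematic components can be removed before anything starts running.
nifi.flowcontroller.autoResumeState=false

Once the flow has been cleaned up, the property can be set back to true.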
02-16-2017
01:11 PM
@Sonny Heer The HDP stack does not include NiFi. HDF is an entirely different software stack, managed by its own Ambari and installed on its own hardware. There are some posts out there that will walk you through manually adding the NiFi service to an existing HDP stack; however, doing so breaks the upgradeability of your HDP stack. Here is an example of that: https://github.com/abajwa-hw/ambari-nifi-service *** Pay special attention to the "Limitations" noted at the above link. At some point in the future there will be an Ambari platform that can be used to install both HDP and HDF stack services while still supporting upgradeability, but that is not here yet and I don't know when that will happen. Thank you, Matt
02-15-2017
01:14 PM
3 Kudos
@Varun R
The ReplaceText processor has two Evaluation Modes (Line-by-Line and Entire text). Entire text is the default and reads the entire content of your FlowFile into NiFi's JVM heap memory for evaluation. With such a large file this strategy is not ideal and could lead to out-of-memory conditions for your NiFi. If the content of your FlowFile spans multiple lines, you could switch to the Line-by-Line evaluation mode, which uses far less heap memory but produces the same modified content in the outgoing FlowFile. So you might want to try a ReplaceText processor configuration like the following: Thanks, Matt
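The screenshot from the original reply is not reproduced here; a rough sketch of the ReplaceText properties it likely showed, with the Search Value and Replacement Value left as placeholders since they depend on your data:

Evaluation Mode: Line-by-Line
Replacement Strategy: Regex Replace
Search Value: <your regex>              (placeholder)
Replacement Value: <your replacement>   (placeholder)
Maximum Buffer Size: 1 MB               (applies per line when using Line-by-Line mode)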
02-13-2017
10:54 PM
1 Kudo
@marksf I recommend against hand editing the users.xml and authorizations.xml files. You should be adding additional users and granting them the appropriate access policies via the NiFi UI instead. The image below covered where to add new users and how to set both global and component-level access policies. At a minimum, every user added to NiFi must be granted the global "view the user interface" access policy before they can access the NiFi UI. Other policies then allow users to interact with the various features/capabilities within the UI. Matt
02-13-2017
06:48 PM
1 Kudo
@Michal R I think I understand now. You have a FlowFile that contains an attribute "suffix" with some value assigned to it. Let's assume that value is "cat" for now. You have a registry file that contains a bunch of key/value pairs for things like "nifi.prefix.cat", "nifi.prefix.dog", etc. Now you want to return the value of "nifi.prefix.cat". In that case, you would use the following EL statement in your UpdateAttribute processor: ${${suffix:prepend('nifi.prefix.')}} The first input to an EL statement is the subject. The function is then applied against the value returned for that subject. Thanks, Matt
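A quick walk-through of how that nested expression resolves, using the assumed values above and a placeholder registry entry nifi.prefix.cat=somevalue:

${suffix}                              returns cat
${suffix:prepend('nifi.prefix.')}      returns nifi.prefix.cat
${${suffix:prepend('nifi.prefix.')}}   returns somevalue (the outer ${...} treats the built string as a new subject and looks it up)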
02-13-2017
05:33 PM
@Michal R I believe what you are trying to do is take two key/value pairs assigned as FlowFile attributes on an incoming FlowFile and merge them together to form a new filename: ${nifi.prefix}.${suffix} Assuming your incoming FlowFile has the following attributes: nifi.prefix=myfile
suffix=cat
The result of the above NiFi Expression Language statement would be: myfile.cat Thanks, Matt
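For illustration only (the processor choice is my assumption, not stated above), in an UpdateAttribute processor this would be added as a dynamic property whose name is the attribute to set and whose value is the expression:

filename = ${nifi.prefix}.${suffix}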
02-09-2017
02:51 PM
2 Kudos
@milind pandit The errors you are seeing would be expected during startup, since your ZK will not establish quorum until all three nodes have completely started. As a node goes through its startup process it will begin trying to establish ZK quorum with the other ZK nodes. Those other nodes may not be running yet if they are still starting as well, which produces a lot of ERROR messages. Using the embedded ZK is not recommended in a production environment since it is stopped and started along with NiFi; it is best to use dedicated external ZooKeepers in production. If the errors persist even after all three nodes are fully running, check the following (a rough configuration sketch is shown below the list):
1. Verify that you have enabled the embedded ZK on all three of your nodes.
2. Verify the ZK instance on each of your servers started and bound to the ports configured in your zookeeper.properties file.
3. Make sure you are using resolvable hostnames for each of your ZK nodes.
4. Make sure you do not have any firewalls that would prevent your NiFi nodes from communicating with each other over the configured ZK hostnames and ports.
Thanks, Matt
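A rough sketch of what a three-node embedded ZK setup typically looks like; the hostnames are placeholders and the exact property set can vary by NiFi version:

# conf/zookeeper.properties (identical on all three nodes)
server.1=nifi-node1.example.com:2888:3888
server.2=nifi-node2.example.com:2888:3888
server.3=nifi-node3.example.com:2888:3888
clientPort=2181

# conf/nifi.properties (on every node)
nifi.state.management.embedded.zookeeper.start=true
nifi.zookeeper.connect.string=nifi-node1.example.com:2181,nifi-node2.example.com:2181,nifi-node3.example.com:2181

Each node also needs a myid file in its ZK state directory whose number matches that node's server.N entry.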
02-09-2017
02:07 PM
1 Kudo
@mliem
Would you mind sharing your MergeContent processor's configuration? How large is the volume of tar files coming into your flow? How many concurrent tasks do you have on your UnpackContent? I ask because all of these may play a part in the behavior you reported. My first thought is that you have too few bins configured in your MergeContent processor. The MergeContent processor starts placing FlowFiles from the incoming queues into bins based on the configured "Correlation Attribute Name" (which in your case should be "fragment.identifier"). If the MergeContent processor runs out of available unique bins, the oldest bin is merged. In your case, since that oldest bin is incomplete (does not contain all fragments), it is routed to failure. For example, say you have Maximum number of Bins set to 10 and your incoming queue contains FlowFiles produced from more than 10 original tar files. It is possible that the MergeContent processor tries to create that 11th bin before all the FlowFiles that correlate to one of the existing bins have been processed. There are a few things you could try here, 1 being the most recommended and the bottom of the list being the last thing I would try (a rough configuration sketch follows the list):
1. Increase the "Maximum number of Bins" property in MergeContent.
2. Add the "OldestFlowFileFirstPrioritizer" to the "Selected Prioritizers" list on the queue feeding your MergeContent. This has a small impact on throughput performance. When UnpackContent splits your tar files, all split files will have similar FlowFile creation timestamps, so with this prioritizer FlowFiles are placed into bins in timestamp order. If using this strategy, you would still need the number of bins set to the number of concurrent tasks assigned to your UnpackContent processor plus one.
3. Decrease the "Back Pressure Object Threshold" on the incoming queue to the MergeContent processor. This is a soft limit: say you have it set to 1000 and your UnpackContent untar resulted in 2000 FlowFiles; the queue would jump to 2000, and the UnpackContent processor would then stop until the threshold dropped back below 1000. This leaves fewer FlowFiles for your MergeContent processor to bin (meaning fewer bins are needed).
Thanks,
Matt
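A rough sketch of the MergeContent properties discussed above; the values are illustrative assumptions, not taken from the original thread:

Merge Strategy: Bin-Packing Algorithm     (assumed, since a Correlation Attribute Name is in use)
Correlation Attribute Name: fragment.identifier
Maximum number of Bins: 25                (should comfortably exceed the number of tar files likely in flight at once)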
02-08-2017
09:55 PM
19 Kudos
What is Content Repository Archiving?

There are three properties in the nifi.properties file that deal with the archiving of content in the NiFi Content Repository. The default NiFi values for these are shown below:

nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=true

The purpose of content archiving is so that users can view and/or replay content via the provenance UI that is no longer active in their dataflow(s). The configured values do not have any impact on the amount of provenance history that is retained. If the content associated with a particular provenance event no longer exists in the content archive, provenance will simply report to the user that the content is not available.

The content archive is kept within the same directory or directories where you have configured your content repository(s) to exist. When a "content claim" is archived, that claim is moved into an archive subdirectory within the same disk partition where it originally existed. This keeps archiving from affecting NiFi's content repository performance with unnecessary writes, such as moving archived files to a new disk/partition.

The configured max retention period tells NiFi how long to keep an archived "content claim" before purging it from the content archive directory. The configured max usage percentage tells NiFi at what point it should start purging archived content claims to keep the overall disk usage at or below the configured percentage. This is a soft limit. Let's say the content repository is at 49% usage and a 4 GB content claim then becomes eligible for archiving. At the time this content claim is archived, the usage may exceed the configured 50% threshold. At the next checkpoint, NiFi will remove the oldest archived content claim(s) to bring the overall disk usage back to or below 50%. This value should therefore never be set to 100%. The above two properties are enforced as an OR: whichever max is reached first will trigger the purging of archived content claims.

Let's look at a couple of examples:

Example 1: Here you can see that our content repository has 35% of its disk consumed by content claims that are tied to FlowFiles still active somewhere in one or more dataflows on the NiFi canvas. This leaves 15% of the disk space to be used for archived content claims.

Example 2: Here you can see that the amount of content claims still active somewhere within your NiFi flow has exceeded 50% disk usage in the content repository. As such, there are no archived content claims.

The content repository archive settings have no bearing on how much of the content repository disk will be used by active FlowFiles in your dataflow(s). As such, it is possible for your content repository to still fill to 100% disk usage. *** This is the exact reason why, as a best practice, you should avoid co-locating your content repository with any of the other NiFi repositories. It should be isolated to a disk(s) that will not affect other applications or the OS should it fill to 100%.

What is a Content Claim?

I have mentioned "content claims" throughout this article. Understanding what a content claim is will help you understand your disk usage. NiFi stores content in the content repository inside claims. A single claim can contain the content from one to many FlowFiles. The property that governs how a content claim is built is found in the nifi.properties file. The default configuration value is shown below:

nifi.content.claim.max.appendable.size=50 KB

The purpose of content claims is to make the most efficient use of disk storage. This is especially true when dealing with many very small files.

The configured max appendable size tells NiFi at what point it should stop appending additional content to an existing content claim and start a new claim. It does not mean all content ingested by NiFi must be smaller than the configured value, and it does not mean that every content claim will be at least that size. (The examples below assume a configured max appendable size of 10 MB for illustration.)

Example 1: Here you can see we have a single content claim that contains both large and small pieces of content. The overall size has exceeded the 10 MB max appendable size because at the time NiFi started streaming that final piece of content into the claim, the size was still below 10 MB.

Example 2: Here we can see a content claim that contains only one piece of content. This is because once that content was written to the claim, the claim exceeded the configured max appendable size. If your dataflow(s) deal with nothing but files over 10 MB in size, all your content claims will contain only one piece of content.

So when is a "Content Claim" moved to archive?

A content claim cannot be moved into the content repository archive until none of the pieces of content in that claim are tied to a FlowFile that is active anywhere within any dataflow on the NiFi canvas. What this means is that the reported cumulative size of all the FlowFiles in your dataflows will likely never match the actual disk usage in your content repository. That cumulative size is not the size of the content claims in which the queued FlowFiles reside, but rather just the cumulative size of the individual pieces of content. It is for this reason that a NiFi content repository can hit 100% disk usage even if the NiFi UI reports a total cumulative queued data size of less than that. Take Example 1 from above: assuming the last piece of content written to that claim was 100 GB in size, all it would take is one of the very small pieces of content in that same claim still being queued in a dataflow to prevent the claim from being archived. As long as a FlowFile still points at a content claim, that entire content claim cannot be purged.

When fine tuning your NiFi default configurations, you must always take your intended data into consideration. If you are working with nothing but very small OR very large data, leave the default values alone. If you are working with data that ranges greatly from very small to very large, you may want to decrease the max appendable size and/or max flow files settings. By doing so, you decrease the number of FlowFiles that make it into a single claim. This in turn reduces the likelihood of a single small piece of data keeping large amounts of data active in your content repository.
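A quick worked example of how the two archive limits interact, using made-up numbers:

partition size           = 500 GB   (assumed for illustration)
max usage percentage     = 50%  ->  purge threshold = 500 GB x 50% = 250 GB
active content claims    = 175 GB   (35% of the disk)
room left for archive    = 250 GB - 175 GB = 75 GB   (15% of the disk)

Independently, any archived claim older than the 12 hour retention period is purged; whichever limit is reached first wins.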
02-08-2017
05:22 PM
@milind pandit Tell us something about your particular NiFi installation method:
1. Was this NiFi cluster installed via Ambari or command line?
2. Are you using NiFi internal zookeepers or external zookeepers?
3. Is this the entire stack trace from the nifi-app.log?