Member since: 07-30-2019
Posts: 3434
Kudos Received: 1632
Solutions: 1012
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 125 | 01-27-2026 12:46 PM |
| | 539 | 01-13-2026 11:14 AM |
| | 1172 | 01-09-2026 06:58 AM |
| | 977 | 12-17-2025 05:55 AM |
| | 486 | 12-17-2025 05:34 AM |
10-11-2016 05:04 PM
@Saikrishna Tarapareddy Almost... NiFi stores FlowFile content in claims. A claim can contain the content of one to many FlowFiles. Claims allow NiFi to use large disks more efficiently when dealing with small content files. A claim will only be moved into the archive directory once every FlowFile associated with that claim has been auto-terminated in the dataflow(s).

Also keep in mind that you can have multiple FlowFiles pointing at the same content (this happens, for example, when you connect the same relationship multiple times from a processor). Let's say you routed a success relationship twice off of an UpdateAttribute processor. NiFi does not replicate the content, but rather creates another FlowFile that points at that same content. So both of those FlowFiles now need to reach an auto-termination point before that content claim can be moved to archive.

Content claims are configured in the nifi.properties file:

nifi.content.claim.max.appendable.size=10 MB
nifi.content.claim.max.flow.files=100

The above are the defaults. If a file comes in at less than 10 MB in size, NiFi will try to append the next file(s) to the same claim, unless the combination of those files would exceed the 10 MB max or the claim has already reached 100 files. If a file comes in that is larger than 10 MB, it ends up in a claim all by itself. For example, with the defaults, one hundred 50 KB files can share a single claim, while a 25 MB file gets a claim of its own; that shared claim cannot be archived until all one hundred of its FlowFiles have been auto-terminated. Thanks, Matt
10-11-2016 03:09 PM
1 Kudo
@Saikrishna Tarapareddy The retention settings in the nifi.properties file apply to the NiFi data archive only. They do not apply to files that are active (queued or still being processed) in any of your dataflows. NiFi will allow you to continue to queue data in your dataflow all the way up to the point where your content repository disk is 100% utilized. That is why backpressure on connections throughout your dataflow is important: it controls the number of FlowFiles that can be queued. It is also important to isolate the content repository from the other NiFi repositories, so that if it fills the disk, it does not cause corruption of those other repositories.

If content repository archiving is enabled (nifi.content.repository.archive.enabled=true), then the retention and usage percentage settings in the nifi.properties file take effect. NiFi will archive FlowFile content once the FlowFiles are auto-terminated at the end of a dataflow. Data active in your dataflow will always take priority over archived data; if your dataflow queues to the point that your content repository disk is full, the archive will be empty.

The purpose of archiving data is to allow users to replay data from any point in the dataflow, or to download and examine the content of a FlowFile after it has been processed through a dataflow, via the NiFi provenance UI. For many this is a valuable feature; for others, not so important. If it is not important for your org to archive any data, you can simply set archive enabled to false.

FlowFiles that are not processed successfully within your dataflow are routed to failure relationships. As long as you do not auto-terminate any of your failure relationships, those FlowFiles remain active/queued in your dataflow. You can then build some failure-handling dataflow, if you like, to make sure you do not lose that data. Matt
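For reference, the archive behavior is governed by these nifi.properties entries (the values shown here are the shipped defaults; tune the retention period and usage percentage to your own needs):

nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=true

With these defaults, archived content is removed once it is older than 12 hours or once the content repository disk passes 50% utilization, whichever comes first.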
10-10-2016 07:45 PM
1 Kudo
@Saikrishna Tarapareddy RAID 1 requires a minimum of 2 disks and RAID 10 requires a minimum of 4 disks, so with your (8) 600GB disks you can build either:

a. (2) RAID 10
b. (2) RAID 1 and (1) RAID 10
c. (4) RAID 1

My recommendation would be to provision your 8 disks as (4) RAID 1 configurations (each 2 disks: 600 GB + 600 GB mirrored, for a usable capacity of 600 GB):

(1) RAID 1 (~600 GB capacity) with the following mounted logical volumes:
100 - 150 GB --> /var/log/nifi
100 GB --> /opt/nifi/flowfile_repo
50 GB --> /opt/nifi/database_repo
remainder --> /

(1) RAID 1 (~600 GB capacity): entire RAID as a single logical volume --> /opt/nifi/provenance_repo

(1) RAID 1 (~600 GB capacity): entire RAID as a single logical volume --> /opt/nifi/content_repo1

(1) RAID 1 (~600 GB capacity): entire RAID as a single logical volume --> /opt/nifi/content_repo2

The above will give you ~1.2TB of content_repo storage and ~600GB of provenance history storage. If provenance history is not as important to you, you could carve off another logical volume on the first RAID 1 for your provenance_repo and allocate all (3) remaining RAID 1 arrays for content repositories.

*** Note: NiFi can be configured to use multiple content repositories in the nifi.properties file:

nifi.content.repository.directory.default=/opt/nifi/content_repo1/content_repository <-- This line exists already
nifi.content.repository.directory.repo2=/opt/nifi/content_repo2/content_repository <-- This line would be manually added.
nifi.content.repository.directory.repo3=/opt/nifi/content_repo3/content_repository <-- This line would be manually added.

NiFi will do file-based striping across all content repos. Thanks, Matt
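As a sketch, the remaining repository paths from the layout above would be wired up in nifi.properties roughly like this (the trailing subdirectory names are illustrative; NiFi creates them on startup):

nifi.flowfile.repository.directory=/opt/nifi/flowfile_repo/flowfile_repository
nifi.database.directory=/opt/nifi/database_repo/database_repository
nifi.provenance.repository.directory.default=/opt/nifi/provenance_repo/provenance_repository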
10-07-2016 04:50 PM
For other users/readers who do not know, HDF 2.0 includes the following processors as part of the release:

GetKafka and PutKafka --> support Kafka 0.8
ConsumeKafka and PublishKafka --> support Kafka 0.9
ConsumeKafka_0_10 and PublishKafka_0_10 --> support Kafka 0.10

Thanks, Matt
10-07-2016 02:59 PM
2 Kudos
@Ramil Akhmadeev HDP 2.5 comes with Apache Kafka 0.10.0.1. The NiFi GetKafka processor uses the Kafka 0.8 client library. For communicating with Kafka 0.10 you should be using the ConsumeKafka_0_10 NiFi processor.
10-04-2016 04:53 PM
1 Kudo
@Ankit Jain Let me make sure I understand your flow completely.

- You have 4 ConsumeKafka processors all reading from the same topic? If this is your intent, you should instead have a single ConsumeKafka processor with the success relationship drawn off of it 4 times (one to each unique PutHDFS processor). This cuts down on disk I/O, since the consumed data is then only written to the NiFi content repository once.
- Then you are trying to write that same data to 4 different HDFS endpoints?

With only 3 partitions on your Kafka topic, you can only have three consumers at a time. With 4 nodes in your cluster, one of the nodes at any given time will not be consuming any data. Optimally, the number of partitions would be equal to, or a multiple of, the number of nodes in your NiFi cluster. (For example, with 4 partitions you would have 4 nodes running the ConsumeKafka processor with 1 concurrent task; with 8 partitions you would have 4 nodes running the ConsumeKafka processor with 2 concurrent tasks.) It would be interesting to know more about your custom Kafka processor and how it differs from the "Max Poll Records" property in the existing ConsumeKafka processor.

Redistributing data across your cluster is only necessary when dealing with ingest-type processors that are not cluster friendly, such as GetSFTP, ListSFTP, GetFTP, etc. With ConsumeKafka, the most optimized approach is as I described above.

On your question about how to know whether all files were consumed from the topic: a Kafka topic is typically a living thing, with files continually written to and removed from it, so I am not sure how NiFi would know when all files have been consumed. NiFi will just continue to poll the topic for new files; if there is nothing new, NiFi gets nothing. It is the Kafka server that keeps track of which files were served up to a consumer; NiFi does not keep a listing itself. Data is not passed to the success relationship until it is consumed completely successfully. NiFi provenance could be used to track particular files or list all FlowFiles created by a ConsumeKafka processor, but you would need to know how many files were on the topic, and NiFi will not know that. Matt
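If you want to confirm the partition count on your topic, the standard Kafka CLI can show it; a sketch for a Kafka 0.10 install (the ZooKeeper host and topic name here are placeholders):

bin/kafka-topics.sh --describe --zookeeper zk-host:2181 --topic your-topic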
10-03-2016 12:18 PM
2 Kudos
@vnandigam There are two parts to successfully accessing the NiFi UI: authentication and authorization. Since you are getting the insufficient permissions screen, you have successfully authenticated.

First, confirm the DN pattern of the user that has successfully authenticated. If you tail the nifi-user.log while you access your NiFi's UI, you will see a line similar to the following:

2016-10-03 11:47:15,134 INFO [NiFi Web Server-65795] o.a.n.w.s.NiFiAuthenticationFilter Authentication success for CN=nifiadmin,OU=hortonworks

Examine the DN presented. Does it match exactly what you had in your "Initial Admin Identity" property?

Next, confirm that this user was properly added to the users.xml file:

<user identifier="9d7b4fe2-8e8b-30a5-8e2a-f6a6a18addfa" identity="CN=nifiadmin,OU=hortonworks"/>

If the user exists, it will have been assigned a UUID (the above UUID is just an example; yours will be different).

Next, verify this user was given the ability to "view the user interface" by examining the authorizations.xml file. Within this file you would expect to see the user's UUID assigned to one or more policies. In order to even see the UI, users must have "R" on the "/flow" policy:

<policy identifier="6a57bf03-2a93-39d0-87dd-e3aa30f0cd4d" resource="/flow" action="R">
<user identifier="9d7b4fe2-8e8b-30a5-8e2a-f6a6a18addfa"/>
</policy>

In order to be able to add users to additional access policies, the user would also need "R" and "W" on the "/policies" policy (you can think of this as the global admin policy):

<policy identifier="9a3a1c92-fa10-3f9d-b2f7-5cd56cd2ca00" resource="/policies" action="R">
<user identifier="9d7b4fe2-8e8b-30a5-8e2a-f6a6a18addfa"/>
</policy>
<policy identifier="1ff611dd-1536-31f5-a610-64e192e4c43c" resource="/policies" action="W">
<user identifier="9d7b4fe2-8e8b-30a5-8e2a-f6a6a18addfa"/>
</policy>

If your user has both of the above, you should be able to access the UI and use the interface to grant additional users access and add additional levels of access for yourself and/or any user you added. The following policies are what give a user the ability to create, modify, and delete users and/or groups:

<policy identifier="dee16f9e-1f09-37ee-806b-e372f1051816" resource="/tenants" action="R">
<user identifier="9d7b4fe2-8e8b-30a5-8e2a-f6a6a18addfa"/>
</policy>
<policy identifier="69839728-eaf3-345d-849f-e2790cf236ab" resource="/tenants" action="W">
<user identifier="9d7b4fe2-8e8b-30a5-8e2a-f6a6a18addfa"/>
</policy>

If you find that your authorizations.xml file was empty (had no policies set in it), it is likely your NiFi had been started prior to you setting the "Initial Admin Identity" property. This property ONLY works the first time NiFi is started; if the authorizations.xml file was already generated, it will not be re-generated or updated on later starts of NiFi. To correct this, you can delete the authorizations.xml file and restart your NiFi. Since the file does not exist this time, the "Initial Admin Identity" user will be created. *** Note: if other users already have granted authorizations in this file, those will be lost and will need to be re-created. Only delete the authorizations.xml file if you wish to start over from scratch. Thanks, Matt
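As a quick aid for the first step above, you can follow the log while you attempt to log in; a minimal sketch, assuming the default logs directory under the NiFi install:

tail -f logs/nifi-user.log | grep -i "authentication"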
09-30-2016 02:39 PM
2 Kudos
@Timothy Spann Looks like you do not have enough file handles. The following command will show your current open file limits:

# ulimit -a

The open files limit should be a minimum of 10000, but may need to be even higher depending on the dataflow. Matt
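To raise the limit, a minimal sketch for /etc/security/limits.conf, assuming NiFi runs as the user nifi (choose values suited to your flow):

nifi  soft  nofile  50000
nifi  hard  nofile  50000

A new login session (or service restart) is needed before the raised limits take effect.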
09-29-2016 04:18 PM
@Breandán Mac Parland
If you are looking for a way to generate a file with the name "_SUCCESS", you can use the GenerateFlowFile processor to generate a file with random data as its content. You can then use an UpdateAttribute processor to set the filename to "_SUCCESS" by adding a new property with a property name of "filename" and a value of "_SUCCESS". Matt
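Concretely, that is a single user-added (dynamic) property on the UpdateAttribute processor, sketched here:

Property name: filename
Value: _SUCCESS

A downstream processor such as PutHDFS will then write the FlowFile out under that name.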
09-27-2016 02:16 PM
2 Kudos
@Parag Garg NiFi can certainly handle dataflows well in excess of 123 processors and well in excess of the number of FlowFiles you have here. Different processors exhibit different resource (CPU, memory, and disk I/O) strain on your hardware.

In addition to processors having an impact on memory, so do FlowFiles themselves. A FlowFile is a combination of the physical content (stored in the NiFi content repository) and the FlowFile attributes (metadata associated with the content, stored in heap memory). You can experience heap memory issues if your FlowFiles have very large attribute maps (for example, from extracting large amounts of content into attributes).

The first step is identifying which processor(s) in your flow are memory intensive and are producing your OutOfMemoryError. Processors such as SplitText, SplitXML, and MergeContent can use a lot of heap if they are producing a lot of split files from a single file or merging a large number of files into a single file. The reason is that the merging and splitting happen in memory until the resulting FlowFile(s) are committed to the output relationship. There are ways of handling this resource exhaustion via dataflow design: for example, merging a smaller number of files multiple times (using multiple MergeContent processors) to produce that one large file, or splitting files multiple times (using multiple Split processors). Also be mindful of the number of concurrent tasks assigned to these memory-intensive processors.

Running with 4 GB of heap is good, but depending on your dataflow, you may find yourself needing 8 GB or more of heap to satisfy the demand created by your dataflow design. Thanks, Matt
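For reference, the heap is set in conf/bootstrap.conf; a sketch raising it to 8 GB (the java.arg.2/java.arg.3 entry names match the file NiFi ships with, but verify against your own copy):

java.arg.2=-Xms8g
java.arg.3=-Xmx8g

NiFi must be restarted for new heap settings to take effect.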