Member since: 07-30-2019
Posts: 3131
Kudos Received: 1564
Solutions: 909
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 128 | 01-09-2025 11:14 AM
 | 773 | 01-03-2025 05:59 AM
 | 413 | 12-13-2024 10:58 AM
 | 445 | 12-05-2024 06:38 AM
 | 367 | 11-22-2024 05:50 AM
10-17-2016
12:39 PM
@Josh Elser @srinivas padala The "Read/Write" stats on the processor have nothing to do with writing to your SQL end-point. This particular stat is all about reads from and writes to the NiFi content repository. It helps identify where in your flow you may have high disk I/O, in the form of either reads or the more expensive writes. From the screenshot above, I see that this processor brought in 35,655 FlowFiles off inbound connections in the past 5 minutes. It read 20.87 MB of content from the content repository in that same timeframe. The processor then output 0 FlowFiles to any outbound connection (this indicates all FlowFiles were routed to an auto-terminated relationship). Assuming only the "success" relationship was auto-terminated, all data was sent successfully. If the "failure" relationship (which should not be auto-terminated here) is routed to another processor, the 0 "out" indicates that in the past 5 minutes 0 files failed. The "Tasks" stat shows a cumulative total of CPU usage reported over the past 5 minutes; a high "Time" value indicates a CPU-intensive processor. Thanks, Matt
10-14-2016
12:29 PM
4 Kudos
@Harry You should be able to simply use two ReplaceText processors in series to create the XML structure you are looking for:
The first ReplaceText is configured to prepend text to the binary content, and the second to append text to the resulting content from the above. *** Note: if you hold the shift key while hitting enter, you will create a new line in the property value editor, as used in the examples below.
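Since the original screenshots are not shown here, a rough sketch of the two configurations follows (property names per the ReplaceText processor; the tag names are only examples to adjust for your own XML):

ReplaceText #1 (prepend the opening tag):
Replacement Strategy = Prepend
Evaluation Mode      = Entire text
Replacement Value    = <flowfile-content>
                       (followed by a new line entered with shift+enter)

ReplaceText #2 (append the closing tag):
Replacement Strategy = Append
Evaluation Mode      = Entire text
Replacement Value    = (a new line entered with shift+enter, then)
                       </flowfile-content>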
The output from the above would be the original binary content wrapped between <flowfile-content> and </flowfile-content>. Now all you need to do is adjust the prepend and append values to the specific text you need. Thanks, Matt
10-12-2016
06:04 PM
The MergeContent processor simply bins and merges the FlowFiles it sees on an incoming connection at run time. In your case you want each bin to have a minimum of 100 FlowFiles before merging, so you will need to specify that in the "Minimum Number of Entries" property. I never recommend setting any minimum value without also setting the "Max Bin Age" property. Let's say you only ever get 99 FlowFiles, or the amount of time it takes to get to 100 exceeds the useful age of the data being held. Those FlowFiles will sit in a bin indefinitely, or for an excessive amount of time, unless that exit age has been set. Also keep in mind that if you have more than one connection feeding your MergeContent processor, on each run it looks at the FlowFiles on only one connection; it moves in round-robin fashion from connection to connection. NiFi provides a "funnel" which allows you to merge FlowFiles from many connections onto a single connection. Matt
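As a rough sketch of the settings described above (property names per the MergeContent processor; the values other than the minimum of 100 are only illustrative):

Merge Strategy             = Bin-Packing Algorithm
Minimum Number of Entries  = 100
Maximum Number of Entries  = 1000
Max Bin Age                = 5 mins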
10-12-2016
04:39 PM
@boyer It may be helpful to see your dataflow to completely understand what you are doing. When you say you "call 2 other apis at the same time", does that mean you are forking the success relationship from the HandleHttpRequest to two downstream processors? Then you are taking the successes from those processors and merging them back together into a single FlowFile before sending to the HandleHttpResponse processor? Assuming the above is true, how do you have your MergeContent processor configured? 1. I would suggest you use "http.context.identifier" as the "Correlation Attribute Name" so that only FlowFiles originating from the same HandleHttpRequest are merged together. 2. I also suggest setting "Attribute Strategy" to "Keep All Unique Attributes" (if 'Keep All Unique Attributes' is selected, any attribute on any FlowFile that gets bundled will be kept unless its value conflicts with the value from another FlowFile). This will be useful if your two intermediate processors set any unique attributes you want to keep on the resulting merged FlowFile. You also want to make sure that your FlowFile makes it from the Request to the Response before the configured expiration in your "StandardHttpContextMap" controller service. Thanks, Matt
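A rough sketch of the MergeContent settings for this request/response pattern (the entry count and bin age shown are only illustrative; keep the Max Bin Age below the StandardHttpContextMap expiration):

Merge Strategy              = Bin-Packing Algorithm
Correlation Attribute Name  = http.context.identifier
Attribute Strategy          = Keep All Unique Attributes
Minimum Number of Entries   = 2
Max Bin Age                 = 30 sec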
10-12-2016
02:24 PM
1 Kudo
@Saikrishna Tarapareddy The FlowFile repo will never get close to 1.2 TB in size. That is a lot of wasted money on hardware. You should inquire with your vendor about having them split that RAID into multiple logical volumes, so you can allocate a large portion of it to other things. Logical volumes are also a safe way to protect the RAID 1 where your OS lives. If some error condition should occur that results in a lot of logging, the application logs may eat up all your disk space, affecting your OS. With logical volumes you can protect your root disk. If that is not possible, I would recommend changing your setup to a bunch of RAID 1 arrays. With the 16 x 600 GB hard drives you have allocated above, you could create 8 RAID 1 disk arrays:
- 1 for root + software install + database repo + logs (make sure you have some monitoring set up to watch disk usage on this RAID if logical volumes cannot be supported)
- 1 for the FlowFile repo
- 3 for the content repo
- 3 for the provenance repo
Thanks, Matt
10-12-2016
01:32 PM
3 Kudos
@Ankit Jain When a secure NiFi is started for the first time, a users.xml and authorizations.xml file is generated. The users.xml that is created will have your users added to it using the DNs provided in your authorizers.xml file (Initial Admin Identity, Node Identity 1, Node Identity 2, Node Identity 3, Node Identity 4, etc.). Each of those "users" is assigned a UUID which is then used to set some required policies in the authorizations.xml file in order to be able to access the NiFi UI. At a minimum, the UUIDs of all "Node Identity" DNs need to be assigned to the /proxy resource (policy) and the /flow resource (read/R) inside that file. Your "Initial Admin" DN should have /flow (read/R and write/W) and /policies (R and W). If NiFi was secured and started prior to some or all of the above DNs being set in the authorizers.xml, the users.xml and authorizations.xml files will be created without those entries. Updating these DN properties in the authorizers.xml file later will not cause updates to occur to these two files. If you find this is what occurred in your case, you can stop your NiFi nodes, delete both the users.xml and authorizations.xml files from all nodes, and restart. On restart, since these files do not exist, NiFi will generate them again using the DNs in your authorizers.xml file on each node. Thanks, Matt
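For reference, a minimal sketch of the file-based authorizer entry in authorizers.xml (the DNs shown are placeholders; use the exact DNs from your certificates/LDAP):

<authorizer>
    <identifier>file-provider</identifier>
    <class>org.apache.nifi.authorization.FileAuthorizer</class>
    <property name="Authorizations File">./conf/authorizations.xml</property>
    <property name="Users File">./conf/users.xml</property>
    <property name="Initial Admin Identity">CN=admin, OU=NiFi</property>
    <property name="Node Identity 1">CN=nifi-node-1, OU=NiFi</property>
    <property name="Node Identity 2">CN=nifi-node-2, OU=NiFi</property>
</authorizer>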
10-12-2016
12:30 PM
4 Kudos
@Jobin George When you add new components (process groups or processors), they inherit the policies from the parent component by default. This means that your process group (Group1) has inherited some policies, perhaps from its parent process group, and your processor (GetSFTP) has inherited policies from the process group it is inside. My guess is that those inherited policies are allowing user "john" to view and modify process group 'Group1'. When you select a component (process group or processor) and click on the key icon to modify/set its policies, you may notice a line in the "Access Policies" UI telling you that the policies you are currently looking at are coming from a parent process group. If you modify any of these policies, what you are really doing is modifying the policies on that parent process group rather than on the actual selected component. In order to set specific policies for the selected component, you must first click on "Override". You will then see that effective-policy line go away, and the specific policy you are currently looking at will be cleared of all entries. Now you can add specific users for this policy so that they apply to only this component. If the component is a process group, any processor or additional process group within it will inherit this new policy. Keep in mind that every policy inherits from its parent by default, so clicking on "Override" only creates new policy access for that one policy. You will need to select each available policy for a component and click "Override" for each one where you want to set component-specific policy access. Thanks, Matt
10-11-2016
05:04 PM
@Saikrishna Tarapareddy Almost... NiFi stores FlowFile content in claims. A claim can contain the content of one to many FlowFiles. Claims allow NiFi to use large disks more efficiently when dealing with small content files. These claims will only be moved into the archive directory once every FlowFile associated with that claim has been auto-terminated in the dataflow(s). Also keep in mind that you can have multiple FlowFiles pointing at the same content (this happens, for example, when you connect the same relationship multiple times from a processor). Let's say you routed a success relationship twice off of an UpdateAttribute processor. NiFi does not replicate the content, but rather creates another FlowFile that points at that same content. So both of those FlowFiles now need to reach an auto-termination point before that content claim would be moved to archive. The content claims are configured in the nifi.properties file: nifi.content.claim.max.appendable.size=10 MB
nifi.content.claim.max.flow.files=100
The above are the defaults. If a file comes in at less than 10 MB in size, NiFi will try to append the next file(s) to the same claim, unless the combination of those files would exceed the 10 MB max or the claim has already reached 100 files. If a file comes in that is larger than 10 MB, it ends up in a claim all by itself. Thanks, Matt
10-11-2016
03:09 PM
1 Kudo
@Saikrishna Tarapareddy The retention settings in the nifi.properties file are for the NiFi data archive only. They do not apply to files that are active (queued or still being processed) in any of your dataflows. NiFi will allow you to continue to queue data in your dataflow all the way up to the point where your content repository disk is 100% utilized. That is why backpressure on connections throughout your dataflow is important to control the amount of FlowFiles that can be queued. It is also important to isolate the content repository from the other NiFi repositories so that if it fills the disk, it does not cause corruption of those other repositories. If content repository archiving is enabled (nifi.content.repository.archive.enabled=true), then the retention and usage-percentage settings in the nifi.properties file take effect. NiFi will archive FlowFiles once they are auto-terminated at the end of a dataflow. Data active in your dataflow will always take priority over archived data; if your dataflow should queue to the point that your content repository disk is full, the archive will be empty. The purpose of archiving data is to allow users to replay data from any point in the dataflow, or to download and examine the content of a FlowFile after it has passed through a dataflow, via the NiFi provenance UI. For many this is a valuable feature, and for others it is not so important. If it is not important for your org to archive any data, you can simply set archive enabled to false. FlowFiles that are not processed successfully within your dataflow are routed to failure relationships. As long as you do not auto-terminate any of your failure relationships, the FlowFiles remain active/queued in your dataflow. You can then build some failure-handling dataflow if you like to make sure you do not lose that data. Matt
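For reference, the relevant nifi.properties entries look like this (the retention and usage-percentage values shown are the typical defaults; adjust them to fit your disk):

nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=true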
10-10-2016
07:45 PM
1 Kudo
@Saikrishna Tarapareddy Since RAID 1 requires a minimum of 2 disks and RAID 10 requires a minimum of 4 disks, you can build either:
a. (2) RAID 10
b. (2) RAID 1 and (1) RAID 10, or
c. (4) RAID 1
My recommendation for you would be to provision your (8) 600 GB disks as follows:
- Provision your 8 disks into (4) RAID 1 configurations (2 disks each: 600 GB + 600 GB mirrored, total capacity 600 GB).
--------------
(1) RAID 1 (~600 GB capacity) with the following mounted logical volumes:
100 - 150 GB --> /var/log/nifi
100 GB --> /opt/nifi/flowfile_repo
50 GB --> /opt/nifi/database_repo
remainder --> /
(1) RAID 1 (~600 GB capacity) with the following mounted logical volume:
Entire RAID as a single logical volume --> /opt/nifi/provenance_repo
(1) RAID 1 (~600 GB capacity) with the following mounted logical volume:
Entire RAID as a single logical volume --> /opt/nifi/content_repo1
(1) RAID 1 (~600 GB capacity) with the following mounted logical volume:
Entire RAID as a single logical volume --> /opt/nifi/content_repo2
---------------
The above will give you ~1.2 TB of content_repo storage and ~600 GB of provenance history storage. If provenance history is not as important to you, you could carve off another logical volume on the first RAID 1 for your provenance_repo and allocate all (3) remaining RAID 1 arrays for content repositories.
*** Note: NiFi can be configured to use multiple content repositories in the nifi.properties file:
nifi.content.repository.directory.default=/opt/nifi/content_repo1/content_repository <-- This line exists already
nifi.content.repository.directory.repo2=/opt/nifi/content_repo2/content_repository <-- This line would be manually added.
nifi.content.repository.directory.repo3=/opt/nifi/content_repo3/content_repository <-- This line would be manually added.
*** NiFi will do file-based striping across all content repos.
Thanks, Matt
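To illustrate how the mount layout above could be realized, here is a hypothetical /etc/fstab sketch; the volume group names, logical volume names, and filesystem type are placeholders, not part of the original recommendation:

# RAID 1 "a": OS, logs, flowfile repo, database repo
/dev/vg_a/lv_nifi_logs    /var/log/nifi              ext4  defaults  0 2
/dev/vg_a/lv_flowfile     /opt/nifi/flowfile_repo    ext4  defaults  0 2
/dev/vg_a/lv_database     /opt/nifi/database_repo    ext4  defaults  0 2
# RAID 1 "b": provenance repo
/dev/vg_b/lv_provenance   /opt/nifi/provenance_repo  ext4  defaults  0 2
# RAID 1 "c" and "d": content repos
/dev/vg_c/lv_content1     /opt/nifi/content_repo1    ext4  defaults  0 2
/dev/vg_d/lv_content2     /opt/nifi/content_repo2    ext4  defaults  0 2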