About MattWho

MattWho · ‎01-30-2017

@Saminathan A The PutSQL processor expects each FlowFile to contain a single SQL statement; it does not support multiple insert statements in one FlowFile, as you have tried above. Instead, you can route the success relationship of the GetFile processor twice, with each connection going to its own ReplaceText processor. Configure one ReplaceText processor to create the "table_a" insert statement and the other to create the "table_b" insert statement. The success relationships from both ReplaceText processors can then be routed to the same PutSQL processor. Thanks, Matt
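As a rough illustration of the one-statement-per-FlowFile idea, the Python sketch below models what the two ReplaceText branches would hand to PutSQL. It is not NiFi code: the record shape and the column names (`id`, `name`) are hypothetical, and only `table_a` and `table_b` come from the thread. The `?` placeholders are used just to keep the example short.

```python
def build_insert(table, record):
    """Return exactly one INSERT statement for a single table."""
    columns = ", ".join(record)
    placeholders = ", ".join("?" for _ in record)
    return f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"


def fan_out(record, tables=("table_a", "table_b")):
    """Model the two ReplaceText branches: one statement per output FlowFile."""
    return [build_insert(table, record) for table in tables]


# Hypothetical record; each list element stands for the content of one FlowFile.
statements = fan_out({"id": 1, "name": "example"})
for sql in statements:
    print(sql)
```

Each element of `statements` holds a single statement, which mirrors the shape PutSQL expects per FlowFile.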

MattWho · ‎01-27-2017

With HDF 2.x, Ambari can be used to deploy a NiFi cluster. Let's say you deployed a 2 node cluster and want to go back at a later time and add an additional NiFi node to it. While the process is very straightforward when your NiFi cluster has been set up non-secure (http), the same is not true if your existing NiFi cluster has been secured (https). Below you will see an existing 2 node secured NiFi cluster that was installed via Ambari:

STEP 1: Add the new host through Ambari. You can skip this step if the host you want to install the additional NiFi node on is already managed by your Ambari.

STEP 2: Under "Hosts" in Ambari, click on the host from the list where you want to install the new NiFi node. The NiFi component will be in a "stopped" state after it is installed on this new host. *** DO NOT START NIFI YET ON THE NEW HOST OR IT WILL FAIL TO JOIN THE CLUSTER. ***

STEP 3: (This step only applies if NiFi's file based authorizer is being used) Before starting this new node we need to clear out some NiFi configs. This step is necessary because of how the NiFi application starts. When NiFi starts, it looks for existing users.xml and authorizations.xml files. If they do not exist, it uses the configured "Initial Admin Identity" and "Node identities (1,2,3, etc...)" to build the users.xml and authorizations.xml files. This causes a problem because your existing cluster's users.xml and authorizations.xml files likely contain many more entries by now. Any mismatch in these files will prevent a node from being able to join the cluster. If these configurations are cleared, the new node will instead obtain the users.xml and authorizations.xml files from the cluster it joins. Below shows what configs need to be cleared in NiFi: *Note: Another option is to simply copy the users.xml and authorizations.xml files from an existing cluster node to the new node before starting the new node.

STEP 4: (Do this step if using Ambari metrics) When a new node is added by Ambari and Ambari metrics are also enabled, Ambari will create a flow.xml.gz file that contains just the Ambari reporting task. Later, when this node tries to join the cluster, the flow.xml.gz files on this new node and in the cluster will not match. This mismatch will cause the new node to fail to join the cluster and shut back down. To avoid this problem, the flow.xml.gz file must be copied from one of the cluster's existing nodes to this new node.

STEP 5: Start NiFi on this new node. After the node has started, it should successfully join your existing cluster. If it fails, the nifi-app.log will explain why, but the cause will likely be that one of the above configs was not cleared out, so the users.xml and authorizations.xml files were generated rather than inherited from the cluster. If that is the case, you will need to fix the configs and delete those files manually before restarting the node again.

STEP 6: While your cluster is now up and running with the additional node, you will notice that you cannot open the UI of that new node without getting an untrusted proxy error screen. You will, however, still be able to access the UIs of your other two nodes. So we need to authorize this new node in your cluster.

A. If NiFi handles your authorizations, follow this procedure: 1. Log in to the UI of one of the original cluster nodes. The "proxy user requests" access policy is needed to allow users to access the UI of your nodes. NOTE: There may be additional component level access policies (such as "view the data" and "modify the data") you may also want to authorize this new node for.

B. If Ranger handles your NiFi authorizations, follow this procedure: 1. Access the Ranger UI: 2. Click Save to create this new user for your new node. The username MUST match exactly the DN displayed in the untrusted proxy error screen. 3. Access the NiFi Service Manager in Ranger and authorize your new node to your existing access policies as needed:

You should now have a fully functional new node added to your pre-existing secured NiFi cluster that was deployed/installed via Ambari.
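Before starting the new node, it can help to confirm that no locally generated users.xml or authorizations.xml is already sitting in its conf directory. The Python sketch below is one hedged way to check that. The helper name `leftover_authorizer_files` is hypothetical, and the location of the conf directory on your host is an assumption you would need to supply; the demo uses a temporary directory instead of a real NiFi install.

```python
import tempfile
from pathlib import Path

# File names taken from the article; their directory is installation specific.
AUTHORIZER_FILES = ("users.xml", "authorizations.xml")


def leftover_authorizer_files(conf_dir):
    """Return the authorizer files that already exist in conf_dir."""
    conf = Path(conf_dir)
    return [name for name in AUTHORIZER_FILES if (conf / name).is_file()]


# Demo against a throwaway directory standing in for the node's conf directory.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "users.xml").write_text("<tenants/>")
    found = leftover_authorizer_files(demo_dir)

print(found)
```

An empty result suggests the node has nothing to generate a mismatch from; a non-empty one points at files to delete (or to replace with copies from an existing node) before the restart.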

MattWho · ‎01-12-2017

@Raj B You can think of the "Max Bin Age" as the trump card. Regardless of whether any other minimum criteria have been met, the bin will be merged once it reaches this max age. So your assumption is completely correct. That aside, you need to take heap usage into consideration with the dataflow design you have here. FlowFile attributes (metadata) live in heap memory for performance reasons. So as you are binning these FlowFiles throughout the day, your JVM heap usage is going to grow and grow. How many FlowFiles per day are you talking about here? If you are talking in excess of 10,000 FlowFiles, you may need to adjust your dataflow some. For example, use two MergeContent processors back to back. The first merges with, let's say, a max bin age of 5 minutes. The second then merges those bundles into a large 24 hour bundle. So 1 new FlowFile is created every 5 minutes, and those 288 merged FlowFiles are then merged into one larger FlowFile by the second MergeContent. Doing it this way greatly reduces heap usage. Of course, depending on volumes, you may need to merge even more often than every 5 minutes to achieve optimal heap usage. Just some food for thought... Matt
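The arithmetic behind the two-stage suggestion can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the daily volume of 100,000 FlowFiles is a made-up figure, and an even arrival rate across the day is assumed.

```python
def two_stage_merge(flowfiles_per_day, first_stage_minutes=5):
    """Return (bins per day, FlowFiles per first-stage bin), assuming even arrival."""
    minutes_per_day = 24 * 60
    bins_per_day = minutes_per_day // first_stage_minutes
    # Ceiling division: the busiest bin holds at most this many FlowFiles.
    per_bin = -(-flowfiles_per_day // bins_per_day)
    return bins_per_day, per_bin


# Hypothetical volume of 100,000 FlowFiles per day.
bins, per_bin = two_stage_merge(100_000)
print(bins, per_bin)  # 288 348
```

Under these assumptions, the first MergeContent holds a few hundred FlowFiles at a time and the second holds at most 288 merged bundles, instead of a single processor binning all 100,000 until the 24 hour mark.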

MattWho · ‎01-12-2017

@Raj B If you have FlowFiles arriving via multiple input ports and then passing through some common set of components downstream from them, there is no way to tell by looking at a FlowFile in a given queue which input port it originated from. Input and output ports within a process group do not create provenance events either, since they do not modify the FlowFiles in any way. The only way an input port or output port would generate a provenance event is if it was on the root canvas level, since input ports would generate a "create" event and output ports would generate a "drop" event. Provenance will show a lineage for a FlowFile, which includes any processor that routed or modified the FlowFile in some way. So by looking at the details of the various events in the provenance lineage graph, you can see where the FlowFile traversed through your flow. However, as I stated, not all processors create provenance events. When you query provenance, you can access the lineage for any of the query results by clicking the show lineage icon: A lineage graph for the specific FlowFile will then be created and displayed: The red dot shows the event the lineage was calculated from. Every circle is another event in this particular FlowFile's life. You can right click on any of the events to view the details of the event, including which specific processor in your flow produced that event. Thanks, Matt

MattWho · ‎01-10-2017

@Raj B Process groups can be nested inside process groups, and with the granular access controls NiFi provides, it may not be desirable for every user who has access to the NiFi UI to be able to access all processors or the specific data those processors are using. So in addition to your valid example above, you may want to create stove pipe dataflows based off different input ports, where only specific users are allowed to view and modify the stove pipe dataflow they are responsible for. While you can of course have FlowFiles from multiple upstream sources feed into a single input port and then use a routing type processor to split them back out to different dataflows, it can be easier just to have multiple input ports to achieve the same effect with less configuration. Matt

MattWho · ‎01-09-2017

@Narasimma varman Is your NiFi a single instance or a NiFi cluster? If it is a cluster, keep in mind that by default the GetFile processor will be running on every node in that cluster. Validation will also run on every node, so make sure the directory exists on all nodes. Also make sure you have specified only a directory path in the "Input Directory" property in GetFile. In your case, you should have only "/root/example" for that property. The filename you wish to pick up should be specified in the "File Filter" property. Matt
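To make the directory/filename split concrete, here is a small Python sketch that separates a full path into the two GetFile property values. The file name `data.csv` is hypothetical (the thread only gives the directory `/root/example`), and the name is escaped because "File Filter" takes a regular expression.

```python
import posixpath
import re


def getfile_properties(full_path):
    """Split a full path into GetFile's directory and file filter values."""
    directory, filename = posixpath.split(full_path)
    # Escape the name so characters such as "." are matched literally.
    return {"Input Directory": directory, "File Filter": re.escape(filename)}


# Hypothetical file inside the directory mentioned in the thread.
props = getfile_properties("/root/example/data.csv")
for name, value in props.items():
    print(f"{name}: {value}")
```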

MattWho · ‎01-04-2017

@bala krishnan 1. "Concurrent tasks" is nothing new to NiFi. There currently is no capability to set concurrency at the process group level, and I am not sure that would be a good idea. I would assume you are looking for a way to set a number of "concurrent tasks" that would then get applied to every processor within a process group? Some processors involve tasks that are more CPU intensive than others. For example, the CompressContent processor is CPU intensive. For every concurrent task it is assigned, 100% of a CPU core is consumed for each file it compresses/decompresses. Adding too many "concurrent tasks" here could have a serious impact on the system hosting NiFi. The UpdateAttribute processor, on the other hand, typically has very little CPU impact. One concurrent task here can process batches of FlowFiles very rapidly, so multiple concurrent tasks are usually unnecessary and a waste of server resources.

2. There is no defined algorithm for how many concurrent tasks a processor should receive out of the gate. Concurrent task assignment is done through testing and fine tuning a dataflow using production data samples at production volumes. Evaluating your dataflow for bottlenecks in combination with tracking system resource loads (CPU, memory, network and disk I/O) can help tune concurrent task settings appropriately. It is too often the case that users start off by assigning a high number of concurrent tasks rather than starting at the bottom. You have to remember that your system has only so much CPU to share. Assigning too many concurrent tasks to a single processor will hinder other processors that are looking for CPU time. Along with setting "concurrent tasks" on individual processors, there are global maximum timer driven and event driven thread settings in NiFi (defaults are 10 and 5 respectively). These control the maximum number of threads NiFi will request from the server to fulfill the concurrent task requests from the NiFi processor components. These global values can be adjusted in "controller settings" (located via the hamburger menu in the upper right corner of the NiFi UI). Typical settings here are double to quadruple the number of CPU cores you have on your server. Giving excessive values here does not improve performance, as those threads just spend more time in CPU wait. Thanks, Matt
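The "double to quadruple the number of CPU cores" rule of thumb is easy to turn into numbers. The Python sketch below does so for whatever machine it runs on; note that `os.cpu_count()` reports logical cores of the local host, which is an assumption about where NiFi runs, and the helper name `suggested_thread_range` is hypothetical.

```python
import os


def suggested_thread_range(cpu_cores):
    """Return (low, high) thread counts at 2x and 4x the core count."""
    return 2 * cpu_cores, 4 * cpu_cores


# os.cpu_count() can return None, so fall back to 1 in that case.
cores = os.cpu_count() or 1
low, high = suggested_thread_range(cores)
print(f"{cores} cores: consider between {low} and {high} threads")
```

For an 8 core server this gives a range of 16 to 32 threads, which is a starting point for testing rather than a guaranteed optimum.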

MattWho · ‎12-21-2016

@Sunile Manjee Also keep in mind that NiFi content archiving is enabled by default, with archived content being removed/purged after a retention period of 12 hours or once disk utilization reaches 50%. Manually purging FlowFiles within your dataflow will not trigger the deletion of archived content.

MattWho · ‎12-14-2016

@Sunile Manjee FlowFile content is stored in claims inside the content repo. Each claim can contain the content from 1 or more FlowFiles. A claim will not be moved to the content archive or purged from the content repository until all active FlowFiles in your dataflow that reference any of the content in that claim have been removed. Those FlowFiles can be removed via manual purging of the queues (Empty Queue), FlowFile expiration on a connection, or auto-termination at the end of a dataflow. The FlowFile count and size reported in the UI do not reflect the size of the claims in the content repo. Those stats report the size and number of active FlowFiles queued in your flow. It is very likely and usual for the size reported in the UI to differ from actual disk usage. Thanks, Matt
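The claim behaviour described here can be pictured as reference counting. The Python sketch below is a toy model only, not NiFi's actual implementation; the class name `ContentClaim` and the identifiers `ff-1` and `ff-2` are made up for illustration.

```python
class ContentClaim:
    """Toy model of a claim holding content for one or more FlowFiles."""

    def __init__(self, claim_id):
        self.claim_id = claim_id
        self.referencing_flowfiles = set()

    def add_reference(self, flowfile_id):
        self.referencing_flowfiles.add(flowfile_id)

    def remove_reference(self, flowfile_id):
        # Models Empty Queue, expiration, or auto-termination of a FlowFile.
        self.referencing_flowfiles.discard(flowfile_id)

    def eligible_for_archive(self):
        """A claim can be archived or purged only when nothing references it."""
        return not self.referencing_flowfiles


claim = ContentClaim("claim-1")
claim.add_reference("ff-1")
claim.add_reference("ff-2")

claim.remove_reference("ff-1")
still_held = not claim.eligible_for_archive()  # ff-2 still pins the claim

claim.remove_reference("ff-2")
released = claim.eligible_for_archive()
print(still_held, released)
```

In this model, removing one small FlowFile frees no disk space while another FlowFile still points at the same claim, which is one way the UI totals and actual disk usage can drift apart.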

MattWho · ‎12-06-2016

@kumar The default FlowFile attributes include: entryDate, lineageStartDate, fileSize, filename, path, and uuid. The above FlowFile attribute key names are case sensitive. Thanks, Matt

Last Visited	‎07-08-2026 11:18 AM

Member Since	‎07-30-2019 10:41 AM
Posts	3,472
Kudos received	1638

Cloudera Community

Re: ListenNetFlow processor does not decode Cisco ...

Re: Can we detect who did a particular operation i...

Re: How to invoke a url in nifi which is protected...

Re: Retry impacts scheduler

Re: 503 error while copying/versioning big process...

Re: How to execute multiple SQL query in NIFI PUT ...

HDF 2.x - Adding a new NiFi Node to an existing se...

Re: NiFi MergeContent behavior when Correlation At...

Re: Couple of questions on processor group with mu...

Re: What's the purpose of multiple input and outpu...

Re: How to load data from local system file to HDF...

Re: How to improve nifi concurrency

Re: Is content purged when flow files are deleted?

Re: Is content purged when flow files are deleted?

Re: NiFi: Capture filename and filesize