About MattWho

MattWho · ‎05-03-2017

@Sertac Kaya FlowFiles are transferred in batches between process groups, but that transfer amounts to updating FlowFile records. This transfer should take fractions of a ms to complete, so many threads should be able to execute per second. This raises the question of whether your flow is thread starved, concurrent tasks have been over-allocated across your processors, your NiFi max timer driven thread count is too low, or your disk IO is very high. I would start by looking at your "Max Timer Driven Thread Count" setting. The default is only 10. By default, every component you add to the NiFi canvas uses timer driven threads. The above count restricts how many system threads can be allocated to components at any one time. I set up a simple 4 CPU VM running a default configuration. The number of FlowFiles passed through the connection between process group 1 and process group 2 ranged from 7,084/second to 12,200/second. Thanks, Matt
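As a rough way to reason about the thread-starvation question above, the sketch below adds up the concurrent tasks configured across a flow and compares the total with the "Max Timer Driven Thread Count". The function name, the processor names, and the task counts are all made up for illustration; this is not a NiFi API. A total above the pool size is only a hint worth investigating, since components normally hold a thread for a very short time.

```python
def thread_pool_report(concurrent_tasks, max_timer_driven_threads=10):
    """Compare requested concurrent tasks with the timer driven thread pool size.

    concurrent_tasks: dict mapping a component name to its configured
    concurrent tasks (hypothetical input, collected by hand from the canvas).
    """
    requested = sum(concurrent_tasks.values())
    return {
        "requested": requested,
        "available": max_timer_driven_threads,
        "oversubscribed": requested > max_timer_driven_threads,
        "ratio": requested / max_timer_driven_threads,
    }


# Hypothetical flow: names and numbers are assumptions, not from the post.
flow = {"ListenHTTP": 4, "UpdateAttribute": 8, "PutFile": 2}
report = thread_pool_report(flow)
print(report)
```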

MattWho · ‎05-02-2017

@Sertac Kaya A few questions come to mind:

1. What kind of processor is feeding the connection with the large queue inside ExampleA?
2. How large is that queue?

I ask because NiFi uses swapping to limit the JVM heap used by queued FlowFiles. How swapping is handled is configured in the nifi.properties file:

nifi.queue.swap.threshold=20000
nifi.swap.in.period=5 sec
nifi.swap.in.threads=1
nifi.swap.out.period=5 sec
nifi.swap.out.threads=4

The above shows the NiFi defaults. A few options to improve performance:

1. Set back pressure thresholds on your connections to limit the number of FlowFiles that can queue at any time. Setting the value lower than the swap threshold will prevent swapping from occurring on that connection. Newer versions of NiFi set the FlowFile object threshold on newly created connections to 10,000 by default. Swapping is per connection, not per NiFi instance.
2. Raise the swap.threshold value to prevent swapping. Keep in mind that any FlowFiles not being swapped are held in JVM heap memory. Setting this value too high may result in Out Of Memory (OOM) errors. Make sure you adjust the heap settings for your NiFi in the bootstrap.conf file.
3. Adjust the number of swap-in and swap-out threads.

Thanks, Matt
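The relationship between the back pressure object threshold and the swap threshold from option 1 can be sketched as a small check. The helper names below are hypothetical, and the comparison assumes the back pressure threshold is honored strictly, which a real flow may not guarantee.

```python
SAMPLE_PROPERTIES = """
nifi.queue.swap.threshold=20000
nifi.swap.in.period=5 sec
nifi.swap.in.threads=1
nifi.swap.out.period=5 sec
nifi.swap.out.threads=4
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def may_swap(backpressure_object_threshold, swap_threshold):
    """Return True when a connection's queue could grow to the swap threshold.

    Assumption: the queue never grows past the back pressure threshold.
    """
    return backpressure_object_threshold >= swap_threshold


props = parse_properties(SAMPLE_PROPERTIES)
swap_threshold = int(props["nifi.queue.swap.threshold"])
print(may_swap(10000, swap_threshold))  # back pressure below the swap threshold
print(may_swap(50000, swap_threshold))  # back pressure above the swap threshold
```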

MattWho · ‎04-27-2017

@Anishkumar Valsalam In a NiFi cluster you need to make sure you have uploaded your new custom component nars to every node in the cluster. I do not recommend adding your custom nars directly to the existing NiFi lib directory. While this works, it can become annoying to manage when you upgrade NiFi versions. NiFi allows you to specify an additional lib directory where you can place your custom nars. Then if you upgrade, the new version can simply be pointed at this existing additional lib directory.

Adding additional lib directories to your NiFi is as simple as adding an additional property to the nifi.properties file. For example:

nifi.nar.library.directory.lib1=/nars-custom/lib1
nifi.nar.library.directory.lib2=/nars-custom/lib2

Note: Each suffix must be unique (i.e. lib1 and lib2 in the above examples). These lib directories must be accessible by the user running your NiFi instance.

Thanks, Matt
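The two requirements in the note above (unique property names and directories accessible to the NiFi user) can be checked before a restart with a short script. This is a minimal sketch: the function name is hypothetical, and it has to be run as the same OS user that runs NiFi for the access check to mean anything.

```python
import os
import re

# Matches only the additional directories, i.e. keys with a suffix such as ".lib1".
NAR_DIR_KEY = re.compile(
    r"^nifi\.nar\.library\.directory\.(?P<suffix>[^=\s]+)\s*=\s*(?P<path>.+)$"
)


def check_nar_directories(properties_text):
    """Return (suffix -> path mapping, list of problems found)."""
    seen = {}
    problems = []
    for raw in properties_text.splitlines():
        match = NAR_DIR_KEY.match(raw.strip())
        if not match:
            continue
        suffix = match.group("suffix")
        path = match.group("path").strip()
        if suffix in seen:
            problems.append(f"duplicate suffix '{suffix}'")
        seen[suffix] = path
        if not os.path.isdir(path):
            problems.append(f"{path} is not a directory")
        elif not os.access(path, os.R_OK | os.X_OK):
            problems.append(f"{path} is not readable by the current user")
    return seen, problems


example = """
nifi.nar.library.directory.lib1=/nars-custom/lib1
nifi.nar.library.directory.lib2=/nars-custom/lib2
"""
directories, issues = check_nar_directories(example)
print(directories)
print(issues)
```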

MattWho · ‎04-25-2017

@Simon Jespersen Posted answer to above question here: https://community.hortonworks.com/questions/98384/listsftp-failed-to-obtain-connection-to-remote-hos.html

MattWho · ‎04-21-2017

Note: This article was written as of the HDF 2.1.2 release, which is based on Apache NiFi 1.1.0.

NiFi has several processors that can be used to retrieve files from an SFTP server. It is important to understand the different capabilities each provides so you know when you should be using one vs. another. Let's start with the oldest of the available processors:

GetSFTP

The GetSFTP processor is the original processor introduced for retrieving files from a remote SFTP server. The processor has the following configurable properties:

Things to know about this processor:

1. Disadvantage: This processor does not retain any state. This means it does not keep track of which files it has previously retrieved. So if the property "Delete Original" is set to "false", this processor will continue to retrieve the same file over and over again.
2. This processor is not cluster friendly, meaning in a NiFi cluster it should be set to run on "primary node only" so that every node in the cluster is not competing to pull the same data.
3. In a NiFi cluster, the data retrieved by the GetSFTP processor should be redistributed across all nodes before further processing is done. This spreads the workload across all nodes so the "primary node" is not doing all the work.

The above sample shows a "Remote Process Group" being used to redistribute the data from the GetSFTP processor to all nodes within the cluster.

*** As of Apache NiFi 1.8 release a new capability has been added to NiFi to facilitate easy redistribution of FlowFiles without needing to use a RPG. It can be done by simply configuring any connection to perform the redistribution. This blog explains that new capability very well: https://blogs.apache.org/nifi/entry/load-balancing-across-the-cluster ***

Disadvantage: All data content is pulled into the primary node before being distributed across the cluster.
Disadvantage: Since the GetSFTP processor needs to delete the source file in order to prevent continuous consumption, the data is unavailable to other users/servers.

Note: This processor has been deprecated in favor of the newer ListSFTP and FetchSFTP processors. It still exists to maintain backwards compatibility for NiFi users.

-----------------------------------------------------

Now let's talk about the ListSFTP and FetchSFTP processors and which of the disadvantages above they solve.

The ListSFTP processor is designed to connect to an SFTP server just like GetSFTP does; however, it does not actually retrieve the data. Instead it creates a 0 byte FlowFile for every file it lists from the SFTP server. The FetchSFTP processor takes these 0 byte FlowFiles as input, retrieves the associated data, and inserts it into the FlowFile content at that time. I know it sounds like we just replicated what the GetSFTP processor does but split it between two processors, but there are key advantages to doing it this way.

1. The ListSFTP processor does maintain state across a NiFi cluster. So if you do not delete the source data, this processor will not pick up the same data multiple times like the GetSFTP processor will.
2. While the ListSFTP processor is still not cluster friendly, meaning it should be run on "primary node only", the FetchSFTP processor is cluster friendly. The ListSFTP processor should be used to create the 0 byte FlowFiles, and a Remote Process Group should then be used to distribute these FlowFiles across your cluster. The FetchSFTP processor is then used on every node to retrieve the actual FlowFile content from the SFTP server.

*** As of Apache NiFi 1.8 release a new capability has been added to NiFi to facilitate easy redistribution of FlowFiles without needing to use a RPG. It can be done by simply configuring any connection to perform the redistribution.
This blog explains that new capability very well: https://blogs.apache.org/nifi/entry/load-balancing-across-the-cluster ***

Advantage: The primary node no longer uses excess resources writing all content to its content repository before redistributing the FlowFiles to all nodes in the cluster.

Advantage: Cluster-wide state allows the primary node to change within your NiFi cluster, and the ListSFTP processor will still not list the same files twice.

Advantage: Being able to leave files on the SFTP server allows that data to be consumed by other end users/systems.

Disadvantage: Using an RPG to redistribute the ListSFTP-generated FlowFiles can be annoying, since the remote input port the RPG sends to must exist at the root canvas level. So if the flow is nested down in a sub-process group, you must build a flow that feeds the load-balanced FlowFiles back down into that sub-process group.

---------------------------------------

You will find several other examples within NiFi where processors have been deprecated in favor of newer list/fetch based processors.

Thank you, Matt
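To make the list/fetch split concrete, here is a toy model in Python. It is not how ListSFTP or FetchSFTP are implemented: the class and function names are invented, the "server" is a plain dict, and the state is simply the newest modification time already listed. It only illustrates why a stateful lister can leave files on the server without listing them twice.

```python
from dataclasses import dataclass


@dataclass
class RemoteFile:
    path: str
    modified: int  # modification time, e.g. epoch seconds


@dataclass
class Lister:
    """Toy stand-in for a stateful list step (hypothetical, not NiFi code)."""
    last_seen: int = -1  # retained state: newest modification time already listed

    def list_new(self, remote_files):
        # Only files modified after the retained timestamp are listed.
        fresh = [f for f in remote_files if f.modified > self.last_seen]
        if fresh:
            self.last_seen = max(f.modified for f in fresh)
        return [f.path for f in sorted(fresh, key=lambda f: f.modified)]


def fetch(path, server):
    """Toy fetch step: look up the content for a previously listed path."""
    return server[path]


server = {"/in/a.csv": b"a-data", "/in/b.csv": b"b-data"}
remote = [RemoteFile("/in/a.csv", 100), RemoteFile("/in/b.csv", 200)]

lister = Lister()
first_pass = lister.list_new(remote)   # both files are new
second_pass = lister.list_new(remote)  # nothing new, files were not deleted
contents = [fetch(p, server) for p in first_pass]
print(first_pass, second_pass, contents)
```

One limitation of this simplification: a file that arrives later with a modification time equal to or older than the retained timestamp would be skipped.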

MattWho · ‎04-06-2017

@Ahmad Debbas The GetHDFS processor is deprecated in favor of using the ListHDFS and FetchHDFS processors. The GetHDFS processor does not retain state and therefore will start over from the beginning, as you noted, when an error occurs. The ListHDFS processor does maintain state, so even through NiFi restarts or processor restarts, the listing picks up where it left off. The zero byte FlowFiles produced are then passed to a FetchHDFS processor that actually retrieves the content and inserts it into the existing FlowFile. Another advantage of the list/fetch design model is the ability to distribute those listed zero byte FlowFiles across a NiFi cluster before fetching the content. This improves performance by reducing the resource strain caused by GetHDFS on a single NiFi node. Thanks, Matt
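The idea of spreading the listed zero byte FlowFiles across the cluster before fetching can be illustrated with a simple round-robin assignment. This is only a sketch of the concept; the node names and paths are made up, and NiFi's own distribution mechanisms may assign work differently.

```python
from itertools import cycle


def distribute_round_robin(listing, nodes):
    """Assign each listed path to a node in turn (illustrative only)."""
    assignments = {node: [] for node in nodes}
    for path, node in zip(listing, cycle(nodes)):
        assignments[node].append(path)
    return assignments


# Hypothetical listing output and cluster nodes.
listed = ["/data/f1", "/data/f2", "/data/f3", "/data/f4", "/data/f5"]
plan = distribute_round_robin(listed, ["node1", "node2", "node3"])
print(plan)
```

Each node would then fetch only the content for the paths assigned to it, so no single node pulls all of the data.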

MattWho · ‎03-29-2017

@Bram Klinkenberg The "Roles" noted above are only valid for use in the older Apache NiFi 0.x baseline. They were part of the authorized-users.xml file used in that baseline. The Apache NiFi 1.x baseline added support for multi-tenancy and granular access control via access policies. It is an entirely new authorization method and uses different files. There is no notion of Roles in NiFi 1.x. The authorizers.xml file allows you to specify a legacy authorized-users.xml file in place of configuring an "Initial Admin Identity", simply to make it easy for users of NiFi 0.x to port their existing users over to NiFi 1.x. Matt

MattWho · ‎03-29-2017

@Bram Klinkenberg The users.xml and authorizations.xml files are generated for you the first time NiFi is started after being secured. Initially they are populated using the configuration from the authorizers.xml file. In that file you specified an "Initial Admin Identity" (assuming you used CN=admin). As a result, a user (CN=admin) was added to the users.xml file and the relevant "admin" related access policies were assigned to that user in the authorizations.xml file. At this point your user (CN=admin) should be able to access the NiFi UI.

The admin will use the NiFi UI to add additional users and authorize them for various access policies:

Users are managed and Global Policies are applied as follows:

Adding "Users" within NiFi has nothing to do with user authentication. The users you add here are for authorization to NiFi resources only. User authentication must occur first and can be accomplished using user-issued certificates loaded in the browser, Kerberos, or LDAP.

Global access policies include the following:

Component (processors, process groups, and other things on the canvas) level access policies are assigned to users as follows:

Component level access policies include:

Some component level access policies are only available to specific components. If the currently selected component does not support the policy, it will be greyed out in the list.

More detail on the various access policies can be found in the admin guide: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#config-users-access-policies

Thank you, Matt

MattWho · ‎03-23-2017

@Simon Jespersen The "Host Key File" property is used to specify the file containing your trusted hosts (commonly named known_hosts and found by default in the .ssh directory). It is not the key you are using to connect with. This property works in conjunction with the "Strict Host Key Checking" property when it is set to "true". You are getting a "key does not exist" error because this property is unrelated to "Private Key Path", so NiFi is looking in its base default directory for this file. I believe what you are trying to do has nothing to do with host key checking. You want to configure the "Private Key Path" as the full path to your private key, including the key name itself. Thank you, Matt
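The distinction between the two properties can be expressed as a small pre-flight check run on the NiFi host. The function below is a hypothetical helper, not part of NiFi, and the paths in the usage line are placeholders; it only verifies that the private key is given as a full path to an existing file and that the host key file, if set, is a different existing file.

```python
import os


def check_sftp_key_settings(private_key_path, host_key_file=None):
    """Return a list of problems with the two key-related settings."""
    problems = []
    if not os.path.isabs(private_key_path):
        problems.append(
            "Private Key Path should be a full path, including the key file name"
        )
    elif not os.path.isfile(private_key_path):
        problems.append("Private Key Path does not point to an existing file")
    if host_key_file:
        if not os.path.isfile(host_key_file):
            problems.append("Host Key File does not exist")
        elif os.path.abspath(host_key_file) == os.path.abspath(private_key_path):
            problems.append(
                "Host Key File points at the private key; "
                "it should be a known_hosts style file"
            )
    return problems


# Placeholder paths: substitute the values from your own processor config.
print(check_sftp_key_settings("/home/nifi/.ssh/id_rsa", "/home/nifi/.ssh/known_hosts"))
```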

MattWho · ‎03-16-2017

@mayki wogno How are you executing your provenance query? Are you selecting "Data Provenance" from the hamburger menu in the upper right of the UI, or are you selecting "Data Provenance" from the context menu that appears when you right-click on your ListHDFS processor? The hamburger menu option performs a global provenance query of all your dataflows by default unless you add a filter. Triggering a provenance query through a specific component's context menu will add a filter based upon that component's UUID. Thanks, Matt

Last Visited	‎07-08-2026 06:09 PM

Member Since	‎07-30-2019 10:41 AM
Posts	3,472
Kudos received	1638
