Member since: 07-30-2019 · Posts: 3400 · Kudos Received: 1621 · Solutions: 1002
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 188 | 12-03-2025 10:21 AM |
|  | 508 | 11-05-2025 11:01 AM |
|  | 381 | 11-05-2025 08:01 AM |
|  | 649 | 11-04-2025 10:16 AM |
|  | 770 | 10-20-2025 06:29 AM |
04-06-2017
12:30 PM
@Ahmad Debbas The GetHDFS processor is deprecated in favor of the ListHDFS and FetchHDFS processors. GetHDFS does not retain state, so as you noted it starts over from the beginning when an error occurs. ListHDFS does maintain state, so even through NiFi restarts or processor restarts the listing picks up where it left off. The zero-byte FlowFiles it produces are then passed to a FetchHDFS processor, which actually retrieves the content and inserts it into the existing FlowFile. Another advantage of the list/fetch design model is the ability to distribute those listed zero-byte FlowFiles across a NiFi cluster before fetching the content. This improves performance by reducing the resource strain GetHDFS places on a single NiFi node. Thanks, Matt
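The value of stateful listing can be sketched conceptually. This is a rough Python analogy only, not NiFi's actual implementation; the function and state keys are invented for illustration (NiFi persists ListHDFS state via its state manager):

```python
# Conceptual analogy: a stateful lister remembers the latest
# modification time it has emitted, so a restart resumes rather
# than re-listing everything (the core ListHDFS behavior).

def list_new_files(files, state):
    """Return names of files modified after the last recorded
    timestamp, then advance the stored timestamp (the 'state')."""
    last_seen = state.get("last_mtime", 0)
    new = [(name, mtime) for name, mtime in files if mtime > last_seen]
    if new:
        state["last_mtime"] = max(mtime for _, mtime in new)
    return [name for name, _ in new]

state = {}  # in NiFi this survives restarts; here it is just a dict
batch1 = [("a.txt", 100), ("b.txt", 200)]
print(list_new_files(batch1, state))  # both files are new

# A later listing with the same state does not re-emit old files:
batch2 = [("a.txt", 100), ("b.txt", 200), ("c.txt", 300)]
print(list_new_files(batch2, state))  # only c.txt is new
```

A stateless GetHDFS, by contrast, would behave like calling this function with an empty `state` every time: every file re-appears after every restart.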
04-04-2017
01:33 PM
4 Kudos
@Pushkara Ravindra The intent of the Site-To-Site (S2S) protocol is to allow the exchange of NiFi FlowFiles between NiFi instances. A NiFi FlowFile consists of two parts:

1. FlowFile content: the original content in whatever format (NiFi is data agnostic and has no data format dependency)
2. FlowFile attributes: a collection of key/value pairs (some are assigned by NiFi by default, while others are added via processors)

Sending FlowFiles between NiFi instances allows the originating NiFi to share the attributes it knows about a FlowFile's content with the target NiFi instance. The FlowFile attributes are loaded into the FlowFile repository of the target NiFi automatically. In addition, S2S provides automatic, smart load-balancing of FlowFiles to a target NiFi cluster, and allows the target NiFi cluster to scale up or down without the client needing to change anything.

How it all works: the source/client NiFi instance/cluster adds a Remote Process Group (RPG) to its canvas and configures it to point at the URL of any target/destination NiFi instance or cluster node. The communication at this point is over the HTTP protocol. Once a connection is established, the destination NiFi sends S2S details back to the source NiFi (including the URLs of the nodes if the destination is a cluster, and the current load of each node). The RPG continuously updates this information and stores a local copy in case it cannot get an update at some point. Input and output ports send or receive FlowFiles from the parent process group in which they were added, so when input or output ports are added at the root canvas level of a dataflow they become "remote" input and output ports capable of sending or receiving data from another NiFi. Whether you set the S2S protocol to HTTP or RAW, all of the above is true. What differs is what happens next (the actual FlowFile transfer).
When using the RAW transport (socket-based transfer), the "nifi.remote.input.host" and "nifi.remote.input.socket.port" values configured on each of the target NiFi instances are used by the NiFi client as the destination for sending FlowFiles. When using the HTTP transport, the "nifi.remote.input.host" and the "nifi.web.http.port" or "nifi.web.https.port" values are used instead. The advantage of RAW is that there is a dedicated port for all S2S transfers, so under high load its effect on the NiFi HTTP interface is minimal. The advantage of HTTP is that you do not need to open an additional S2S port, since the same HTTP/HTTPS port is used to transfer FlowFiles. Thanks, Matt
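As a rough illustration, the relevant nifi.properties entries on each destination node might look like the following. The property names are the ones discussed above; the host and port values are examples only:

```properties
# RAW transport: a dedicated socket is opened for S2S FlowFile transfer
nifi.remote.input.host=nifi-node1.example.com
nifi.remote.input.socket.port=10443

# HTTP transport: S2S reuses the normal web port instead
nifi.web.http.port=8080
# or, when the instance is secured:
# nifi.web.https.port=9443
```

The client's RPG learns these values from the destination during the initial HTTP handshake, so they only need to be set on the receiving side.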
03-30-2017
12:07 PM
1 Kudo
@Praveen Singh You could install sshpass, which would allow you to use a password in the ssh connection, but I strongly recommend against this approach. It requires you to put your password in plaintext in your NiFi processor configuration, which exposes it to anyone who has access to view that component. Thanks, Matt
03-29-2017
12:57 PM
@Bram Klinkenberg The "Roles" noted above are only valid for use in the older Apache NiFi 0.x baseline; they were part of the authorized-users.xml file used in that baseline. The Apache NiFi 1.x baseline added support for multi-tenancy and granular access control via access policies. It is an entirely new authorization method and uses different files; there is no notion of roles in NiFi 1.x. The authorizers.xml file allows you to specify a legacy authorized-users.xml file in place of configuring an "Initial Admin Identity" simply to make it easy for users of NiFi 0.x to port their existing users over to NiFi 1.x. Matt
03-29-2017
12:41 PM
1 Kudo
@Bram Klinkenberg The users.xml and authorizations.xml files are generated for you the first time NiFi is started after being secured. Initially they are populated using the configuration from the authorizers.xml file. In that file you specified an "Initial Admin Identity" (assuming you used CN=admin). As a result, a user (CN=admin) was added to the users.xml file, and the relevant "admin" access policies were assigned to that user in the authorizations.xml file. At this point your user (CN=admin) should be able to access the NiFi UI.

The admin can then use the NiFi UI to add additional users and authorize them for various access policies. Keep in mind that adding "Users" within NiFi has nothing to do with user authentication; the users you add here are for authorization to NiFi resources only. User authentication must occur first and can be accomplished using user-issued certificates loaded in the browser, Kerberos, or LDAP.

Access policies exist at two levels: global access policies and component-level access policies (on processors, process groups, and other things on the canvas). Some component-level access policies are only available to specific components; if the currently selected component does not support a policy, it will be greyed out in the list. More detail on the various access policies can be found in the admin guide: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#config-users-access-policies Thank you, Matt
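For reference, a freshly generated users.xml entry for the initial admin looks roughly like this. This is an illustrative sketch only: the identifier shown is a made-up UUID, and the exact structure can vary between NiFi 1.x versions:

```xml
<!-- users.xml sketch: the Initial Admin Identity recorded as a user.
     NiFi generates a random UUID identifier for each user. -->
<tenants>
  <groups/>
  <users>
    <user identifier="11111111-2222-3333-4444-555555555555"
          identity="CN=admin"/>
  </users>
</tenants>
```

The authorizations.xml file then references that same identifier when it grants the admin-level policies, which is why editing these files by hand is error-prone and the UI is the recommended way to manage users and policies.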
03-28-2017
06:43 PM
@Emmanouil Petsanis Was there some error or condition that occurred prior to this issue (out of disk space, repo corruption, etc.)? Do you run into the same issue if you switch to a newer release version?
The latest HDF release is HDF 2.1.2 http://docs.hortonworks.com/HDPDocuments/HDF2/HDF-2.1.2/index.html Thanks, Matt
03-28-2017
06:33 PM
@Bram Klinkenberg Glad to hear it is resolved. If this answer provided what you needed to resolve your issue, please accept the answer. Thank you,
Matt
03-28-2017
04:23 PM
@Anishkumar Valsalam No problem, everyone starts somewhere. Keep in mind that in a cluster every node runs the same dataflow. Node 1 has no idea what node 2 is doing and vice versa, so by default all 3 nodes in your cluster run the above dataflow, and each may perform slightly differently. When looking at the UI of any one node in your NiFi cluster, the stats shown are the cumulative stats for all nodes in your cluster. You should not assume the numbers will always divide evenly between your connected nodes.

When you make a request in the UI, that request must be replicated to all nodes in your cluster. So imagine a request to start or stop your GenerateFlowFile processor: that processor may be started or stopped at slightly different moments in time on each node. Considering the rate at which it produces your small test files, I would not expect the numbers to be the same. In addition, other very small differences can affect each node differently: what other processes, services, or OS-level tasks happen to run on one node and not another, which node is the cluster coordinator (it does extra work), and so on. While in the big picture the impact on overall performance is negligible, with this simple flow you can see some differences.

You can right-click on a processor and select "Status History" to open a graph that shows various stats per node. The different stats are in a pull-down menu in the upper right corner of the Status History window. The blue line shows cumulative values (the same as what is shown on the processor), and there is a different colored line for each node.

Some suggestions for using this forum: 1. Try to keep one question per post; you tend to get better responses that way. (This question is related, so you are good there.) 2. If an answer gives you what you were looking for, accept that answer so it benefits others using this forum. Thank you,
Matt
03-28-2017
03:27 PM
4 Kudos
@Nikhil Chaudhary Encryption of values in the NiFi variable registry is not available yet; it is a future goal in Apache NiFi. There is an existing Apache Jira that tracks adding this capability: https://issues.apache.org/jira/browse/NIFI-2653 Thanks, Matt
03-28-2017
03:24 PM
@Anishkumar Valsalam