Member since: 07-30-2019
Posts: 3161
Kudos Received: 1569
Solutions: 915
04-25-2017
09:22 PM
2 Kudos
@Simon Jespersen
If you cannot get this to work outside of NiFi, it is not going to work inside of NiFi either. But looking over your statement above, I see a couple of things:

1. You are trying to use a "ppk" file. This is a PuTTY private key, which is not going to be accepted here for SFTP. You should be using a private key in PEM format.

2. SSH is very particular about the permissions set on private keys. SSH will reject the key if the permissions are too open. Once you have your PEM key, make a copy of it for your NiFi application, make sure that copy is owned by the user running NiFi, and set the permissions on the private key to 600:

nifi.root 770 (-rwxrwx---) will not be accepted by SSH
nifi.root 600 (-rw-------) will be accepted

You cannot grant groups access to your private key.

Thanks, Matt
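Since the advice above is to verify the key outside of NiFi first, here is a minimal sketch of such a check using the third-party paramiko library. The hostname, username, and key path are placeholders; substitute your own values:

```python
# Quick standalone test: if this fails, the SFTP processors in NiFi will fail too.
import paramiko

# PEM-format key, owned by the NiFi user, permissions 600
key = paramiko.RSAKey.from_private_key_file("/opt/nifi/keys/sftp_key.pem")

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # for testing only
client.connect("sftp.example.com", port=22, username="nifi", pkey=key)

sftp = client.open_sftp()
print(sftp.listdir("."))  # if this prints a file listing, key and permissions are good
sftp.close()
client.close()
```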
04-25-2017
06:36 PM
@Avijeet Dash
Once a file is ingested into NiFi and becomes a FlowFile, its content will remain in NiFi's content repository until all active FlowFiles in your dataflow that point at that content claim have been satisfied. By satisfied, I mean they have reached a point in your dataflow(s) where those FlowFiles have been auto-terminated.

If FlowFile archiving is enabled in your NiFi, the FlowFile content will be moved to an archive directory once no active FlowFiles point at it any longer. The length of time it is retained in the archive directory is determined by the archive configuration properties in the nifi.properties file. By default, archiving is enabled with retention limited to 12 hours or 50% disk utilization.

Thanks, Matt
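For reference, the archive behavior is controlled by these entries in nifi.properties; the values shown are the defaults described above, but verify them against your own installation:

```
nifi.content.repository.archive.enabled=true
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
```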
04-25-2017
06:24 PM
@Bala Vignesh N V The NiFi admin guide covers installing NiFi on Windows. It is done via the command line using the NiFi tar.gz file. http://docs.hortonworks.com/HDPDocuments/HDF2/HDF-2.1.0/bk_dataflow-administration/content/how-to-install-and-start-nifi.html Thanks, Matt
04-25-2017
06:17 PM
@Dmitro Vasilenko The ConsumeKafka processor will only accept dynamic properties that are valid Kafka consumer configuration properties. max.message.bytes is a server-side configuration property. I believe what you are really looking for on the consumer side is max.partition.fetch.bytes. This property will be accepted by the ConsumeKafka processor, and you will not see the "Must be a known configuration parameter for this Kafka client" invalid tooltip notification. Thanks, Matt

Just as an FYI, I don't get pinged about any new answers/comments you make without the @<username> notation.
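To illustrate that this is a consumer-side setting, here is a small sketch outside of NiFi using the third-party kafka-python client; the topic and broker names are placeholders. The ConsumeKafka dynamic property corresponds to this same consumer configuration key:

```python
from kafka import KafkaConsumer

# max.partition.fetch.bytes maps to max_partition_fetch_bytes in this client;
# here it is raised to ~10 MB so larger messages can be fetched from a partition.
consumer = KafkaConsumer(
    "my-topic",
    bootstrap_servers="broker1:9092",
    max_partition_fetch_bytes=10 * 1024 * 1024,
)

for record in consumer:
    print(record.offset, len(record.value))
```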
04-25-2017
05:25 PM
@Simon Jespersen Posted answer to above question here: https://community.hortonworks.com/questions/98384/listsftp-failed-to-obtain-connection-to-remote-hos.html
04-24-2017
04:33 PM
3 Kudos
@Avijeet Dash Is the intent to manipulate these large files in any way once they have been ingested into NiFi? NiFi has no problem ingesting files of any type or size, provided sufficient space exists in the content repository to store that data.

For performance, NiFi only passes FlowFile references between processors within the NiFi dataflow. Even if you "clone" a large file down two or more dataflow paths, this only results in an additional FlowFile reference to the same content in the content repository. All FlowFile references to the same content must be resolved before the actual content is removed from the repository.

That being said, NiFi provides a multitude of processors for manipulating the content of FlowFiles. Anytime you modify/change the content of a FlowFile, a new FlowFile is created along with the new content. This is important because, following this new content creation, you still have both the original and your new version of the content in your content repository. So if manipulation of the content is to be done, you must plan accordingly to make sure you have sufficient repository storage.

JVM memory comes into the mix most noticeably when splitting large content into many smaller pieces of content. So if you plan on producing more than, say, 10,000 individual FlowFiles from a single large FlowFile, you will likely need to allocate additional JVM memory to your NiFi.

As you can see, a lot more needs to be considered beyond just the size of the data being ingested when planning out your NiFi needs.

Hope this helps,
Thanks, Matt
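If it helps to picture the reference counting described above, here is a purely conceptual Python sketch. It is not NiFi's implementation; it only models the idea that clones share one content claim and the content becomes removable only once the last FlowFile referencing it is gone:

```python
# Conceptual model only -- not NiFi code.
class ContentClaim:
    def __init__(self, data: bytes):
        self.data = data
        self.refs = 0

class FlowFile:
    def __init__(self, claim: ContentClaim):
        self.claim = claim
        claim.refs += 1

    def clone(self) -> "FlowFile":
        # Cloning adds another reference; the content itself is not copied.
        return FlowFile(self.claim)

    def terminate(self, repository: list):
        self.claim.refs -= 1
        if self.claim.refs == 0:
            repository.remove(self.claim)  # now eligible for removal/archive

repo = []
claim = ContentClaim(b"x" * 1_000_000)
repo.append(claim)

original = FlowFile(claim)
copy = original.clone()      # second dataflow path, no extra content storage used
original.terminate(repo)     # content stays: the clone still references it
copy.terminate(repo)         # last reference gone -> content leaves the repository
print(repo)                  # []
```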
04-24-2017
02:15 PM
@Avish Saha Some processors in NiFi work on FlowFiles in batches. In your case, one of the FlowFiles is failing to match the regex, which is causing a rollback of the entire session. Also, seeing as the processor reports 8,000,000+ tasks in the last 5 minutes, it does not look like it is penalizing that one bad FlowFile. While all FlowFiles routed to a "failure" relationship are penalized, this is not true for all processors when a session rollback occurs.
04-24-2017
01:44 PM
@TARANA POLEKAR Which properties you need to set depends on whether you are using RAW or HTTP to transfer data over S2S between your NiFi instances. The S2S properties must be configured on the target NiFi (the one NOT running the RPG).

Whether you use RAW or HTTP, the initial communication with the target NiFi instance goes to the target's HTTP(S) port. This is what allows you to set up your dataflow to use S2S. When it comes to data transfer: with the RAW format, data is transferred to the configured nifi.remote.input.host= at the configured nifi.remote.input.socket.port=; with the HTTP format, data is transferred to nifi.remote.input.host= at the node's HTTP(S) port.

My guess here is that you are using the default RAW protocol and your source instance cannot communicate with the configured nifi.remote.input.socket.port=. Another possible issue: if nifi.remote.input.host= is not set, Java tries to determine the hostname itself, and that may be resolving to localhost, causing communication issues.

Thanks, Matt
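For reference, the site-to-site settings live in the target node's nifi.properties. The first two property names are the ones referenced above; the values shown are only placeholders to illustrate the shape of a RAW-capable configuration:

```
nifi.remote.input.host=target-node.example.com
nifi.remote.input.secure=false
nifi.remote.input.socket.port=10000
nifi.remote.input.http.enabled=true
```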
04-21-2017
07:23 PM
3 Kudos
Note: This article was written as of the HDF 2.1.2 release, which is based on Apache NiFi 1.1.0.

NiFi has several processors that can be used to retrieve FlowFiles from an SFTP server. It is important to understand the different capabilities each provides so you know when you should be using one vs. another.

Let's start with the oldest of the available processors: GetSFTP

The GetSFTP processor is the original processor introduced for retrieving files from a remote SFTP server. Things to know about this processor:

1. Disadvantage: This processor does not retain any state. This means it does not keep track of which files it has previously retrieved. So if the property "Delete Original" is set to "false", this processor will continue to retrieve the same file over and over again.

2. This processor is not cluster friendly, meaning in a NiFi cluster it should be set to run on "primary node only" so that every node in the cluster is not competing to pull the same data.

3. In a NiFi cluster, the data retrieved by the GetSFTP processor should be redistributed across all nodes before further processing is done. This spreads out the workload so the "primary node" is not doing all the work. A typical example flow uses a "Remote Process Group" to redistribute the data from the GetSFTP processor to all nodes within the cluster.

*** As of the Apache NiFi 1.8 release, a new capability has been added to NiFi to facilitate easy redistribution of FlowFiles without needing to use an RPG. It can be done by simply configuring any connection to perform the redistribution. This blog explains that new capability very well: https://blogs.apache.org/nifi/entry/load-balancing-across-the-cluster ***

Disadvantage: All data content is pulled in to the primary node before being distributed across the cluster.
Disadvantage: Since the GetSFTP processor needs to delete the source file in order to prevent continuously re-consuming it, the data is unavailable to other users/servers.

Note: This processor has been deprecated in favor of the newer ListSFTP and FetchSFTP processors. It still exists to maintain backwards compatibility for NiFi users.

-----------------------------------------------------

Now let's talk about the ListSFTP and FetchSFTP processors and which of the disadvantages above were solved by these processors.

The ListSFTP processor is designed to connect to an SFTP server just like GetSFTP did; however, it does not actually retrieve the data. Instead, it creates a 0 byte FlowFile for every file it lists from the SFTP server. The FetchSFTP processor takes these 0 byte FlowFiles as input, actually retrieves the associated data, and inserts it into the FlowFile content at that time. I know it sounds like we just replicated what the GetSFTP processor does but split it between two processors, but there are key advantages to doing it this way:

1. The ListSFTP processor does maintain state across a NiFi cluster. So if you do not delete the source data, this processor will not pick up the same data multiple times like the GetSFTP processor will.

2. While the ListSFTP processor is still not cluster friendly, meaning it should be run on the primary node only, the FetchSFTP processor is cluster friendly. The ListSFTP processor should be used to create the 0 byte FlowFiles, then a Remote Process Group is used to distribute those FlowFiles across your cluster. Then the FetchSFTP processor is used on every node to retrieve the actual FlowFile content from the SFTP server.

*** The Apache NiFi 1.8 load-balanced connection capability mentioned above removes the need for an RPG in this flow as well: https://blogs.apache.org/nifi/entry/load-balancing-across-the-cluster ***

Advantage: The primary node is no longer using excess resources writing all content to its content repository before redistributing the FlowFiles to all nodes in the cluster.
Advantage: Cluster-wide state allows the primary node to switch within your NiFi cluster and the ListSFTP processor will still not list the same files twice.
Advantage: Being able to leave files on the SFTP server allows that data to be consumed by other end users/systems.
Disadvantage: Using an RPG to redistribute the ListSFTP-generated FlowFiles can be annoying, since the remote input port the RPG sends to must exist at the root canvas level. So if your flow is nested down in a sub process group, you must build a flow that feeds the load-balanced FlowFiles back down into that sub process group.

---------------------------------------

You will find within NiFi several other examples of where processors have been deprecated in favor of newer list/fetch based processors.

Thank you, Matt
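If it helps to see the list/fetch split in concrete terms, here is a rough sketch of the same idea outside of NiFi, using the third-party paramiko library; the host, credentials, and paths are placeholders. The point it illustrates is that listing transfers only metadata, while fetching retrieves content one entry at a time, so the two steps can be handled by different workers (or, in NiFi's case, different cluster nodes):

```python
import paramiko

def list_remote(sftp, directory="."):
    # "ListSFTP"-like step: names and sizes only, no content transferred
    return [(attr.filename, attr.st_size) for attr in sftp.listdir_attr(directory)]

def fetch_remote(sftp, filename, local_dir="/tmp"):
    # "FetchSFTP"-like step: pull the content for a single listed entry
    sftp.get(filename, f"{local_dir}/{filename}")

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # for testing only
client.connect("sftp.example.com", username="nifi",
               key_filename="/opt/nifi/keys/sftp_key.pem")
sftp = client.open_sftp()

for name, size in list_remote(sftp):
    fetch_remote(sftp, name)   # in NiFi, this step is spread across cluster nodes

sftp.close()
client.close()
```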
04-21-2017
05:19 PM
@John T It sounded a lot like a back pressure scenario to me when you first described what was going on. Glad you were able to resolve your issue. I also saw your other post and commented on it.