05-17-2017
02:12 PM
2 Kudos
@Simon Jespersen You need to trigger the PutHiveQL processor only once after ingesting all files?
If that is the case, the approach that comes to mind is as follows: route the "success" relationship of your ingest processor twice. Route one as you normally would for your existing dataflow, and route the second "success" relationship to a ReplaceText processor. This does not introduce duplicate data or much additional I/O; the file content is still only ingested once, but there are two FlowFiles pointing at the same content. The "success" relationship that feeds into the ReplaceText processor will be your PutHiveQL trigger flow. The ReplaceText processor is used to remove the content of those FlowFiles (down that path only; behind the scenes, new FlowFiles are created at this point, but since they are all zero bytes there is little I/O involved). Then you can use a MergeContent processor to merge all those zero-byte FlowFiles into one FlowFile. Finally, route the "merged" relationship of the MergeContent processor to your PutHiveQL processor. The ReplaceText and MergeContent processors would be configured along the lines of the sketch below. As always, NiFi provides many ways to accomplish a variety of different dataflow needs; this is just one suggestion. Thanks, Matt
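For reference, the key property values would look something like this. This is a minimal sketch expressed as Python dictionaries purely for illustration (the property names match the ReplaceText and MergeContent properties in the NiFi UI; the merge thresholds are assumptions you would tune to your batch size):

```python
# Illustrative only: these dicts mirror the processor property names as they
# appear in the NiFi UI; this is not an API call.

# ReplaceText: blank out the content so the trigger path carries 0-byte FlowFiles.
replace_text_properties = {
    "Replacement Value": "",                # empty string -> zero-byte content
    "Replacement Strategy": "Always Replace",
    "Evaluation Mode": "Entire text",
}

# MergeContent: bin all the 0-byte FlowFiles into a single trigger FlowFile.
# The entry counts below are assumptions; set them to the number of files per batch.
merge_content_properties = {
    "Merge Strategy": "Bin-Packing Algorithm",
    "Minimum Number of Entries": "1000",
    "Maximum Number of Entries": "1000",
    "Max Bin Age": "5 min",                 # safety valve if a batch comes up short
}
```

With "Always Replace" and an empty Replacement Value, every FlowFile on the trigger path comes out zero bytes, so MergeContent is only binning metadata.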
05-17-2017
12:33 PM
2 Kudos
@Toky Raobelina I am really not sure what version of NiFi you are running. The authentication strategy "LDAPS" is different from "START_TLS". LDAPS support in NiFi's login identity provider is a fairly recent addition; consult your NiFi's embedded admin guide documentation to verify that LDAPS is an option. If so, you will want to first change your login-identity-providers.xml file configuration to use LDAPS instead of START_TLS. Next, confirm your LDAP server's URL; LDAPS URLs typically start with ldaps:// instead of just ldap://. You also mentioned that you added the LDAPS server's public key to NiFi's keystore. NiFi uses two-way TLS authentication, so you should have added the LDAPS server's public key as a trustedCertEntry in NiFi's truststore instead of the keystore. Also make sure you have added the public key of each NiFi node as a trustedCertEntry on the LDAPS server. If all your NiFi certs were signed by a CA, you just need to add the public key for your CA as a trustedCertEntry instead. Thank you, Matt
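If you need to grab the LDAPS server's public certificate so you can import it into the truststore, here is a minimal sketch using Python's standard library (the hostname is a placeholder; 636 is the conventional LDAPS port):

```python
import ssl

# Hypothetical LDAPS endpoint; substitute your own server and port.
ldaps_host = "ldap.example.com"
ldaps_port = 636

# Fetch the server's certificate in PEM form. The saved file can then be
# imported into NiFi's truststore (e.g. with keytool) as a trustedCertEntry.
pem_cert = ssl.get_server_certificate((ldaps_host, ldaps_port))
with open("ldaps-server.pem", "w") as f:
    f.write(pem_cert)
print("Saved certificate for %s:%d" % (ldaps_host, ldaps_port))
```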
05-17-2017
12:22 PM
1 Kudo
@Muhammad Umar You have to remember that NiFi is run by some user, and that user runs all of these processor components. NiFi's working directory is not going to be the same directory your script points at. Try passing absolute directory paths to the script inside of NiFi instead of just "nifi-srcfiles/*" and "BackupFiles/", as sketched below. Thank you, Matt
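To see why the relative paths are fragile, here is a minimal Python sketch (the directory names from your question are reused, and the absolute paths are hypothetical placeholders):

```python
import os

# A relative path resolves against the *current working directory* of the
# process that runs the script -- for a script launched by NiFi, that is
# wherever the NiFi process was started, not the script's own directory.
print(os.getcwd())                       # e.g. the NiFi install dir, not your script dir
print(os.path.abspath("nifi-srcfiles"))  # resolves relative to the cwd above

# Passing absolute paths removes the dependence on the working directory:
src = "/home/umar/nifi-srcfiles"         # hypothetical absolute paths
dst = "/home/umar/BackupFiles"
```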
05-16-2017
01:34 PM
@bhumi limbu That is a very useful ERROR log message. You appear to be running into the following reported issue: https://issues.apache.org/jira/browse/NIFI-3096 Thanks, Matt
05-16-2017
01:25 PM
@Gaurav Jain Was I able to successfully answer your question? If so, please mark the answer as accepted. Thank you, Matt
05-15-2017
06:52 PM
1 Kudo
@yeah thatguy 10K FlowFiles is nothing in terms of load for NiFi. NiFi processors use system threads to run, and each processor can be configured with multiple "concurrent tasks", which essentially allows one processor to run multiple times at the exact same time. I would not, however, try to schedule one processor with 10,000 concurrent tasks (I don't know of any server with 10,000 CPU cores). Can you elaborate on your use case and why you must load all 10K files in parallel rather than in rapid succession? (See the sketch below for the difference.)

Processors are designed in a variety of ways depending on their function. Some processors work on one FlowFile at a time while others work on batches of FlowFiles. GetFile has a configurable Batch Size, which controls the number of files retrieved per processor execution; all files in a batch are committed as FlowFiles in NiFi at the same time upon ingestion. You could configure smaller batches and multiple concurrent tasks on this processor. The ListFile processor retrieves a complete listing of all files in the target directory and then creates a single zero-byte FlowFile for each of them; the complete batch is committed to the "success" relationship at the same time. The FetchFile processor retrieves the content of each of the listed files and inserts that content into the FlowFile. This processor is a good candidate for multiple concurrent tasks.

Each instance of NiFi runs in its own single JVM. Only FlowFile attributes live in JVM heap memory (FlowFile attributes are also persisted to disk). To help protect the JVM from OOM errors, NiFi will swap FlowFiles to disk if a connection's queue exceeds the configurable swapping threshold. The default swapping threshold is 20,000 and is set in the nifi.properties file; this setting is per connection, not for the entire NiFi dataflow(s). FlowFile content is written to the NiFi content repository and is only accessed when a processor performs a function that requires it to read or modify that content. NiFi's JVM heap memory defaults to only 512 MB, but is configurable via NiFi's bootstrap.conf file. Thanks, Matt
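As a rough analogy for concurrent tasks (plain Python, not NiFi's actual scheduler): a small worker pool drains 10,000 items in rapid succession without needing anywhere near 10,000 threads:

```python
from concurrent.futures import ThreadPoolExecutor

def process(flowfile_id):
    # Stand-in for whatever per-FlowFile work a processor does.
    return flowfile_id

# 4 "concurrent tasks" is usually plenty; 10,000 parallel threads would just
# fight over a handful of CPU cores.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, range(10_000)))

print(len(results))  # 10000 -- every item handled, a few at a time
```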
05-15-2017
05:27 PM
@Muhammad Umar When NiFi starts and has not been configured with a specific hostname or IP (nifi.web.http.host=) in the nifi.properties file, it binds to the IP address registered to every NIC present on the host system. If you specify a hostname or IP that does not resolve to or match the IP registered to one of your NICs, NiFi will fail to start; NiFi cannot bind to a port on an IP it does not own.

You can run the "ifconfig" command on the host running NiFi to see all NICs and the IPs registered to them. You should see the 172.17.x.x address and not the 192.168.x.x address. It definitely sounds like there is some network address translation going on here. The fact that you can reach NiFi over http://192.168.x.x:8078/ confirms this; something is simply routing all traffic from the 192.168.x.x address to the internal 172.17.x.x address. We already confirmed your browser cannot resolve a path directly to 172.17.x.x, because if it could, NiFi's UI would have opened. NiFi is in fact bound to 172.17.x.x and not 192.168.x.x, and NiFi cannot control how traffic is routed to this endpoint by the network. Thanks, Matt
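The underlying constraint is just OS-level socket binding. A minimal Python sketch, with your addresses used as placeholders:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Binding only succeeds for an IP actually assigned to a local NIC.
    # On your host that is the 172.17.x.x docker-style address:
    s.bind(("172.17.0.2", 8078))
except OSError as e:
    # Binding to the NAT-side 192.168.x.x address instead would raise
    # "Cannot assign requested address" -- the same wall NiFi hits if you
    # configure nifi.web.http.host with an IP the host does not own.
    print("bind failed:", e)
finally:
    s.close()
```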
05-15-2017
03:24 PM
@Gaurav Jain If a NiFi node suddenly goes down, how is it going to notify the other nodes? If the node goes down, the result of the job is neither a failure nor a success as NiFi defines them. The FlowFile that triggers your SparkJobExecutor should remain in the incoming connection until the job successfully completes or reports a failure; at that time the FlowFile is moved to the corresponding relationship. If the NiFi node goes down, when it comes back up FlowFiles are restored to their last known connection, which means this FlowFile will trigger your SparkJobExecutor to run again.

Are you looking only for a way to notify another node that the last Spark job did not complete, or are you also looking for a way for that other node to then run the job? The latter becomes even more difficult, since you must also tell the node that went down not to run the job again the next time it starts back up.

As far as the notification goes, you might be able to build a flow using the new Wait and Notify processors just released in Apache NiFi 1.2.0. You could send a copy of your FlowFile to another one of your nodes before executing the Spark job, and then send another FlowFile after the job completes. The other node would receive the first FlowFile and send it to a Wait processor. The Wait processor can be configured with a time limit; should that time expire, the FlowFile gets routed to the "expired" relationship, which you can use to run the job again on that node or simply send out an email alert. If the job completes before the expiration time, a FlowFile is sent to the same node to signal successful completion, which causes the Wait processor to route the waiting FlowFile to the "success" relationship, which you may choose to simply terminate. (A plain-Python analogy of this pattern follows below.) Here are the docs for these new processors: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.2.0/org.apache.nifi.processors.standard.Wait/index.html https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.2.0/org.apache.nifi.processors.standard.Notify/index.html

Bottom line: NiFi has no behind-the-scenes monitoring capability to accomplish what you are trying to do here, so a programmatic dataflow design must be used to meet this need. Now, if you are talking about a node simply becoming disconnected from the cluster, that is a different story. Just because a node disconnects does not mean it shuts down or stops running its dataflows; it will continue to run as normal and constantly attempt to reconnect. Thanks, Matt
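Here is that wait/notify pattern reduced to a plain-Python analogy (threading.Event stands in for the Notify signal; the timeout value is illustrative):

```python
import threading

job_done = threading.Event()       # stands in for the Notify signal

def on_notify_flowfile_arrives():
    # What the "Notify" FlowFile from the other node would trigger.
    job_done.set()

# The "Wait" side: block up to a time limit, then take the expired path.
if job_done.wait(timeout=600):     # e.g. a 10-minute expiration window
    print("job completed -> route to success (terminate)")
else:
    print("expired -> re-run the job here, or send an alert email")
```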
05-15-2017
01:52 PM
1 Kudo
Both your NiFi clusters can use the same ZooKeeper ensemble, but you need to make sure each cluster is configured to use a different ZK root node. The root node is set in the nifi.properties file and the state-management.xml file, as sketched below.
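For illustration, the relevant settings are shown here as Python strings (the property and element names are the standard ones; the connect string and the /nifi-cluster1 path are assumptions, and any distinct path per cluster works):

```python
# nifi.properties -- give each cluster its own root node:
nifi_properties = """
nifi.zookeeper.connect.string=zk1:2181,zk2:2181,zk3:2181
nifi.zookeeper.root.node=/nifi-cluster1
"""

# state-management.xml -- the ZooKeeper state provider must use the same root:
state_management_xml = """
<cluster-provider>
    <id>zk-provider</id>
    <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
    <property name="Connect String">zk1:2181,zk2:2181,zk3:2181</property>
    <property name="Root Node">/nifi-cluster1</property>
</cluster-provider>
"""
```

The second cluster would use the same connect string but a different root node, e.g. /nifi-cluster2, in both files.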
05-15-2017
12:37 PM
1 Kudo
@Shengjie Min When backpressure kicks in because a configured threshold has been met on a connection, the source processor of that connection (whether it is GetFile or QueryDatabaseTable) is no longer allowed to run until the queue drops back below the backpressure threshold. These thresholds are soft limits so as not to cause any data loss. Let's assume you have an object threshold of 100 set on a connection and that connection currently has 99 FlowFiles. If the preceding processor processes batches of files at a time, the entire batch will be processed and put on the connection; say the batch was 100 FlowFiles, your connection would then have 199 FlowFiles queued on it. The source processor would not be allowed to run again until the queue dropped below 100, because of your object threshold setting. (The sketch below walks through exactly these numbers.)

Data ingested by NiFi is written to the content repository, and FlowFile attributes about that content are written to the FlowFile repository. FlowFile attributes also remain in heap memory and are what is passed from processor to processor in your dataflow. To reduce the likelihood of NiFi running out of heap memory, NiFi is configured to swap FlowFiles out of heap memory to disk should the number of FlowFiles queued on a single connection exceed a configurable value. The default swapping threshold is 20,000 FlowFiles per connection and is set in the nifi.properties file.

Heap memory usage is something every person who builds dataflows must take into account. While FlowFiles in general use very little heap memory, there is nothing that stops a user from designing a dataflow that writes a lot of FlowFile attributes. A user could, for example, use the ExtractText processor to read the entire content (data) of a FlowFile into a NiFi FlowFile attribute; depending on the size of the content, that FlowFile could get very large. By default, NiFi comes configured to use a heap of only 512 MB. This value can be adjusted in NiFi's bootstrap.conf file. Thank you, Matt
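A toy sketch of why the object threshold is a soft limit (plain Python; the numbers mirror the example above):

```python
queue = list(range(99))        # 99 FlowFiles already queued on the connection
BACKPRESSURE_THRESHOLD = 100   # object threshold configured on the connection

def source_allowed_to_run():
    # The check happens *before* the source processor runs...
    return len(queue) < BACKPRESSURE_THRESHOLD

if source_allowed_to_run():
    # ...so a batch-oriented source can still overshoot the limit:
    queue.extend(range(100))   # commits a whole 100-FlowFile batch at once

print(len(queue))              # 199 -- over the threshold, but nothing was lost;
                               # the source is now blocked until it drains below 100
```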