Member since: 07-30-2019
Posts: 3406
Kudos Received: 1621
Solutions: 1006

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 18 | 12-17-2025 05:55 AM |
| | 79 | 12-15-2025 01:29 PM |
| | 41 | 12-15-2025 06:50 AM |
| | 194 | 12-05-2025 08:25 AM |
| | 338 | 12-03-2025 10:21 AM |
05-15-2017
06:52 PM
1 Kudo
@yeah thatguy 10K FlowFiles is a trivial load for NiFi. NiFi processors run on system threads, and each processor can be configured with multiple "concurrent tasks", which essentially lets one processor run multiple times at the exact same time. I would not, however, try to schedule one processor with 10,000 concurrent tasks (I don't know of any server that has 10,000 CPU cores). Can you elaborate on your use case and why you must load all 10k files in parallel rather than in rapid succession?

Processors are designed in a variety of ways depending on their function. Some work on one FlowFile at a time while others work on batches of FlowFiles. GetFile has a configurable Batch Size that controls the number of files retrieved per processor execution; all files in the batch are committed as FlowFiles in NiFi at the same time upon ingestion. You could configure smaller batches and multiple concurrent tasks on this processor. ListFile retrieves a complete listing of all files in the target directory and then creates a single 0-byte FlowFile for each of them; the complete batch is committed to the success relationship at the same time. FetchFile then retrieves the content of each listed file and inserts that content into the FlowFile. FetchFile is a good candidate for multiple concurrent tasks.

Each instance of NiFi runs in its own single JVM. Only FlowFile attributes live in JVM heap memory (FlowFile attributes are also persisted to disk). To help protect the JVM from OOM errors, NiFi will swap FlowFiles to disk if a connection's queue exceeds the configurable swapping threshold. The default swapping threshold is 20,000 and is set in the nifi.properties file; this setting is per connection, not for the entire NiFi dataflow. FlowFile content is written to the NiFi content repository and is only accessed when a processor performs a function that requires it to read or modify that content. NiFi's JVM heap defaults to only 512 MB, but is configurable via NiFi's bootstrap.conf file.
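For reference, these are the two settings mentioned above, shown with their shipped defaults (file paths are relative to your NiFi install; adjust the values to your environment):

# nifi.properties -- per-connection swap threshold
nifi.queue.swap.threshold=20000

# bootstrap.conf -- JVM heap (raise min and max together)
java.arg.2=-Xms512m
java.arg.3=-Xmx512m

Thanks, Matt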
05-15-2017
05:27 PM
@Muhammad Umar When NiFi starts and has not been configured with a specific hostname or IP via the nifi.web.http.host= property in the nifi.properties file, it binds to the IP address registered to every NIC present on the host system. If you specify a hostname or IP that does not resolve to or match the IP registered to any of your NICs, NiFi will fail to start. NiFi cannot bind to a port that belongs to an IP it does not own.

You can run the "ifconfig" command on the host running NiFi to see all NICs and the IPs registered to them. You should see the 172.17.x.x address and not the 192.168.x.x address shown. It definitely sounds like there is some network address translation going on here. The fact that you can reach NiFi over http://192.168.x.x:8078/ confirms this: the network is simply routing all traffic from the 192.168.x.x address to the internal 172.17.x.x address. We already confirmed your browser cannot resolve a path directly to 172.17.x.x, because if it could, NiFi's UI would have opened. NiFi is in fact bound to 172.17.x.x and not 192.168.x.x, and NiFi cannot control how traffic is routed to this endpoint by the network.
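If you do want NiFi to bind to one specific address, the relevant entries in nifi.properties look like this (the values below are placeholders; the host must resolve to an IP actually owned by a local NIC):

nifi.web.http.host=172.17.x.x
nifi.web.http.port=8078

Thanks, Matt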
05-15-2017
03:24 PM
@Gaurav Jain If the NiFi node suddenly goes down, how would it notify the other nodes? If the node goes down, the result of the job is neither a failure nor a success as NiFi defines them. The FlowFile that triggers your SparkJobExecutor should remain tied to the incoming connection until the job successfully completes or reports a failure; at that time the FlowFile is moved to the corresponding relationship. If the NiFi node goes down, FlowFiles are restored to their last known connection when it comes back up, which means this FlowFile will trigger your SparkJobExecutor to run again.

Are you looking only for a way to notify another node that the last Spark job did not complete, or are you also looking for a way for that other node to then run the job? The latter is even more difficult, since you must also tell the node that went down not to run the job again the next time it starts back up.

As far as the notification goes, you might be able to build a flow using the new Wait and Notify processors just released in Apache NiFi 1.2.0. You could send a copy of your FlowFile to another one of your nodes before executing the Spark job and then send another FlowFile after the job completes. The other node would receive the first FlowFile and send it to a Wait processor. The Wait processor can be configured with a time limit; should that time expire, the FlowFile gets routed to the expired relationship, which you can use to run the job again on that node or simply send out an email alert. If the job completes before the expiration time, a second FlowFile sent to that node notifies it of successful completion, causing the Wait processor to route the FlowFile to the success relationship, which you may choose to simply terminate. Here are the docs for these new processors:

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.2.0/org.apache.nifi.processors.standard.Wait/index.html
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.2.0/org.apache.nifi.processors.standard.Notify/index.html

Bottom line, NiFi has no behind-the-scenes monitoring capability to accomplish what you are trying to do here, so a programmatic dataflow design must be used to meet this need. Now if you are talking about a node simply becoming disconnected from the cluster, that is a different story: just because a node disconnects does not mean it shuts down or stops running its dataflows. It will continue to run as normal and constantly attempt to reconnect.
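A rough sketch of that Wait/Notify pattern (the signal identifier and timeout below are placeholders you would tune to your job):

Node A (runs the Spark job):
  1. Before triggering the Spark job, send a copy of the FlowFile to Node B (e.g. via Site-to-Site).
  2. After the job completes, send a second FlowFile carrying the same correlation attribute to Node B, routed into a Notify processor there.

Node B (watches the job):
  - Route the first FlowFile into a Wait processor configured roughly as:
      Release Signal Identifier: ${filename}   (any attribute both FlowFiles share)
      Expiration Duration: 30 min              (longer than the job should ever take)
      Distributed Cache Service: a DistributedMapCacheClientService shared with Notify
  - Wait's success relationship fires when the Notify signal arrives in time; the expired relationship fires when it does not, and can feed PutEmail or a flow that re-runs the job.

Thanks, Matt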
05-15-2017
01:52 PM
1 Kudo
Both your NiFi clusters can use the same ZooKeeper, but you need to make sure each cluster is configured to use a different ZK root node. The root node is set in the nifi.properties file and the state-management.xml file.
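For example, the relevant entries look like this (the /nifi-cluster1 and /nifi-cluster2 paths are just placeholders; any two distinct paths work):

Cluster 1:
  nifi.properties:       nifi.zookeeper.root.node=/nifi-cluster1
  state-management.xml:  set the ZooKeeper provider's "Root Node" property to /nifi-cluster1

Cluster 2:
  nifi.properties:       nifi.zookeeper.root.node=/nifi-cluster2
  state-management.xml:  set the ZooKeeper provider's "Root Node" property to /nifi-cluster2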
05-15-2017
12:37 PM
1 Kudo
@Shengjie Min When backpressure kicks in because a configured threshold has been met on a connection, the source processor of that connection (whether it is GetFile or QueryDatabaseTable) is no longer allowed to run until the queue drops back below the backpressure threshold. These thresholds are soft limits so as not to cause any data loss. Let's assume you have an object threshold set to 100 on a connection and that connection currently has 99 FlowFiles. If the preceding processor works on batches of files at a time, the entire batch will be processed and put on the connection. Let's say the batch was 100 FlowFiles; your connection would then have 199 FlowFiles queued on it. Because of your object threshold setting, the source processor would not be allowed to run again until the queue dropped below 100.

Data ingested by NiFi is written to the content repository, and FlowFile attributes about that content are written to the FlowFile repository. FlowFile attributes also remain in heap memory and are what is passed from processor to processor in your dataflow. To reduce the likelihood of NiFi running out of heap memory, NiFi swaps FlowFiles out of heap memory to disk should the number of FlowFiles queued on a single connection exceed a configurable value. The default swapping threshold is 20,000 FlowFiles per connection and is set in the nifi.properties file.

Heap memory usage is something every person who builds dataflows must take into account. While FlowFiles in general use very little heap memory, there is nothing that stops a user from designing a dataflow that writes very large FlowFile attributes. A user could, for example, use the ExtractText processor to read the entire content (data) of a FlowFile into a FlowFile attribute; depending on the size of the content, that attribute could get very large. By default, NiFi comes configured to use a heap of only 512 MB. This value can be adjusted in NiFi's bootstrap.conf file.
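As a concrete guard for that ExtractText example, the processor exposes properties that cap how much content can end up in an attribute; the values below are illustrative placeholders, so check your processor's usage docs for the exact defaults:

  Maximum Buffer Size: 1 MB            (how much content is read for matching)
  Maximum Capture Group Length: 1024   (how many characters a captured attribute may hold)

Keeping these small limits how much of a FlowFile's content can be pulled into heap as attributes. Thank you, Matt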
05-15-2017
12:16 PM
1 Kudo
@frank chen The CN for your certificate should match the FQDN of the server where you installed NiFi. Using localhost in a certificate is never a good idea from a security standpoint. Alternatively, you can create a certificate that uses SAN entries. These SAN entries should be DNS-resolvable hostnames (with a SAN entry that matches the FQDN of the server, the CN does not then need to contain the FQDN). While it is possible to add a security exception in your browser for this bad server certificate, you will not be able to do this should you stand up a NiFi cluster where the nodes talk securely to one another. I suggest using the toolkit to generate a certificate that uses the server's FQDN as both its CN and as a Subject Alternative Name (SAN) DNS entry.

That aside, where did you get the user certificate that is being used to authenticate you as a user to access NiFi? You can use the tls-toolkit to create a user certificate as well, which you will need to load into your browser. Alternatively, you could configure NiFi to use an external LDAP server or Kerberos for user authentication. When you access a secured NiFi instance/cluster URL, the server looks for a valid user certificate it can trust in the request. If no client certificate is presented to authenticate with, NiFi checks whether any other authentication method has been configured; if none has, the connection is closed. So your issue is one of the following:

1. Your browser does not have a client (user) certificate loaded to pass to NiFi for authentication.
2. You have a user certificate, but it is not trusted by your NiFi instance/cluster. The entries in the NiFi truststore.jks are used to trust the client certificates presented. The truststore typically includes a number of Certificate Authority (CA) trustedCertEntries; it may also contain the public keys of self-signed certificates as trustedCertEntries.

If you found this answer addressed your question, please don't forget to mark it "accepted".
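If it helps, a typical tls-toolkit invocation looks something like this (the hostname and user DN are placeholders, and the available flags vary a bit between toolkit versions, so check the toolkit's usage output):

  ./bin/tls-toolkit.sh standalone -n 'nifi01.example.com' -C 'CN=admin, OU=NiFi' -o ./target

This generates a keystore and truststore for the node plus a .p12 client certificate (with a password file) for the admin user that you can import into your browser. Thank you, Matt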
05-15-2017
12:05 PM
2 Kudos
@Gaurav Jain
In a cluster, the only behind-the-scenes communications that occur are heartbeat messages sent from each node to the currently elected cluster coordinator. These heartbeat messages contain only health and status information. If the node running the Spark job goes down, not only would the health and status messages stop, but there is nothing in those messages that would indicate the status of a currently executing Spark job. The FlowFile that was used to trigger the Spark job will be restored to the last queue it was in before the node went down, which means that when the node comes back online, it will trigger the same Spark job to run again. Thank you, Matt
05-12-2017
06:55 PM
@Muhammad Umar Both ports 8078 and 8079 are likely not being forwarded by your HDP sandbox and will need to be added: https://hortonworks.com/hadoop-tutorial/sandbox-port-forwarding-guide/
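If you are running the VirtualBox sandbox, the equivalent CLI rules look roughly like this (the VM name is a placeholder; run these with the VM powered off, or use the port-forwarding UI described in the tutorial above):

  VBoxManage modifyvm "Hortonworks Sandbox" --natpf1 "nifi-http,tcp,,8078,,8078"
  VBoxManage modifyvm "Hortonworks Sandbox" --natpf1 "nifi-alt,tcp,,8079,,8079"

Thanks, Matt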
05-12-2017
06:48 PM
@bhumi limbu Is this error followed by a stack trace in the nifi-app.log?
05-12-2017
06:29 PM
@Muhammad Umar If the following command shows NiFi is still running:

./nifi.sh status

and the following command shows NiFi listening on port 8079:

netstat -ant | grep LISTEN

then the issue is not with NiFi; something external to NiFi is blocking connections to port 8079 from the host where your browser is running. What OS version is running on the server/VM where NiFi is running?
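If NiFi is listening, a quick next check is the OS firewall on the NiFi host (which command applies depends on your distribution; these are just the usual suspects):

  sudo firewall-cmd --list-ports      (firewalld, e.g. RHEL/CentOS 7)
  sudo iptables -L -n                 (iptables-based systems)

Thanks, Matt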