About bbende

bbende · ‎03-09-2017

http://bryanbende.com/development/2016/08/30/apache-nifi-1.0.0-secure-site-to-site

bbende · ‎02-24-2017

Can you take a thread dump and provide the output here?

./bin/nifi.sh dump /path/to/output/dump.txt

bbende · ‎02-10-2017

@Raj B Thanks! Since the NCM didn't do any data processing, there is actually not much difference for this article between 0.x and 1.x. The only real difference is that when setting up site-to-site connections you now create a remote process group and use the URL of any node in the cluster, whereas before you entered the URL of the NCM. Other than that, just pretend the NCM isn't in the diagrams and you should be good to go.

bbende · ‎12-02-2016

@Avijeet Dash There is really no correct answer for the architecture because it depends on the specs of the servers, on the amount of data moving through the system, and on what is being done to each piece of data. I can say that for a high-volume production scenario, you would probably not want to co-locate the services on each node. The biggest impact on performance will likely come from making sure each NiFi repository (flow file, content, provenance) has its own disks, and Kafka has its own disks, to avoid I/O contention.

NiFi currently doesn't do data replication, but it is being worked on by the community: https://cwiki.apache.org/confluence/display/NIFI/Data+Replication

Even without data replication, you would typically have a RAID configuration for the repository disks on each of your NiFi nodes, so it would take some kind of error that went beyond a single disk failure to lose data. As long as you can get the node back up, all the data will be there.

bbende · ‎12-01-2016

@Avijeet Dash Thanks! It really depends on what you are trying to do. NiFi and Kafka serve two different purposes.

NiFi is a data flow tool meant to move data between systems and provide centralized management of the data flow. All of the data in NiFi is persisted to disk and will survive restarts/crashes. Kafka provides a durable stream store with a decentralized publish and subscribe model, where consumers can manage their own offsets and reset them to replay data.

NiFi is not trying to hold on to the data; it is trying to bring it somewhere, and once the data is delivered to the destination it is no longer in NiFi. Kafka, by contrast, typically holds on to the data longer, which is what allows consumers to reset their offsets and replay data. Also, if you have many downstream consumers that are all consuming the same data, it makes more sense for those consumers to latch on to a Kafka topic. NiFi does offer the ability for consumers to pull data via site-to-site, but you would need to set up an Output Port in NiFi for each of these consumers to pull from.

So, some examples:

If you are trying to ingest data to HDFS, then NiFi can do that by itself.

If you are trying to provide data to 10s or 100s of streaming analytics, then putting the data in Kafka makes sense, although you may still want/need NiFi to get your data into Kafka.

If you have data sources that you want to get into Kafka, but they can't be changed to communicate with Kafka, then use NiFi to reach out, get the data from those systems, and publish it to Kafka.

bbende · ‎10-25-2016

I believe that "List Queue" would be a "View Data" policy on the source, and "Empty Queue" would be a "Modify Data" on the source component. Also keep in mind that if you are clustered, all of the nodes in the cluster also need to be part of this policy because all entities (users + machines) involved in the request need to be authorized for the data.

bbende · ‎10-25-2016

Well, there would be a listener on each cluster node, but it is up to you to route the data to each of those listeners if you want to use them all. If you have a cluster of 3 NiFi nodes and you set up syslog to push data to node 1, then you are only using the listener on node 1 and the other two listeners aren't doing anything. You would need to either have the syslog agent distribute the data to all 3 listeners, or put a load balancer in front of NiFi, have the syslog agent send to the load balancer, and let the load balancer distribute to the 3 nodes.
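As a rough illustration of the distribution idea, here is a minimal sketch of round-robin routing across three listeners. The host:port strings and the function name are made up for the example; a real load balancer or syslog agent would do this at the network level rather than in Python.

```python
from itertools import cycle


def distribute(messages, listeners):
    """Assign each message to a listener in round-robin order."""
    assignment = {listener: [] for listener in listeners}
    for message, listener in zip(messages, cycle(listeners)):
        assignment[listener].append(message)
    return assignment


# Hypothetical addresses for the listener on each of three NiFi nodes.
NODES = ["nifi-node1:514", "nifi-node2:514", "nifi-node3:514"]

if __name__ == "__main__":
    result = distribute([f"msg-{i}" for i in range(6)], NODES)
    for node, received in result.items():
        print(node, received)
```

If the sender only ever targets the first entry in NODES, the other two lists stay empty, which is the idle-listener situation described above.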

bbende · ‎10-25-2016

The Message Queue is in memory, so anything in there would be lost if the node crashed. You could keep the Max Size of Message Queue really small, possibly even set it to 1, to avoid losing anything, but this may not work well for performance. You really need an application-level protocol that can send acknowledgements back to the sender when data is successfully written to a flow file; if the sender never receives an ack, it can re-send. There is a ListenRELP processor that does this. It is just like ListenTCP, but the RELP protocol allows for acknowledgements.
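To make the ack-and-resend idea concrete, here is a small sketch of the pattern. It is not the RELP wire format and does not use any NiFi API; the class and function names are invented for the example, and the receiver simply pretends to fail a fixed number of times before it persists a message.

```python
class FlakyReceiver:
    """Stand-in for a listener that loses the first few messages before persisting them."""

    def __init__(self, fail_first_n):
        self.fail_first_n = fail_first_n
        self.attempts = 0
        self.committed = []

    def receive(self, message):
        self.attempts += 1
        if self.attempts <= self.fail_first_n:
            return False  # no ack: the message never made it to durable storage
        self.committed.append(message)
        return True  # ack is returned only after the message is stored


def send_with_retry(receiver, message, max_attempts=5):
    """Re-send until the receiver acknowledges; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if receiver.receive(message):
            return attempt
    raise RuntimeError(f"no ack after {max_attempts} attempts")


if __name__ == "__main__":
    receiver = FlakyReceiver(fail_first_n=2)
    attempts = send_with_retry(receiver, "event-1")
    print(attempts, receiver.committed)
```

The key point is that the ack is tied to the data being stored, so an unacknowledged message is always safe to send again.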

bbende · ‎10-24-2016

There are as many listeners as cluster nodes, and you would need to route the traffic to each node appropriately; one option is a load balancer in front that supports TCP or UDP. The concurrent tasks setting only affects processing of the messages that have already been read by the listener.

bbende · ‎09-19-2016

Introduction

Apache Kafka is a high-throughput distributed messaging system that has become one of the most common landing places for data within an organization. Given that Apache NiFi's job is to bring data from wherever it is, to wherever it needs to be, it makes sense that a common use case is to bring data to and from Kafka. The remainder of this post will take a look at some approaches for integrating NiFi and Kafka, and take a deep dive into the specific details regarding NiFi's Kafka support.

NiFi as a Producer

A common scenario is for NiFi to act as a Kafka producer. With the advent of the Apache MiNiFi sub-project, MiNiFi can bring data from sources directly to a central NiFi instance, which can then deliver data to the appropriate Kafka topic. The major benefit here is being able to bring data to Kafka without writing any code, by simply dragging and dropping a series of processors in NiFi, and being able to visually monitor and control this pipeline.

NiFi as a Consumer

In some scenarios an organization may already have an existing pipeline bringing data to Kafka. In this case NiFi can take on the role of a consumer and handle all of the logic for taking data from Kafka to wherever it needs to go. The same benefit as above applies here. For example, you could deliver data from Kafka to HDFS without writing any code, and could make use of NiFi's MergeContent processor to take messages coming from Kafka and batch them together into appropriately sized files for HDFS.

Bi-Directional Data Flows

A more complex scenario could involve combining the power of NiFi, Kafka, and a stream processing platform to create a dynamic self-adjusting data flow. In this case, MiNiFi and NiFi bring data to Kafka which makes it available to a stream processing platform, or other analytic platforms, with the results being written back to a different Kafka topic where NiFi is consuming from, and the results being pushed back to MiNiFi to adjust collection.
An additional benefit in this scenario is that if we need to do something else with the results, NiFi can deliver this data wherever it needs to go without having to deploy new code.

NiFi's Kafka Integration

Due to NiFi's isolated classloading capability, NiFi is able to support multiple versions of the Kafka client in a single NiFi instance. The Apache NiFi 1.0.0 release contains the following Kafka processors:

GetKafka & PutKafka using the 0.8 client
ConsumeKafka & PublishKafka using the 0.9 client
ConsumeKafka_0_10 & PublishKafka_0_10 using the 0.10 client

Which processor to use depends on the version of the Kafka broker that you are communicating with, since Kafka does not necessarily provide backward compatibility between versions. For the rest of this post we'll focus mostly on the 0.9 and 0.10 processors.

PublishKafka

PublishKafka acts as a Kafka producer and will distribute data to a Kafka topic based on the number of partitions and the configured partitioner; the default behavior is to round-robin messages between partitions. Each instance of PublishKafka has one or more concurrent tasks executing (i.e. threads), and each of those tasks publishes messages independently.

ConsumeKafka

On the consumer side, it is important to understand that Kafka's client assigns each partition to a specific consumer thread, such that no two consumer threads in the same consumer group will consume from the same partition at the same time. This means that NiFi will get the best performance when the partitions of a topic can be evenly assigned to the concurrent tasks executing the ConsumeKafka processor.

Let's say we have a topic with two partitions and a NiFi cluster with two nodes, each running a ConsumeKafka processor for the given topic. By default each ConsumeKafka has one concurrent task, so each task will consume from a separate partition.
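The one-partition-per-consumer-thread rule can be explored with a toy model. The sketch below is not Kafka's actual assignor; it simply hands partitions out to tasks in order, and the function and node names are made up for illustration. Real assignments depend on the configured assignment strategy and on which consumers join the group.

```python
def assign_partitions(num_partitions, nodes, tasks_per_node):
    """Toy model: deal partitions out to consumer tasks in round-robin order.

    Each partition goes to exactly one task, so no two tasks in the
    group ever share a partition. Tasks may end up with zero partitions.
    """
    tasks = [f"{node}/task-{t}" for node in nodes for t in range(tasks_per_node)]
    assignment = {task: [] for task in tasks}
    for partition in range(num_partitions):
        assignment[tasks[partition % len(tasks)]].append(partition)
    return assignment


if __name__ == "__main__":
    # Two partitions, two nodes, one concurrent task per node.
    for task, partitions in assign_partitions(2, ["node1", "node2"], 1).items():
        print(task, partitions)
```

Changing the partition count, node list, or tasks per node shows how quickly tasks become idle or overloaded when the numbers do not line up.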
Now let's say we still have one concurrent task for each ConsumeKafka processor, but the number of nodes in our NiFi cluster is greater than the number of partitions in the topic. We would end up with one of the nodes not consuming any data.

If we have more partitions than nodes/tasks, then each task will consume from multiple partitions. In this case, with four partitions and a two-node NiFi cluster with one concurrent task for each ConsumeKafka, each task would consume from two partitions.

Now if we have two concurrent tasks for each processor, then the number of tasks lines up with the number of partitions, and we get each task consuming from one partition.

If we had increased the concurrent tasks, but only had two partitions, then some of the tasks would not consume any data. Note that there is no guarantee which of the four tasks would consume data in this case; it is possible it would be two tasks on the same node, with the other node not doing anything.

The take-away here is to think about the number of partitions vs. the number of consumer threads in NiFi, and adjust as necessary to create the appropriate balance.

Security

Configuring PublishKafka requires providing the location of the Kafka brokers and the topic name.

Configuring ConsumeKafka also requires providing the location of the Kafka brokers, and supports a comma-separated list of topic names, or a pattern to match topic names.

Both processors make it easy to set up any of the security scenarios supported by Kafka. This is controlled through the Security Protocol property, which has the following options:

PLAINTEXT
SSL
SASL_PLAINTEXT
SASL_SSL

When selecting SSL or SASL_SSL, the SSL Context Service must be populated to provide a keystore and truststore as needed.
When selecting SASL_PLAINTEXT or SASL_SSL, the Kerberos Service Name must be provided, and the JAAS configuration file must be set through a system property in conf/bootstrap.conf with something like the following:

java.arg.15=-Djava.security.auth.login.config=/path/to/jass-client.config

Both processors also support user-defined properties that will be passed as configuration to the Kafka producer or consumer, so any configuration that is not explicitly defined as a first-class property can still be set.

Performance Considerations

In addition to configuring the number of concurrent tasks as discussed above, there are a couple of other factors that can impact the performance of publishing and consuming in NiFi.

PublishKafka & ConsumeKafka both have a property called "Message Demarcator".

On the publishing side, the demarcator indicates that incoming flow files will have multiple messages in the content, with the given demarcator between them. In this case, PublishKafka will stream the content of the flow file, separating it into messages based on the demarcator, and publish each message individually. When the property is left blank, PublishKafka will send the content of the flow file as a single message.

On the consuming side, the demarcator indicates that ConsumeKafka should produce a single flow file with the content containing all of the messages received from Kafka in a single poll, using the demarcator to separate them. When this property is left blank, ConsumeKafka will produce a flow file per message received.

Given that Kafka is tuned for smaller messages, and NiFi is tuned for larger messages, these batching capabilities allow for the best of both worlds, where Kafka can take advantage of smaller messages, and NiFi can take advantage of larger streams, resulting in significantly improved performance.
Publishing a single flow file with 1 million messages and streaming that to Kafka will be significantly faster than sending 1 million flow files to PublishKafka. The same can be said on the consuming side, where writing a thousand consumed messages to a single flow file will produce higher throughput than writing a thousand flow files with one message each.
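The demarcator behavior on both sides can be sketched in a few lines. This is only a model of the semantics described above, not NiFi's implementation (which streams the content rather than loading it into memory). The function names are invented, and dropping empty segments on the publishing side is an assumption made for the example.

```python
from typing import List, Optional


def split_for_publish(content: bytes, demarcator: Optional[bytes]) -> List[bytes]:
    """Publishing side: one flow file's content becomes one or many messages."""
    if not demarcator:
        return [content]  # blank demarcator: the whole content is a single message
    # Assumption for this sketch: empty segments (e.g. a trailing newline) are skipped.
    return [part for part in content.split(demarcator) if part]


def bundle_for_consume(messages: List[bytes], demarcator: Optional[bytes]) -> List[bytes]:
    """Consuming side: one poll's messages become one or many flow file contents."""
    if not demarcator:
        return list(messages)  # blank demarcator: one flow file per message
    return [demarcator.join(messages)]  # one flow file holding the whole poll


if __name__ == "__main__":
    print(split_for_publish(b"a\nb\nc\n", b"\n"))
    print(bundle_for_consume([b"a", b"b", b"c"], b"\n"))
```

With a demarcator set, NiFi handles one large flow file while Kafka still sees individual small messages, which is where the throughput gain comes from.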

Member Since	‎09-29-2015 04:02 PM
Last Visited	‎09-10-2020 01:23 PM
Posts	871
Kudos received	709
