Created 07-26-2016 11:13 AM
Currently we are looking to use Nifi to get Json events from kafka and store them to s3. We are looking to set up a nifi cluster and i have following questions please?
1. What happens to Nifi when I do a rolling restart of kafka? I think nifi listens to zookeeper :2181 or kafka nodes how do we configure it in GetKafka processor?
2. What does Nifi do if the peered connection becomes unavailable?
3. How does it handle a kafka node going offline?
4. Does it always connect to the same kafka node or what determines which node to connect to?
5. Does it connect on IP or DNS (kafka IP's can change)? Can we make Nifi to connect to kafka DNS rather than IP.
6. What will it do if a leadership election is invoked while its consuming?
7. What monitoring and alerting will be in place in production?
8. Any issues if Zookeeper embedded in slaves and how Zookeeper maintains the state integrity if we have more than 1 zookeeper in the nifi cluster? Or is it better to have a separate single ZK instance in cluster? thanks
Created 07-26-2016 12:17 PM
The GetKafka processor currently uses the 0.8 (old style) High Level consumer API for Kafka. This means it will use Zookeeper for discovery of the Kafka brokers. These will be based on the host names advertised (advertised.hosts.name) by your Kafka brokers, so should be DNS based. It also means that offsets read are stored in the Kafka zookeeper per consumer group id, as specified in the NiFi processor. So from a Kafka point of view, NiFi is just like any other Kafka consumer, and will connect, balance, and retry in the normal way.
From a Kafka point of view the Nifi zookeeper is not used at all. However, for NiFi leader (primary) elections the internal zookeeper, or the external one if configured that way will be used. In a scenario where you have an existing zookeeper cluster, for example for your Kafka install, you will find it simpler to just use that, since the NiFi load on ZK is usually fairly light (unless you make extensive use of the State api for things like QueryDatabaseTable for example). Never use a single zookeeper node in production, three at least.
For monitoring of the NiFi cluster, it's worth considering using the AmbariReportingService to integrate your monitoring into Ambari. Note that there are also multiple sources of monitoring stats within NiFi which can be monitored for your particular data flows.
Created 07-26-2016 12:17 PM
The GetKafka processor currently uses the 0.8 (old style) High Level consumer API for Kafka. This means it will use Zookeeper for discovery of the Kafka brokers. These will be based on the host names advertised (advertised.hosts.name) by your Kafka brokers, so should be DNS based. It also means that offsets read are stored in the Kafka zookeeper per consumer group id, as specified in the NiFi processor. So from a Kafka point of view, NiFi is just like any other Kafka consumer, and will connect, balance, and retry in the normal way.
From a Kafka point of view the Nifi zookeeper is not used at all. However, for NiFi leader (primary) elections the internal zookeeper, or the external one if configured that way will be used. In a scenario where you have an existing zookeeper cluster, for example for your Kafka install, you will find it simpler to just use that, since the NiFi load on ZK is usually fairly light (unless you make extensive use of the State api for things like QueryDatabaseTable for example). Never use a single zookeeper node in production, three at least.
For monitoring of the NiFi cluster, it's worth considering using the AmbariReportingService to integrate your monitoring into Ambari. Note that there are also multiple sources of monitoring stats within NiFi which can be monitored for your particular data flows.
Created 07-26-2016 12:54 PM
@Simon Elliston Ball thanks for the answer 🙂 If we install multiple ZK that means we need to embed it with nifi on the same node. If so how do we sync ZK state from different slaves nodes where miltuple ZK instances are installed ? Thank you.
Created 07-26-2016 01:01 PM
You can use the embedded Zookeepers in the Nifi process to create a zookeeper cluster, or you can use an external zookeeper cluster (either the one you're using for Kafka, or a different one). Zookeeper synchronises the state within it's cluster.