I am trying to understand basic usage patterns for both, and would like to create a simple checklist to help answer the question. Something like:
- Is scale critical, particularly in terms of volume? Kafka
- Do I need/want to transform data on ingest? NiFi is perfect for this; in Kafka, events are immutable payloads
- Can a producer of messages outpace and overwhelm a consumer? Obviously Kafka, to decouple them
- Can the source be modified? With Kafka, the source must publish events; with NiFi, no changes are required
I do get that in many scenarios both will be appropriate, so I'm just trying to get a good handle on their strengths & weaknesses.
They are complementary technologies, not really a "which is better"...
You may find this webinar helpful:
Agreed Bryan, I especially like slide #5 in your SlideShare deck, as it defines a basic architecture for pretty much any IoT play. As simple as it looks, it begs a bunch of questions in my mind: for example, why route events between the edge locations (I assume the blue box represents locations where the IoT events land in the customer network?) and the HDF/HDP cluster in the orange box? Seems like a message broker (Kafka) makes sense here to guarantee delivery. I know this is an HDF 2 view, so I'm just trying to hone in on a clear approach.
The main idea in that picture is that MiNiFi is running on the IoT devices, sending back to, say, regional NiFi instances where some routing/filtering/enrichment might be occurring, and then the regional instances are sending to a central NiFi that sits in front of Kafka, HDFS, or whatever else. This could obviously be adjusted depending on the use case; maybe there are no regional instances and MiNiFi is sending directly into the NiFi at the main data center, whatever makes sense.
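To make the edge piece concrete, a MiNiFi agent's flow is defined in a `config.yml` that reads data locally and ships it over site-to-site to an input port on the regional NiFi. A minimal sketch, where the hostname, file path, port name, and ids are all placeholders and the exact keys may vary by MiNiFi version:

```yaml
MiNiFi Config Version: 3
Flow Controller:
  name: edge-sensor-flow
Processors:
  - id: tail-1
    name: TailSensorLog
    class: org.apache.nifi.processors.standard.TailFile
    scheduling strategy: TIMER_DRIVEN
    scheduling period: 10 sec
    Properties:
      File to Tail: /var/log/sensor.log          # placeholder path
Connections:
  - id: conn-1
    name: TailSensorLog/to-regional
    source id: tail-1
    source relationship names:
      - success
    destination id: port-1
Remote Process Groups:
  - id: rpg-1
    name: Regional NiFi
    url: https://regional-nifi.example.com:8443/nifi   # placeholder URL
    Input Ports:
      - id: port-1
        name: from-minifi    # must match an input port on the regional NiFi
```

The same pattern repeats upstream: the regional NiFi's flow has a Remote Process Group pointing at the central NiFi.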
For the question about using a msg broker for guaranteed delivery: are you asking about having the data go directly from the IoT device to Kafka, or possibly from MiNiFi to Kafka?
In a lot of cases the data sources won't have the ability to send data to Kafka, likely because there is a large variety of devices that don't all speak the Kafka protocol. So MiNiFi gives you a tool to get the data off these data sources and back to a central location.
By sending from MiNiFi to a central NiFi and then to Kafka, you have a nice way of managing the flow of data to Kafka. If you had 100 MiNiFi instances sending data directly to Kafka, and say you wanted to pause the flow to Kafka for a few minutes, you'd have to pause all 100 MiNiFi instances somehow, whereas with the central NiFi instance you can just stop the PublishKafka processor for a few minutes and then start it again, and MiNiFi never even knows this happened.
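That "stop one processor" operation is just a call against NiFi's REST API (in NiFi 1.x, a `PUT` to `/nifi-api/processors/{id}` with the component state set). A rough Python sketch: the host and processor id are placeholders, and in practice you'd first `GET` the processor to obtain its current revision version:

```python
import json

NIFI_API = "https://central-nifi.example.com:8443/nifi-api"  # placeholder host

def state_change_body(processor_id: str, client_id: str, version: int, state: str) -> dict:
    """Build the request body NiFi expects when changing a processor's run state."""
    return {
        "revision": {"clientId": client_id, "version": version},
        "component": {"id": processor_id, "state": state},  # "STOPPED" or "RUNNING"
    }

# Pausing and later resuming PublishKafka would then look roughly like this
# (using the `requests` library; pid is the PublishKafka processor's id):
#
#   requests.put(f"{NIFI_API}/processors/{pid}",
#                json=state_change_body(pid, "my-client", current_version, "STOPPED"))
#   ...a few minutes later, with the revision returned by the previous call...
#   requests.put(f"{NIFI_API}/processors/{pid}",
#                json=state_change_body(pid, "my-client", new_version, "RUNNING"))

body = state_change_body("1234-abcd", "demo-client", 7, "STOPPED")
print(json.dumps(body))
```

While the processor is stopped, data simply queues in the connection in front of it; nothing upstream has to change.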
There shouldn't be any difference in terms of guaranteed delivery... MiNiFi talking site-to-site to NiFi is guaranteed delivery, and NiFi talking to Kafka via PublishKafka is guaranteed delivery.
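On the Kafka side, PublishKafka's "Delivery Guarantee" property maps onto the producer's `acks` setting; "Guarantee Replicated Delivery" corresponds to `acks=all`. If you were publishing yourself, the equivalent knobs would look roughly like this sketch (librdkafka-style config keys as used by confluent-kafka; the broker address is a placeholder):

```python
# Producer settings giving delivery semantics comparable to PublishKafka's
# "Guarantee Replicated Delivery" setting.
producer_config = {
    "bootstrap.servers": "kafka-broker.example.com:9092",  # placeholder address
    "acks": "all",               # wait for all in-sync replicas to acknowledge
    "enable.idempotence": True,  # retries won't introduce duplicate records
}

# With confluent-kafka installed, this would be used roughly as:
#   from confluent_kafka import Producer
#   p = Producer(producer_config)
#   p.produce("iot-events", value=b"payload")
#   p.flush()   # block until outstanding sends are acknowledged
print(producer_config["acks"])
```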
Also, if you want to play with the example at the end of the slide deck, I put together a VM that should make it easy to start up:
I think this is a good example architecture, although it could be made a lot better by integrating the Schema Registry and SAM, but those weren't available when this was created.