Support Questions
Find answers, ask questions, and share your expertise

Best IoT ingestion tool for feeding data to Spark?

I have 10000 devices sending messages with AMQP protocol to cloud. We have 100million to 700 million message per day. Target is in real-time analytics. What is alternatives for Hortonworks compatible ingestion tool to ingest data? I like to have all possible tools to table and evaluate pros and cons.


Super Guru

Here is an example of doing it with NIFI.


NiFi + Kafka + Spark Streaming

ProductDifficulty / CodingBenefits
Apache NiFi (part of HDF)Easy / UI DrivenFast, easy, extendible, clustered, no coding, works with 190+ sources/sinks/transformers
Apache SparkScala, Python, Java, R Complex CodingFast, flexible, does ML/Streaming/Batch/SQL in one unified platform. You must write code. You can interface Spark to NiFi via Kafka or Site-To-Site.
Apache FlinkScala, JavaSimilar to Spark, but possibly faster and more real-time, but far fewer supporters and users and packages. You can interface Flink to NiFi via Kafka or Site-To-Site.
Apache StormJavaPowerful streaming language that interfaces to NiFi via Kafka or Site-To-Site. Very mature solution with dashboards and UI administration.

Super Guru

@Kenny Ikeda

Andrew and Tim have already pointed you to Hortonworks Data Flow (powered by Apache Nifi). It supports consuming AMQP messages as well as all other protocols including MQTT, TCP and so on. Link here (for AMQP processor) and here (for all processors) . There are few major challenges in data ingestion for such high volume use cases like IoT. Let me give you an overview first:

1. Challenge to capture data from devices. In your case it's initially only 10K devices but I am assuming over time this will grow.

2. Challenge for bidirectional data transfer. While majority of the time data will be ingested from the devices, is it possible that you might have to send message back to the device? An alert, update or anything else?

3. If devices are spread over the country, then how do you move data back to your data center. Assume that you have devices in West, Central and East region. Then may be you'll have a cloud environment in West to capture data in west, cloud environment in east to capture data in east region and then may be your data center in central region is where you have Hadoop cluster where you finally persist data.

4. Like you mention, you might like to do some transformation and analytics as the data is flowing through. That is, doing analytics even before the data has been persisted.

5. Finally, is security important? how about lineage and information about where the data was originated and who touched it along the way?

This is precisely where Hortonworks Data Flow comes in. It packages Apache Nifi, to capture data from devices as well as enable bidirectional data transfer, perform analytics on streaming data by providing Apache Storm and providing Apache Kafka as its messaging system so you can move data to multiple locations. This means data cannot only be sent to Hadoop but also other systems/applications within the organization. This is very important distinction. Hortonworks data flow is totally independent of Hadoop (or HDP). Hadoop is just one of the many systems which might consume from HDF. All other applications and systems can connect to and get data through HDF (Kafka or Nifi or Storm - it's extremely flexible based on your use case).

All this is implemented in a secure fashion with full data lineage information available, ability to replay (very useful in debugging), back pressure and queuing capabilities.

Please feel free to ask if you have any followup questions.

New Contributor

@Kenny Ikeda

Try below combination:

Apache Flume + Kafka + Spark Streaming

methinks that it's worth a consideration but I am not sure about its compatibility with Hortonworks!!

As mqureshi listed down some of the main challenges which should be listed down by you before choosing any tool. I will say that if you are open to using any tool then you can try Apache flume with Kafka and spark streaming (Apache flume + Kafka + spark streaming). Even though flume can be used directly with spark streaming as source, but for the reliability purpose, Kafka can be used in between. You can also use flume as your sink too for returning the analytics result for storing in the DC via Kafka to flume. Or you can directly store the result by consuming data from Kafka broker!!

oh we test all this; Kafka + Spark streaming is a good combination. Flume works well for bridging the output of logging systems to Kafka; Kafka is now the defacto queue platform for Hadoop-based applications

The other thing to consider is using cloud infrastructure eventing as the protocol to forward data to a hadoop/spark/storm cluster running there. This is something you can with Microsoft Azure Event Hubs and HD/Insight; there's docs on the azure site on the topic. Advantage: someone else has the task of keeping the ingress service up. Disadvantage: you need to all dev/test in the cloud.

What you do get with nifi, independent of the actual binding to the hadoop stack, is the ability to design the data collection process all the way up to the collectors, including setting policy there on what to do when there's too much data being collected. That's important when you think of embedded uses where the bandwidth between the collector and the cloud infrastructure is variable: in times of low bandwidth you want your vehicle to focus on uploading position and velocity information, not wear-and-tear events from the shock absorbers. If you leave out that filtering to the server-side infra, you miss out the ability to control a chunk of the problem