Created on 03-13-2018 10:16 PM - edited 09-16-2022 01:42 AM
The concept of time is at the core of all Big
Data processing technologies, but it is particularly important in the world of data
stream processing. Indeed, it is reasonable to say that the way in which
different systems handle time-based processing is what separates the wheat
from the chaff, as it were, at least in the world of real-time stream
processing.
The demand for stream processing is increasing a
lot these days. A common need across Hadoop projects is to build up-to-date
indicators from streaming data.
Media analysis is a great use case to show how we can build a dashboard
of streaming analytics with NiFi, Kafka, Tranquility, Druid, and Superset.
This processing flow has these steps:
Tweet ingestion using Apache NiFi
Stream processing using Apache Kafka
Integrating data with Tranquility
OLAP database storage using Druid
Visualization using Apache Superset
Before putting our hands on coding, take a look at each component:
This process should give us a streaming
message like this:
"created_time":"Thu Mar 08 19:05:25 +0000 2018",
"time_zone":"São Paulo - Brasil",
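The created_time value in the sample message uses Twitter's classic timestamp layout, which is awkward to sort or index on. As a minimal sketch (the helper name to_iso8601 is hypothetical and not part of the original flow), this is one way such a value could be normalized to ISO 8601 in UTC before it reaches a time-indexed store:

```python
from datetime import datetime, timezone

# Layout of the timestamp seen in the sample message,
# e.g. "Thu Mar 08 19:05:25 +0000 2018".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def to_iso8601(created_time: str) -> str:
    """Convert a Twitter-style timestamp into an ISO 8601 string in UTC.

    Hypothetical helper for illustration only.
    """
    parsed = datetime.strptime(created_time, TWITTER_TIME_FORMAT)
    return parsed.astimezone(timezone.utc).isoformat()


print(to_iso8601("Thu Mar 08 19:05:25 +0000 2018"))
# prints 2018-03-08T19:05:25+00:00
```

In a NiFi flow the same conversion would more likely be done with a processor than with a script; the Python version only illustrates the format.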
Kafka is a real-time stream processor that
uses the publish-subscribe message pattern. We will use Kafka to receive
incoming messages and publish them to a specific topic-based queue (twitter_demo)
that Druid will subscribe to. Tranquility (the Druid indexer) will read
these messages and insert them into the Druid database.
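To make the publish-subscribe idea concrete, here is a toy in-memory model of a topic with per-group read positions. It is not the Kafka client API: the ToyBroker class, its methods, and the group name are all hypothetical, and a real broker adds partitions, persistence, and replication.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a topic-based log (NOT a Kafka client)."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic name -> ordered messages
        self._offsets = defaultdict(int)   # (topic, group) -> next index to read

    def publish(self, topic, message):
        """Append a message to the end of the topic's log."""
        self._topics[topic].append(message)

    def poll(self, topic, group):
        """Return the messages this consumer group has not read yet."""
        log = self._topics[topic]
        start = self._offsets[(topic, group)]
        self._offsets[(topic, group)] = len(log)
        return log[start:]


broker = ToyBroker()
broker.publish("twitter_demo", '{"time_zone": "São Paulo - Brasil"}')

# The first poll returns the message; the second finds nothing new.
first = broker.poll("twitter_demo", group="tranquility-kafka")
second = broker.poll("twitter_demo", group="tranquility-kafka")
print(len(first), len(second))
```

Because each group tracks its own position, a second subscriber with a different group name would still receive the same message, which is what lets several consumers share one topic.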
Use the commands below to create a Kafka topic called twitter_demo:
Now it’s time to get some
tranquility - sorry for the wordplay!
Tranquility is a friend of Druid and helps us send event streams to
Druid in real-time. It handles partitioning, replication, service discovery,
and schema rollover for us, seamlessly and without downtime. Tranquility is
written in Scala, and bundles idiomatic Java and Scala APIs that work nicely
with Finagle, Samza, Spark, Storm, and Trident.
Tranquility Kafka is an application which simplifies the ingestion of
data from Kafka. It is scalable and highly available through the use of Kafka
partitions and consumer groups, and can be configured to push data from
multiple Kafka topics into multiple Druid dataSources.
First things first: To read from a Kafka stream we will define a configuration
file to describe a data source name, a Kafka topic to read from, and some
properties of the data that we read. Save the below JSON configuration
This instructs Tranquility to read from the
topic “twitter_demo” and push the messages that it receives into a
Druid data source called “twitter_demo”. In the messages it reads,
Tranquility uses the _submission_time column (or key) as
the timestamp.
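As a rough sketch of what such a configuration can look like, the snippet below builds a config in Python and renders it as JSON. The layout follows the general shape of the example configuration distributed with Tranquility Kafka, but the values are assumptions rather than the ones used in this article: the ZooKeeper address, the dimension list (taken from the sample message), the granularities, the window period, and the count metric are placeholders.

```python
import json

DATASOURCE = "twitter_demo"   # Druid data source named in the text
TOPIC = "twitter_demo"        # Kafka topic named in the text
ZOOKEEPER = "localhost:2181"  # assumption: adjust to your cluster

config = {
    "dataSources": {
        DATASOURCE: {
            "spec": {
                "dataSchema": {
                    "dataSource": DATASOURCE,
                    "parser": {
                        "type": "string",
                        "parseSpec": {
                            "format": "json",
                            # Timestamp column named in the text.
                            "timestampSpec": {
                                "column": "_submission_time",
                                "format": "auto",
                            },
                            # Assumption: dimensions from the sample message.
                            "dimensionsSpec": {
                                "dimensions": ["created_time", "time_zone"]
                            },
                        },
                    },
                    "granularitySpec": {
                        "type": "uniform",
                        "segmentGranularity": "hour",  # assumption
                        "queryGranularity": "none",    # assumption
                    },
                    # Assumption: a simple row counter as the only metric.
                    "metricsSpec": [{"type": "count", "name": "count"}],
                },
                "ioConfig": {"type": "realtime"},
                "tuningConfig": {
                    "type": "realtime",
                    "maxRowsInMemory": "100000",
                    "intermediatePersistPeriod": "PT10M",
                    "windowPeriod": "PT10M",  # assumption
                },
            },
            "properties": {
                "task.partitions": "1",
                "task.replicants": "1",
                "topicPattern": TOPIC,
            },
        }
    },
    "properties": {
        "zookeeper.connect": ZOOKEEPER,
        "druid.discovery.curator.path": "/druid/discovery",
        "druid.selectors.indexing.serviceName": "druid/overlord",
        "commit.periodMillis": "15000",
        "consumer.numThreads": "2",
        "kafka.zookeeper.connect": ZOOKEEPER,
        "kafka.group.id": "tranquility-kafka",
    },
}

rendered = json.dumps(config, indent=2)
print(rendered)
```

Check the property names and defaults against the Tranquility version you are running before using anything like this.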
Druid is a rockin' exploratory analytical data store
capable of offering interactive queries over big data in real time.
In HDP/HDF, Druid can be used easily through Superset:
we can build our Druid data source and manage all Druid columns to fit our JSON
Phase 5: Superset Dashboard
We can use Superset for exploratory analysis and to define the JSON queries that we will execute against the Druid API and use to build our dashboard.
Once your Druid data source has been created, you can create your slices and put them all in your dashboard.
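For reference, the JSON queries mentioned above are plain documents in Druid's native query format. The sketch below builds a topN query that ranks time zones by tweet volume. The function name, the interval, and the "count" metric are assumptions for illustration; the metric must exist in your ingestion spec. In a typical deployment, such a body is sent as an HTTP POST to the Broker's /druid/v2 endpoint.

```python
import json


def build_topn_query(datasource, dimension, interval, threshold=10):
    """Build a Druid native topN query body (hypothetical helper)."""
    return {
        "queryType": "topN",
        "dataSource": datasource,
        "dimension": dimension,
        "metric": "tweets",    # rank by the aggregation defined below
        "threshold": threshold,
        "granularity": "all",  # one result bucket for the whole interval
        "aggregations": [
            # Assumption: a "count" metric was created at ingestion time.
            {"type": "longSum", "name": "tweets", "fieldName": "count"}
        ],
        "intervals": [interval],
    }


query = build_topn_query(
    datasource="twitter_demo",
    dimension="time_zone",
    interval="2018-03-08T00:00:00Z/2018-03-09T00:00:00Z",
)
print(json.dumps(query, indent=2))
```

Superset generates queries of this kind for each slice, so writing them by hand is mostly useful for debugging or for calling Druid directly.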
Some pictures are worth a thousand words:
1) Creating our slices
3) Saving all slices in our dashboard
4) Presenting our dashboard
In the end, we get a great
real-time Twitter dashboard with information about location, maps, and languages.
With a little more effort, we could even read each tweet individually to
see what a sentiment analysis of our customers looks like... but this is a matter for next time.