Created on 10-08-2016 04:17 PM - edited 08-17-2019 09:15 AM
All you need is some basic programming skill and a little
experience with AWS to get things kick-started.
Before we proceed, let me explain what a hybrid cloud environment actually is.
Hybrid cloud solutions combine the dynamic scalability
of public cloud solutions with the flexibility and control of your private
cloud.
Some of the benefits of hybrid computing include:
Business Continuity
More Opportunity For Innovation
Scalability
Increased Speed To Market
Risk Management (test the waters)
Improved Connectivity
Secure Systems
Kafka Mirroring
With Kafka's mirroring feature you can maintain a replica of
your existing Kafka cluster.
The following diagram shows how to use the MirrorMaker tool
to mirror a source Kafka cluster into a target (mirror) Kafka cluster.
The tool uses a Kafka consumer to consume
messages from the source cluster, and re-publishes those messages to the local
(target) cluster using an embedded Kafka producer.
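To make the consume-and-republish idea concrete, here is a minimal in-memory sketch in Java. It does not use the Kafka client library or MirrorMaker itself. The classes TopicLog and MirrorLoopSketch are hypothetical names invented for this illustration, and the "topics" are plain lists. It only models the loop described above: read a batch from the source at the current offset, append it to the target, then advance the offset.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory stand-in for a topic log. Hypothetical; not the Kafka API. */
final class TopicLog {
    private final List<String> records = new ArrayList<>();

    void append(String record) {
        records.add(record);
    }

    /** Returns up to max records starting at offset, similar in spirit to a consumer poll. */
    List<String> read(int offset, int max) {
        int end = Math.min(records.size(), offset + max);
        int start = Math.min(offset, end);
        return new ArrayList<>(records.subList(start, end));
    }
}

/** Hypothetical model of the mirror loop: consume from source, re-publish to target. */
final class MirrorLoopSketch {
    private final TopicLog source;
    private final TopicLog target;
    private int mirroredOffset = 0; // number of source records already copied

    MirrorLoopSketch(TopicLog source, TopicLog target) {
        this.source = source;
        this.target = target;
    }

    /** One pass: read a batch from the source, append it to the target, advance the offset. */
    int mirrorOnce(int batchSize) {
        List<String> batch = source.read(mirroredOffset, batchSize);
        for (String record : batch) {
            target.append(record);
        }
        mirroredOffset += batch.size();
        return batch.size();
    }

    public static void main(String[] args) {
        TopicLog source = new TopicLog();
        TopicLog target = new TopicLog();
        source.append("tweet-1");
        source.append("tweet-2");
        source.append("tweet-3");

        MirrorLoopSketch mirror = new MirrorLoopSketch(source, target);
        // Keep mirroring until a pass copies nothing.
        while (mirror.mirrorOnce(2) > 0) {
            // each pass copies at most two records
        }
        System.out.println(target.read(0, 10));
    }
}
```

The real tool also handles partitions, offset commits, and delivery failures, none of which this model attempts to cover.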
Use Case
Demonstrate a hybrid cloud solution using Kafka mirroring across regions.
Environment Architecture
The architecture above represents two cluster environments, a private cloud
and a public cloud. Data is replicated from the source Kafka cluster to the
target Kafka cluster with the help of the MirrorMaker tool, and analysis of the
data sets is performed by Spark Streaming clusters.
The internal environment stores all the data in HDFS, where it is
accessible through Hive external tables. Storing the data in HDFS ensures
that the raw data is never changed, so it can be used at any point
to resolve discrepancies that might occur in the real-time layer (target
cluster).
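As an illustration of how raw files in HDFS can be exposed through a Hive external table, the following Java sketch builds a DDL statement as a string. The table name, the single tweet_json column, and the HDFS path are assumptions made for this example; the article does not specify the schema or the storage layout.

```java
/** Builds a Hive DDL statement for an external table over raw files in HDFS. */
final class HiveExternalTableSketch {

    /**
     * Returns a CREATE EXTERNAL TABLE statement.
     * The single STRING column holding one raw record per line is an assumption.
     */
    static String createExternalTableDdl(String table, String hdfsLocation) {
        return "CREATE EXTERNAL TABLE IF NOT EXISTS " + table + " (\n"
             + "  tweet_json STRING\n"
             + ")\n"
             + "STORED AS TEXTFILE\n"
             + "LOCATION '" + hdfsLocation + "'";
    }

    public static void main(String[] args) {
        // Hypothetical table name and path, for illustration only.
        System.out.println(createExternalTableDdl("raw_tweets", "hdfs:///data/raw/tweets"));
    }
}
```

Because the table is external, dropping it in Hive removes only the table metadata and leaves the files in HDFS in place, which fits the goal of keeping the raw data untouched.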
The external environment receives the replicated data with the help of
MirrorMaker, and a Spark Streaming application processes that
data and stores it in Amazon S3. Crucial data that requires low
latency is maintained in Amazon S3 based on its TTL. The data is then pushed to
Amazon Redshift, where the user can issue low-latency queries and have the
results calculated on the go.
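One common way to load data from S3 into Redshift is the COPY command. The Java sketch below only assembles such a statement as a string. The table name, bucket prefix, IAM role ARN, and the JSON input format are placeholders and assumptions for illustration, and the article does not say which loading mechanism was actually used.

```java
/** Assembles an Amazon Redshift COPY statement that loads JSON files from S3. */
final class RedshiftCopySketch {

    /** All three arguments are caller-supplied; nothing here is taken from the article. */
    static String copyFromS3(String table, String s3Prefix, String iamRoleArn) {
        return "COPY " + table + "\n"
             + "FROM '" + s3Prefix + "'\n"
             + "IAM_ROLE '" + iamRoleArn + "'\n"
             + "FORMAT AS JSON 'auto'";
    }

    public static void main(String[] args) {
        // Placeholder values; substitute your own table, bucket, and role.
        System.out.println(copyFromS3(
                "tweets",
                "s3://example-bucket/tweets/",
                "arn:aws:iam::123456789012:role/ExampleRedshiftRole"));
    }
}
```

The resulting statement would be run against the Redshift cluster through any SQL client or JDBC connection.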
With the combined power of a hybrid environment and Kafka mirroring, you
can perform different types of data analysis over streams of data with low
latency.
Technology Stack
Source System – Twitter
Messaging System – Apache Kafka
Target System (Internal) – HDFS, Apache Hive
Target System (External) – Amazon S3, Amazon Redshift
AWS Instance Type – EMR
Streaming API – Apache Spark
Programming Language – Java
IDE – Eclipse
Build tool – Apache Maven
Operating System – CentOS 7
Workflow
A Flume agent is configured to pull live data from
Twitter and push it to the source Kafka broker, which is hosted internally.
The data is then replicated to a target Kafka
broker, which is hosted on AWS, with the help of the Kafka MirrorMaker tool.
In the internal environment, a Spark Streaming application picks up the
data from Kafka and stores all of it
in HDFS, where you can query it using Hive.
In parallel, another Spark
Streaming application running in the cloud environment reads the data
from the target Kafka cluster and stores it in Amazon S3, from where it is
pushed to Amazon Redshift.
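The replication step in the workflow above can be sketched as a pair of configuration sets plus the command that starts the legacy MirrorMaker tool. The broker addresses, group id, file names, and topic name below are hypothetical, and the script path depends on the installation. The sketch also assumes a MirrorMaker release whose consumer is configured with bootstrap.servers; older releases pointed the consumer at ZooKeeper instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

/** Builds the consumer and producer settings MirrorMaker reads, plus its launch command. */
final class MirrorMakerConfigSketch {

    /** Consumer side: points at the source (internal) cluster. */
    static Properties consumerProps(String sourceBrokers, String groupId) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", sourceBrokers);
        p.setProperty("group.id", groupId);
        return p;
    }

    /** Producer side: points at the target (AWS) cluster. */
    static Properties producerProps(String targetBrokers) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", targetBrokers);
        p.setProperty("acks", "all"); // wait for full acknowledgement before moving on
        return p;
    }

    /** Command line for the legacy MirrorMaker script; the script path is an assumption. */
    static List<String> command(String consumerFile, String producerFile, String topics) {
        return Arrays.asList(
                "bin/kafka-mirror-maker.sh",
                "--consumer.config", consumerFile,
                "--producer.config", producerFile,
                "--whitelist", topics);
    }

    public static void main(String[] args) {
        // Hypothetical host names and topic, for illustration only.
        Properties consumer = consumerProps("internal-broker:9092", "mirror-group");
        Properties producer = producerProps("aws-broker:9092");
        System.out.println("source = " + consumer.getProperty("bootstrap.servers"));
        System.out.println("target = " + producer.getProperty("bootstrap.servers"));
        System.out.println(String.join(" ",
                command("consumer.properties", "producer.properties", "tweets")));
    }
}
```

In practice the two property sets would be written to the files named on the command line, and the tool is usually run close to the target cluster so that the producer writes locally.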
Stay tuned for more amazing stuff and help the open-source community
to grow further by actively participating in the work we do to expand the
project.