Created on 05-09-2016 05:11 PM - edited 08-17-2019 12:32 PM
A question encountered by most organizations when attempting to take on new Big Data Initiatives or adapt current operations to keep up with the pace of innovation is What approach to take in architecting their streaming workflows and applications.
Because Twitter data is free and easily accessible I chose to examine twitter events in this article; however, twitter events are simply json messages and can be conceptualized as machine generated json packets for a more generalized streaming architecture.
Lets move forward with an evaluation of NiFi vs Python ingesting streaming events from twitter.
When approaching a problem there are often several tools and methods that could ultimately yield similar results.
In this article I will look at the problem of ingesting streaming twitter events and show how HDF and python can be used to create an application which achieves the goal.
I will first lay out a solution using both platforms, then I will point out different aspects and considerations.
To ingest data from Twitter, regardless of what solution you choose, API tokens are first required. Get those here:
First lets look at Python: You can get the source for the basic ingestion application here:
Tweepy is used to connect to twitter and begin receiving status objects.
import tweepy import threading, logging, time import string ###################################################################### # Authentication details. To obtain these visit dev.twitter.com ###################################################################### consumer_key = '<insert your consumer key>' consumer_secret = '<insert your consumer secret>' access_token = '<insert your access token>' access_token_secret = '<insert your access secret>'
Now to create a connection we initiate the connection in the main loop:
if __name__ == '__main__': listener = StdOutListener() #sign oath cert auth = tweepy.OAuthHandler(consumer_key, consumer_secret) auth.set_access_token(access_token, access_token_secret) #uncomment to use api in stream for data send/retrieve algorithms #api = tweepy.API(auth) stream = tweepy.Stream(auth, listener) ###################################################################### #Sample delivers a stream of 1% (random selection) of all tweets ###################################################################### client = KafkaClient("localhost:9092") producer = SimpleProducer(client) stream.sample()
And to begin parsing the messages we must do so programmatically and any changes must be made in the source code.
Now lets examine the process in NiFi.
To avoid redundancy look at this thorough tutorial which illustrates the use of NiFi to ingest streaming twitter data.
To create the connection we simply enter the API credentials into the twitter operator shown below:
Then we can write these JSON objects to HDFS using the PutHDFS operator or we can parse out any information of interest while the message is in flight using the EvaluateJsonPath operator.
Looking at these 2 approaches you can probably already see the benefits of choosing Nifi, Which is part of Hortonworks HDF "Data in Motion" platform.
With NiFi our architecture is not limited and can be easily maintained and extended.
Note that to write to Kafka we can simply add a PutKafka operator that branches the stream into a PutKafka operator and leave our original workflow intact. Also note the degree of programming knowledge required to manage and extend the Python application where NiFi can be managed visually with no need for code.
The key takeaways here are the following:
When making decisions about the technology to use in Big Data initiatives for streaming analytics platforms in an organization you should put a great deal of thought into:
1. Ease of implementation
2. Ease of extensibility
3. Openness of the technology
The third point can't be emphasized enough. One of the biggest obstacles that organizations face when attempting to implement or extend their existing analytics applications is the lack of openness and the lock in caused by past technology decisions.
Choosing to go with platforms like Hortonworks HDF and HDP allows for a complementary solution to existing architectures and technology that is already present in the organization and also leaves the architecture open to future innovations and allows the architecture and organization to keep up with the speed of innovation.