Member since: 07-30-2019 · Posts: 93 · Kudos Received: 96 · Solutions: 2
05-17-2017 04:32 PM · 2 Kudos
First you need to have RapidMiner downloaded and installed on your machine: https://my.rapidminer.com/nexus/account/index.html#downloads

Once installed, open RapidMiner and look at the list of operators. There is a link at the bottom left, "Get More Operators". Click the link, search for "Radoop", select both packages, and click Install. After RapidMiner restarts you will see the new operators we downloaded in the Extensions folder.

Now we need to configure the connection. In the toolbar select "Connections", then "Manage Radoop Connections", then "+ New Connection". If you have your Hadoop config files available you can use those to set the properties; otherwise select "Manual". Select the Hadoop version you have (in my case "Hortonworks 2.x") and supply the master URL. If you have multiple masters, select the check box and provide the details. Click "OK", then click ">> Quick Test". If the test succeeds you are all set to read from Hive.

Drag a "Radoop Nest" operator onto the canvas, select it, and on the right-hand side of the IDE choose the connection we created earlier. Now double-click the Radoop Nest operator to enter the nested canvas. Drag a "Retrieve from Hive" operator onto the canvas (located under Radoop --> Data Access), click the operator, and select the table you wish to retrieve. Connect the out port of the operator to the out port on the edge of the canvas by dragging from one to the other. Now click the Play button and wait for it to complete. Click the out port and select "Show sample data".

Hope this was helpful! More to come on RapidMiner + Hortonworks...
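If you want to double-check outside RapidMiner that the same Hive table is reachable, here is a minimal sketch using PyHive. PyHive is not part of this tutorial, and the host, user, and table name below are placeholders for your environment:

# Sketch: confirm the table you plan to use in "Retrieve from Hive" is readable.
# Connection details and table name are placeholders; adjust for your cluster.
from pyhive import hive

conn = hive.Connection(host='sandbox.hortonworks.com', port=10000, username='hive')
cursor = conn.cursor()
cursor.execute('SELECT * FROM my_table LIMIT 10')
for row in cursor.fetchall():
    print(row)
cursor.close()
conn.close()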
02-16-2017 07:26 PM · 3 Kudos
To begin, log into Ambari and from the Views section select Workflow Manager. Select "Create New Workflow", give your workflow a name, and then click on the line that connects the start and end nodes to add a step. Select the Email step to add it to your flow, then select it and click the settings (gear) icon. Fill in all required fields with your custom settings and click Save. Now we have a workflow capable of sending an email. That was easy, and no XML needed to be modified (a big reason many people have never used Oozie). Click Submit and provide a path (one that doesn't already exist) for the workflow to be saved. Now go to the Dashboard and find your submitted workflow. You can click Run from the Dashboard to run the flow, or you can select "Run on Submit" in the step before saving and submitting the flow.
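For reference, this is roughly the kind of workflow definition that Workflow Manager generates behind the scenes, and that you would otherwise have had to write by hand. The recipient, subject, and body are placeholders, and the exact XML your workflow produces may differ:

<workflow-app name="email-demo" xmlns="uri:oozie:workflow:0.5">
    <start to="send-email"/>
    <action name="send-email">
        <email xmlns="uri:oozie:email-action:0.1">
            <to>someone@example.com</to>
            <subject>Workflow notification</subject>
            <body>Hello from the Workflow Manager view.</body>
        </email>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Email action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>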
09-29-2016 01:59 AM · 7 Kudos
One key feature in Apache Atlas is the ability to track data lineage in your Data Lake visually. This allows you to very quickly understand the lifecycle of your data and answer questions about where the data originated and how it relates to other data in the Data Lake. To illustrate this we will use our own Twitter data to perform sentiment analytics on our tweets in Hive and see how this is reflected in Apache Atlas.

By now you should have a working sandbox or HDP environment up and running with Atlas enabled. If not, please take a look at the following tutorial to help get you started: Getting Started with Atlas in HDP 2.5

First we need to gather the data sets we will use in this tutorial. Log into Twitter, click on your account settings at the top right, and select "Your Twitter Data" from the list on the left side of your screen. Now enter your password, and in the "Other Data" section at the bottom select "Twitter Archive". This might take a little time, but you will get a link to download your archive soon.

While you wait on that data, let's quickly grab the sentiment library we will use in this tutorial. Here is the zip you will need to download: AFINN Data. In it we will need the AFINN-111.txt file.

Now that we have the data, go to the Hive View through Ambari and click the "Upload Table" link. Navigate to the tweets.csv file located in your Twitter archive. ***If you need a Twitter dataset to use for this step, I have made mine public here: DataSample. You will need to click the small gear icon next to the file type to specify that the header row exists. Now upload the table and repeat the steps for the AFINN-111.txt file. Name it sentiment_dictionary to work with the rest of this tutorial, and make the column names "word" and "rating".

Now that we have the required data, let's perform some transformations in Hive. Back in Ambari, open a new query window in your Hive View. The sentiment analysis has been adapted from the article here: https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/ to fit this tutorial's dataset.

Create a table to store the words in our tweet text as an array:

CREATE TABLE words_array AS SELECT tweet_id AS id, split(text,' ') AS words FROM tweets;

Create a table that explodes the array into individual words:

CREATE TABLE tweet_word AS SELECT id AS id, word FROM words_array LATERAL VIEW explode(words) w AS word;

Now JOIN the sentiment_dictionary to the tweet_word table:

CREATE TABLE word_join AS SELECT tweet_word.id, tweet_word.word, sentiment_dictionary.rating FROM tweet_word LEFT OUTER JOIN sentiment_dictionary ON (tweet_word.word = sentiment_dictionary.word);

Great! Now we have each word rated for sentiment, on a scale from -5 to +5. What you decide to do with this data from here is a topic for a different article (though there is a small sketch at the end of this post); however, now that we have created the word_join table we can jump back to Atlas to inspect the lineage information associated with our new dataset.

In the Atlas UI, search for word_join. Notice the connections to all parent tables and the recording of the actual SQL statements we executed during the transformations.
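As a footnote to the "what to do with this data" question above, here is one quick direction: score each tweet by summing its word ratings. The SELECT can be run directly in the Hive View; the Python wrapper below is just a sketch, assumes PyHive is installed, and uses placeholder connection details:

# Sketch: per-tweet sentiment score from the word_join table built above.
# Host and username are placeholders; the CAST guards against rating being a string column.
from pyhive import hive

conn = hive.Connection(host='sandbox.hortonworks.com', port=10000, username='hive')
cursor = conn.cursor()
cursor.execute("""
    SELECT id, SUM(COALESCE(CAST(rating AS INT), 0)) AS sentiment_score
    FROM word_join
    GROUP BY id
    ORDER BY sentiment_score DESC
    LIMIT 10
""")
for tweet_id, score in cursor.fetchall():
    print(tweet_id, score)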
04-24-2018 09:04 PM
Keep in mind, the Taxonomy feature is still in Tech Preview (i.e., not recommended for production use) and is not supported. Taxonomy will be production ready (GA) in HDP 3.0.
09-21-2016 04:29 PM · 5 Kudos
Once you have Atlas up and running (see this for getting started), you will want to create your first tag and use it to tag data to begin exploring what Atlas can do. Let's start by creating a tag called "PII", which we will later use to tag Personally Identifiable data.

Log into Atlas and select the + icon on the homepage. Enter "PII" for the tag name and click Create. That's it! Now we have a tag to use.

Now click on the Search tab, select "DSL", choose "hive_table" from the drop-down, hit Enter, and select the "customer" table. You should see the summary details for the customer table in Hive. Select the Schema tab. Here you see all the columns available in the customer data table. Let's mark the "account_num" field as "PII": next to "account_num" click the + icon and select "PII" from the drop-down. Now the column has been tagged; it is searchable from Atlas as well as configured to be administered through Hive for auditing and permissions. Click the "Tags" tab in Atlas and search for "PII" to see your field show up.
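If you prefer to script the same lookup instead of using the UI, the sketch below runs the DSL search over Atlas's REST API. The discovery endpoint path and the admin credentials are assumptions based on the Atlas version shipped with HDP 2.5, so verify them against your own install:

# Sketch: DSL search for the customer table via the Atlas REST API.
# Endpoint path and credentials are assumptions -- check your Atlas version.
import requests

ATLAS_URL = 'http://sandbox.hortonworks.com:21000'
resp = requests.get(
    ATLAS_URL + '/api/atlas/discovery/search/dsl',
    params={'query': 'hive_table where name = "customer"'},
    auth=('admin', 'admin'),
)
resp.raise_for_status()
print(resp.json())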
10-11-2016 02:06 PM
@Vasilis Vagias When I log into Ambari as holger_gov, I have access to Hive data through the Hive View. I also have the Ranger Tagsync service and the HBase Region Server running. Still it does not work...
01-15-2019 05:22 AM
Hi Vasilis, does the method you have outlined only work when R is installed on an edge node of the HDP cluster (i.e. R and HDFS are colocated)? I'm exploring how R (say, installed on a workstation) can connect to HDFS running on a separate/remote server, in which case I'm unsure how to define the connection details to Hadoop. Are you able to assist?
05-09-2016 05:11 PM · 2 Kudos
A question encountered by most organizations when taking on new Big Data initiatives, or adapting current operations to keep up with the pace of innovation, is what approach to take in architecting their streaming workflows and applications.

Because Twitter data is free and easily accessible, I chose to examine Twitter events in this article; however, Twitter events are simply JSON messages and can be thought of as machine-generated JSON packets in a more general streaming architecture.

Let's move forward with an evaluation of NiFi vs. Python for ingesting streaming events from Twitter. When approaching a problem there are often several tools and methods that could ultimately yield similar results. In this article I will look at the problem of ingesting streaming Twitter events and show how HDF and Python can each be used to create an application that achieves the goal. I will first lay out a solution on both platforms, then point out different aspects and considerations.
To ingest data from Twitter, regardless of which solution you choose, API tokens are required. Get those here: Twitter Developer Console

First let's look at Python. You can get the source for the basic ingestion application here: Twitter Python Example
Tweepy is used to connect to Twitter and begin receiving status objects:

import tweepy
import threading, logging, time
import string
# kafka-python's (pre-1.0) simple producer API, used below to publish tweets
from kafka import SimpleProducer, KafkaClient

######################################################################
# Authentication details. To obtain these visit dev.twitter.com
######################################################################
consumer_key = '<insert your consumer key>'
consumer_secret = '<insert your consumer secret>'
access_token = '<insert your access token>'
access_token_secret = '<insert your access secret>'

Now we initiate the connection in the main loop:

if __name__ == '__main__':
    listener = StdOutListener()  # stream listener defined in the full source linked above
    # sign OAuth cert
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    # uncomment to use the API in the stream for data send/retrieve algorithms
    # api = tweepy.API(auth)
    stream = tweepy.Stream(auth, listener)
    ######################################################################
    # sample() delivers a stream of 1% (random selection) of all tweets
    ######################################################################
    client = KafkaClient("localhost:9092")
    producer = SimpleProducer(client)
    stream.sample()
To begin parsing the messages we must do so programmatically, and any changes must be made in the source code.
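For example, a minimal listener that pushes each raw tweet to Kafka might look like the sketch below. This is not the exact listener from the linked source; it is an illustration, and it assumes kafka-python's pre-1.0 SimpleProducer API, a local broker, and a hypothetical topic name:

# Sketch only: forward each raw status JSON from the Twitter stream to Kafka.
import tweepy
from kafka import SimpleProducer, KafkaClient

class KafkaForwardingListener(tweepy.StreamListener):
    def __init__(self):
        super(KafkaForwardingListener, self).__init__()
        client = KafkaClient("localhost:9092")      # assumes a local broker
        self.producer = SimpleProducer(client)

    def on_data(self, data):
        # 'data' is the raw JSON string for one tweet; publish it unchanged
        self.producer.send_messages("twitter_stream", data.encode("utf-8"))
        return True

    def on_error(self, status_code):
        print("Stream error: %s" % status_code)
        return True     # keep the stream alive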
Now let's examine the process in NiFi. To avoid redundancy, take a look at this thorough tutorial, which illustrates the use of NiFi to ingest streaming Twitter data: Twitter NiFi Example

To create the connection we simply enter the API credentials into the GetTwitter processor. Then we can write the JSON objects to HDFS using the PutHDFS processor, or we can parse out any information of interest while the message is in flight using the EvaluateJsonPath processor.

Looking at these two approaches, you can probably already see the benefits of choosing NiFi, which is part of Hortonworks HDF, the "Data in Motion" platform. With NiFi our architecture is not limited and can be easily maintained and extended. Note that to write to Kafka we can simply branch the stream into a PutKafka processor and leave our original workflow intact. Also note the degree of programming knowledge required to manage and extend the Python application, whereas NiFi can be managed visually with no need for code.

The key takeaways are the following. When making decisions about the technology to use for streaming analytics platforms in a Big Data initiative, you should put a great deal of thought into:

1. Ease of implementation
2. Ease of extensibility
3. Openness of the technology

The third point can't be emphasized enough. One of the biggest obstacles organizations face when attempting to implement or extend their existing analytics applications is the lack of openness and the lock-in caused by past technology decisions. Choosing platforms like Hortonworks HDF and HDP provides a solution that complements the architectures and technology already present in the organization, leaves the architecture open to future innovations, and allows the organization to keep up with the speed of innovation.
05-02-2016 08:56 PM
Love this!! Already sent it to some close sales reps for a good laugh 🙂 Great job Dan!
02-15-2017 11:11 AM
Thank you @Ali Bajwa for the good tutorial. I am trying this example with a difference: my NiFi is local and I am trying to put tweets into a remote Solr. Solr is in a VM that contains the Hortonworks sandbox. Unfortunately I am getting this error on the PutSolrContentStream processor:

PutSolrContentStream[id=f6327477-fb7d-4af0-ec32-afcdb184e545] Failed to send StandardFlowFileRecord[uuid=9bc39142-c02c-4fa2-a911-9a9572e885d0,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1487148463852-14, container=default, section=14], offset=696096, length=2589],offset=0,name=103056151325300.json,size=2589] to Solr due to org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://172.17.0.2:8983/solr/tweets_shard1_replica1; routing to connection_failure: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://172.17.0.2:8983/solr/tweets_shard1_replica1;

Could you help me? Thanks, Shanghoosh