Member since: 07-30-2019
Posts: 93
Kudos Received: 96
Solutions: 2
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1297 | 08-12-2016 02:57 PM |
 | 1798 | 05-02-2016 09:14 PM |
09-21-2016
04:29 PM
5 Kudos
Once you have Atlas up and running (see this for getting started), you will want to create your first tag and use it to tag data to begin exploring what Atlas can do. Let's start by creating a tag called "PII", which we will later use to tag personally identifiable data. Log into Atlas and select the + icon on the homepage. Enter "PII" for the tag name and click Create. That's it! Now we have a tag to use.

Next, click on the Search tab, select "dsl", choose "hive_table" from the drop-down, hit Enter, and select the "customer" table. You should see the summary details for the customer table in Hive. Select the Schema tab; here you see all columns available in the customer table. Let's mark the "account_num" field as "PII". Next to "account_num", click the + icon and select "PII" from the drop-down. The column is now tagged: it is searchable from Atlas, and the tag can be used to drive auditing and permissions on the Hive data. Click the "Tags" tab in Atlas and search for "PII" to see your field show up.
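If you prefer to script these steps instead of using the UI, here is a minimal sketch against the Atlas REST API. It assumes an Atlas version that exposes the v2 endpoints; the host, credentials, and GUID below are placeholders for your environment.

import json
import requests

# Placeholder endpoint and credentials; adjust for your cluster.
ATLAS = 'http://sandbox.hortonworks.com:21000/api/atlas/v2'
AUTH = ('admin', 'admin')
HEADERS = {'Content-Type': 'application/json'}

# 1. Create the "PII" classification (tag).
typedef = {"classificationDefs": [{
    "name": "PII",
    "description": "Personally identifiable information",
    "superTypes": [],
    "attributeDefs": []
}]}
requests.post(ATLAS + '/types/typedefs', auth=AUTH, headers=HEADERS,
              data=json.dumps(typedef)).raise_for_status()

# 2. Attach the tag to an entity (e.g. the account_num column) by its GUID,
#    which you can look up via the Atlas search UI or search API.
guid = '<guid-of-the-account_num-column>'
classification = [{"typeName": "PII"}]
requests.post(ATLAS + '/entity/guid/' + guid + '/classifications',
              auth=AUTH, headers=HEADERS,
              data=json.dumps(classification)).raise_for_status()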
10-11-2016
02:06 PM
@Vasilis Vagias When I log in to Ambari as holger_gov, I have access to Hive data through the Hive view. I also have the Ranger Tagsync service and HBase Region Server running. Still, it does not work...
07-06-2018
10:41 AM
I'd just like to say a big thank you to Michael Young for fixing this problem for me on the 2.6.5 sandbox. BTW: how would one get Ambari Infra to start before Atlas every time?
08-26-2016
05:14 PM
That did the trick! Thanks @Constantin Stanca!
08-12-2016
02:57 PM
You must verify that the solrconfig.xml is well-formed XML and isn't missing entities. You can use this website to validate the XML: XML Validator. You can also get a fresh copy of the solrconfig.xml file from GitHub if you have modified the original and don't have a backup: solrconfig.xml
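If you'd rather check the file locally than paste it into a website, a quick well-formedness check can be done with Python's standard library (the path is just an example; point it at your actual solrconfig.xml):

import xml.etree.ElementTree as ET

# Raises xml.etree.ElementTree.ParseError if the XML is malformed.
ET.parse('solrconfig.xml')
print('solrconfig.xml is well-formed')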
01-15-2019
05:22 AM
Hi Vasilis, does the method you have outlined only work when R is installed on an edge node of the HDP cluster (i.e., R and HDFS are colocated)? I'm exploring how R (say, installed on a workstation) can connect to HDFS running on a separate/remote server(s), in which case I'm unsure how to define the connection details to Hadoop. Are you able to assist?
05-09-2016
05:11 PM
2 Kudos
A question most organizations encounter when taking on new big data initiatives, or adapting current operations to keep pace with innovation, is what approach to take in architecting their streaming workflows and applications.
Because Twitter data is free and easily accessible, I chose to examine Twitter events in this article; however, Twitter events are simply JSON messages and can be thought of as machine-generated JSON packets in a more generalized streaming architecture.
Let's move forward with an evaluation of NiFi vs. Python for ingesting streaming events from Twitter.
When approaching a problem, there are often several tools and methods that could ultimately yield similar results. In this article I will look at the problem of ingesting streaming Twitter events and show how HDF and Python can each be used to create an application that achieves the goal. I will first lay out a solution on both platforms, then point out the differing aspects and considerations.
To ingest data from Twitter, regardless of what solution you choose, API tokens are first required.
Get those here:
Twitter Developer Console
First, let's look at Python:
You can get the source for the basic ingestion application here:
Twitter Python Example
Tweepy is used to connect to Twitter and begin receiving status objects:

import tweepy
import threading, logging, time
import string
from kafka import KafkaClient, SimpleProducer  # kafka-python (pre-1.0 API)

######################################################################
# Authentication details. To obtain these visit dev.twitter.com
######################################################################
consumer_key = '<insert your consumer key>'
consumer_secret = '<insert your consumer secret>'
access_token = '<insert your access token>'
access_token_secret = '<insert your access secret>'
Now we initiate the connection in the main loop (StdOutListener is the stream listener from the linked source; a minimal sketch of it appears further below):

if __name__ == '__main__':
    listener = StdOutListener()

    # Sign the OAuth request
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    # Uncomment to use the REST API inside the stream listener
    # api = tweepy.API(auth)

    stream = tweepy.Stream(auth, listener)

    # Kafka producer the listener uses to forward tweets
    client = KafkaClient("localhost:9092")
    producer = SimpleProducer(client)

    ######################################################################
    # sample() delivers a stream of ~1% (random selection) of all tweets
    ######################################################################
    stream.sample()
And to parse the messages we must do so programmatically: any change to the parsing or routing logic means a change to the application's source code.
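For context, here is a minimal sketch of the StdOutListener referenced above, assuming tweepy 3.x and the older kafka-python SimpleProducer API; the topic name and the fields extracted are just examples of the parsing logic that has to live in code.

import json
import tweepy

class StdOutListener(tweepy.StreamListener):
    # Receives raw status JSON from Twitter and forwards it to Kafka.
    # 'producer' is the module-level SimpleProducer created in the main block.
    def on_data(self, raw_data):
        tweet = json.loads(raw_data)
        # Any parsing or filtering happens here, in code; changing it means
        # changing the application source.
        text = tweet.get('text', '')
        producer.send_messages('twitter-stream', raw_data.encode('utf-8'))
        print(text)
        return True

    def on_error(self, status_code):
        # Returning False disconnects the stream on errors (e.g. 420 rate limiting).
        return False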
Now let's examine the process in NiFi. To avoid redundancy, see this thorough tutorial, which illustrates the use of NiFi to ingest streaming Twitter data: Twitter NiFi Example
To create the connection, we simply enter the API credentials into the GetTwitter processor. Then we can write these JSON objects to HDFS using the PutHDFS processor, or parse out any information of interest while the message is in flight using the EvaluateJsonPath processor. Looking at these two approaches, you can probably already see the benefits of choosing NiFi, which is part of the Hortonworks HDF "Data in Motion" platform.
With NiFi, our architecture is not locked in and can be easily maintained and extended.
Note that to write to Kafka we can simply branch the stream into a PutKafka processor and leave the original workflow intact.
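As a rough illustration (the attribute names and JsonPath expressions below are only examples, not the tutorial's exact configuration), an EvaluateJsonPath processor pulling a few fields out of each tweet in flight might be configured like this, with each dynamic property becoming a flowfile attribute:

Destination: flowfile-attribute
twitter.text: $.text
twitter.user: $.user.screen_name
twitter.lang: $.lang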
Also note the degree of programming knowledge required to manage and extend the Python application, whereas the NiFi flow can be managed visually with no need for code.

The key takeaways are the following. When making decisions about the technology to use for streaming analytics platforms in a big data initiative, put a great deal of thought into:

1. Ease of implementation
2. Ease of extensibility
3. Openness of the technology

The third point can't be emphasized enough. One of the biggest obstacles organizations face when attempting to implement or extend their existing analytics applications is the lack of openness and the lock-in caused by past technology decisions. Choosing platforms like Hortonworks HDF and HDP gives you a solution that complements the architecture and technology already present in the organization, while leaving the architecture open to future innovation and able to keep pace with it.
10-19-2018
07:42 AM
Update: The tag propagation feature (ATLAS-1821) was released as part of HDP 3.0.0.
05-02-2016
08:56 PM
Love this!! Already sent it to some close sales reps for a good laugh 🙂 Great job Dan!
04-14-2016
03:55 PM
Additionally, you can find links to the complete set of HDF 1.2 repo locations in the HDF 1.2 Release Notes: http://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2/bk_HDF_RelNotes/content/ch_hdf_relnotes.html#hdf_repo