Member since: 07-30-2019
Posts: 93
Kudos Received: 96
Solutions: 2
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1297 | 08-12-2016 02:57 PM |
 | 1798 | 05-02-2016 09:14 PM |
09-21-2016
04:29 PM
5 Kudos
Once you have Atlas up and running (see this for getting started), you will want to create your first tag and use it to tag data to begin exploring what Atlas can do. Let's start by creating a tag called "PII", which we will later use to tag personally identifiable data. Log into Atlas and select the + icon on the homepage. Enter "PII" for the tag name and click Create. That's it! Now we have a tag to use.

Next, click on the Search tab, select "dsl", choose "hive_table" from the drop-down, hit Enter, and select the "customer" table. You should see the summary details for the customer table in Hive. Select the Schema tab; here you see all columns available in the customer table. Let's mark the "account_num" field as "PII". Next to "account_num", click the + icon and select "PII" from the drop-down. The column is now tagged: it is searchable from Atlas, and the tag can be used to drive auditing and permissions on the Hive data. Click the "Tags" tab in Atlas and search for "PII" to see your field show up.
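If you prefer to script these steps instead of using the UI, here is a minimal sketch against the Atlas REST API. It assumes an Atlas version that exposes the v2 endpoints; the host, credentials, and GUID below are placeholders for your environment.

import json
import requests

# Placeholder endpoint and credentials; adjust for your cluster.
ATLAS = 'http://sandbox.hortonworks.com:21000/api/atlas/v2'
AUTH = ('admin', 'admin')
HEADERS = {'Content-Type': 'application/json'}

# 1. Create the "PII" classification (tag).
typedef = {"classificationDefs": [{
    "name": "PII",
    "description": "Personally identifiable information",
    "superTypes": [],
    "attributeDefs": []
}]}
requests.post(ATLAS + '/types/typedefs', auth=AUTH, headers=HEADERS,
              data=json.dumps(typedef)).raise_for_status()

# 2. Attach the tag to an entity (e.g. the account_num column) by its GUID,
#    which you can look up via the Atlas search UI or search API.
guid = '<guid-of-the-account_num-column>'
classification = [{"typeName": "PII"}]
requests.post(ATLAS + '/entity/guid/' + guid + '/classifications',
              auth=AUTH, headers=HEADERS,
              data=json.dumps(classification)).raise_for_status()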
10-11-2016
02:06 PM
@Vasilis Vagias When I log in to Ambari as holger_gov, I have access to Hive data through the Hive view. I also have the Ranger Tagsync service and HBase Region Server running. Still, it does not work...
07-06-2018
10:41 AM
I'd just like to say a big thank you to Michael Young for fixing this problem for me on the 2.6.5 sandbox. BTW: how would one get Ambari Infra to start before Atlas every time?
08-26-2016
05:14 PM
That did the trick! Thanks @Constantin Stanca!
08-12-2016
02:57 PM
You must verify that the solrconfig.xml is well-formed XML and isn't missing entities. You can use this website to validate the XML: XML Validator. You can also get a fresh copy of the solrconfig.xml file from GitHub if you have modified the original and don't have a backup: solrconfig.xml
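If you'd rather check the file locally than paste it into a website, a quick well-formedness check can be done with Python's standard library (the path is just an example; point it at your actual solrconfig.xml):

import xml.etree.ElementTree as ET

# Raises xml.etree.ElementTree.ParseError if the XML is malformed.
ET.parse('solrconfig.xml')
print('solrconfig.xml is well-formed')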
01-15-2019
05:22 AM
Hi Vasilis, does the method you have outlined only work when R is installed on an edge node of the HDP cluster (i.e., R and HDFS are colocated)? I'm exploring how R (say, installed on a workstation) can connect to HDFS running on a separate/remote server(s), in which case I'm unsure how to define the connection details to Hadoop. Are you able to assist?
05-09-2016
05:11 PM
2 Kudos
A question most organizations encounter when taking on new big data initiatives, or adapting current operations to keep pace with innovation, is what approach to take in architecting their streaming workflows and applications.
Because Twitter data is free and easily accessible, I chose to examine Twitter events in this article; however, Twitter events are simply JSON messages and can be thought of as machine-generated JSON packets in a more generalized streaming architecture.
Let's move forward with an evaluation of NiFi vs. Python for ingesting streaming events from Twitter.
When approaching a problem, there are often several tools and methods that could ultimately yield similar results. In this article I will look at the problem of ingesting streaming Twitter events and show how HDF and Python can each be used to create an application that achieves the goal. I will first lay out a solution on both platforms, then point out the differing aspects and considerations.
To ingest data from Twitter, regardless of what solution you choose, API tokens are first required.
Get those here:
Twitter Developer Console
First, let's look at Python:
You can get the source for the basic ingestion application here:
Twitter Python Example
Tweepy is used to connect to Twitter and begin receiving status objects:

import tweepy
import threading, logging, time
import string
from kafka import KafkaClient, SimpleProducer  # kafka-python (pre-1.0 API)

######################################################################
# Authentication details. To obtain these visit dev.twitter.com
######################################################################
consumer_key = '<insert your consumer key>'
consumer_secret = '<insert your consumer secret>'
access_token = '<insert your access token>'
access_token_secret = '<insert your access secret>'
Now we initiate the connection in the main loop (StdOutListener is the stream listener from the linked source; a minimal sketch of it appears further below):

if __name__ == '__main__':
    listener = StdOutListener()

    # Sign the OAuth request
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    # Uncomment to use the REST API inside the stream listener
    # api = tweepy.API(auth)

    stream = tweepy.Stream(auth, listener)

    # Kafka producer the listener uses to forward tweets
    client = KafkaClient("localhost:9092")
    producer = SimpleProducer(client)

    ######################################################################
    # sample() delivers a stream of ~1% (random selection) of all tweets
    ######################################################################
    stream.sample()
And to parse the messages we must do so programmatically: any change to the parsing or routing logic means a change to the application's source code.
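For context, here is a minimal sketch of the StdOutListener referenced above, assuming tweepy 3.x and the older kafka-python SimpleProducer API; the topic name and the fields extracted are just examples of the parsing logic that has to live in code.

import json
import tweepy

class StdOutListener(tweepy.StreamListener):
    # Receives raw status JSON from Twitter and forwards it to Kafka.
    # 'producer' is the module-level SimpleProducer created in the main block.
    def on_data(self, raw_data):
        tweet = json.loads(raw_data)
        # Any parsing or filtering happens here, in code; changing it means
        # changing the application source.
        text = tweet.get('text', '')
        producer.send_messages('twitter-stream', raw_data.encode('utf-8'))
        print(text)
        return True

    def on_error(self, status_code):
        # Returning False disconnects the stream on errors (e.g. 420 rate limiting).
        return False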
Now let's examine the process in NiFi. To avoid redundancy, see this thorough tutorial, which illustrates the use of NiFi to ingest streaming Twitter data: Twitter NiFi Example
To create the connection, we simply enter the API credentials into the GetTwitter processor. Then we can write these JSON objects to HDFS using the PutHDFS processor, or parse out any information of interest while the message is in flight using the EvaluateJsonPath processor. Looking at these two approaches, you can probably already see the benefits of choosing NiFi, which is part of the Hortonworks HDF "Data in Motion" platform.
With NiFi, our architecture is not locked in and can be easily maintained and extended.
Note that to write to Kafka we can simply branch the stream into a PutKafka processor and leave the original workflow intact.
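As a rough illustration (the attribute names and JsonPath expressions below are only examples, not the tutorial's exact configuration), an EvaluateJsonPath processor pulling a few fields out of each tweet in flight might be configured like this, with each dynamic property becoming a flowfile attribute:

Destination: flowfile-attribute
twitter.text: $.text
twitter.user: $.user.screen_name
twitter.lang: $.lang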
Also note the degree of programming knowledge required to manage and extend the Python application, whereas the NiFi flow can be managed visually with no need for code.

The key takeaways are the following. When making decisions about the technology to use for streaming analytics platforms in a big data initiative, put a great deal of thought into:

1. Ease of implementation
2. Ease of extensibility
3. Openness of the technology

The third point can't be emphasized enough. One of the biggest obstacles organizations face when attempting to implement or extend their existing analytics applications is the lack of openness and the lock-in caused by past technology decisions. Choosing platforms like Hortonworks HDF and HDP gives you a solution that complements the architecture and technology already present in the organization, while leaving the architecture open to future innovation and able to keep pace with it.
10-19-2018
07:42 AM
Update: The tag propagation feature (ATLAS-1821) was released as part of HDP 3.0.0.
05-02-2016
08:56 PM
Love this!! Already sent it to some close sales reps for a good laugh 🙂 Great job Dan!
04-14-2016
03:55 PM
Additionally, you can find links to the complete set of HDF 1.2 repo locations in the HDF 1.2 Release Notes: http://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2/bk_HDF_RelNotes/content/ch_hdf_relnotes.html#hdf_repo