Created on 09-29-2016 01:59 AM - edited 08-17-2019 09:31 AM
One key feature of Apache Atlas is the ability to visually track data lineage in your Data Lake. This allows you to quickly understand the lifecycle of your data and answer questions about where the data originated and how it relates to other data in the Data Lake.
To illustrate this, we will use our own Twitter data to perform sentiment analytics on our tweets in Hive and see how this is reflected in Apache Atlas.
By now you should have a working sandbox or HDP environment up and running with Atlas enabled. If not, please take a look at the following tutorial to help get you started: Getting Started with Atlas in HDP 2.5
First we need to gather the data sets we will use in this tutorial.
Log in to Twitter
Click your account settings at the top right
Select "Your Twitter Data" from the list on the left side of your screen
Enter your password, then in the "Other Data" section at the bottom select "Twitter Archive"
This might take a little time, but you will soon receive a link to download your archive.
While you wait for that data, let's quickly grab the sentiment library we will use in this tutorial.
Here is the zip you will need to download: AFINN Data
In it we will need the AFINN-111.txt file.
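Each line of AFINN-111.txt is a word followed by a tab and an integer valence rating from -5 (most negative) to +5 (most positive), along these lines:

abandon	-2
outstanding	5

This tab-delimited, two-column layout is why we will name the columns "word" and "rating" when we upload the file later in this tutorial.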
Now that we have the data, let's go to the Hive View in Ambari and click the "Upload Table" link.
Now navigate to the tweets.csv file located in your Twitter archive.
***If you need a Twitter dataset to use for this step, I have made mine public here: DataSample
You will need to click the small gear icon next to the file type to specify that a header row exists.
Now upload the table and repeat the steps for the AFINN-111.txt file. Name it sentiment_dictionary to work with the rest of this tutorial.
Make the column names "word" and "rating".
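If you prefer the query editor to the Upload Table UI, the same table can be created by hand. This is only a sketch: it assumes you have first copied AFINN-111.txt to a hypothetical HDFS location, /tmp/AFINN-111.txt, and it relies on the file being tab-delimited:

CREATE TABLE sentiment_dictionary (word STRING, rating INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/AFINN-111.txt' OVERWRITE INTO TABLE sentiment_dictionary;

Either route ends with the same sentiment_dictionary table, so the rest of the tutorial is unaffected by which one you choose.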
Now that we have the required data, let's perform some transformations in Hive.
Back in your Hive View in Ambari, open a new query window.
Create a table to store the words in our tweet text as an array:
CREATE TABLE words_array AS SELECT tweet_id AS id, split(text,' ') AS words FROM tweets;
Create a table that explodes the array into individual words:
CREATE TABLE tweet_word AS SELECT id, word FROM words_array LATERAL VIEW explode(words) w AS word;
Now JOIN the sentiment_dictionary table to the tweet_word table:
CREATE TABLE word_join AS SELECT tweet_word.id, tweet_word.word, sentiment_dictionary.rating FROM tweet_word LEFT OUTER JOIN sentiment_dictionary ON (tweet_word.word=sentiment_dictionary.word);
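As an optional sanity check before moving on, you can spot-check the join. Because we used a LEFT OUTER JOIN, words that do not appear in the dictionary will have a NULL rating, so filtering on non-NULL ratings shows only the words that matched:

SELECT word, rating FROM word_join WHERE rating IS NOT NULL LIMIT 10;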
Great! Now we have each word rated for sentiment, on a scale from -5 to +5.
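As one quick illustration of how these ratings might be used, you could sum them per tweet to produce a rough overall sentiment score for each tweet:

SELECT id, SUM(rating) AS sentiment_score FROM word_join WHERE rating IS NOT NULL GROUP BY id;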
From here, what you decide to do with this data is a topic for a different article; however, now that we have created the word_join table, we can jump back to Atlas to inspect the lineage information associated with our new dataset.
In the Atlas UI, search for word_join.
Notice the connections to all of the parent tables, as well as the actual SQL statements we executed during each transformation, which Atlas records as part of the lineage.
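If the basic search does not surface the table right away, Atlas also offers a DSL search. A query along these lines should find the table entity, though the exact syntax can vary between Atlas versions:

hive_table where name = "word_join"

Selecting the resulting entity and opening its Lineage tab shows the same graph of parent tables and CTAS processes described above.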