Created on 05-27-2016 06:39 PM - edited on 02-11-2020 08:59 PM by VidyaSargur
Before completing this tutorial, it is important to understand data lineage.
Data lineage describes the life cycle of data: where it originates and where it moves over time. In Apache Hive, if I create a table (TableA) and then insert data into it from another table (TableB), the data lineage will display TableA as the target and TableB as the source/origin. The two tables are linked by a process, "insert into Table..", which lets a user understand the data life cycle. In a Hadoop ecosystem, Apache Atlas holds the data lineage for various systems such as Apache Hive, Apache Falcon and Apache Sqoop.
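The source/process/target relationship described above can be pictured as a small directed graph. The following is only an illustrative Python sketch, not how Atlas stores lineage; the class LineageGraph, its methods, and the full text of the process name are made up for this example.

```python
from collections import defaultdict


class LineageGraph:
    """Toy lineage store: a process links source tables to a target table."""

    def __init__(self):
        # target table -> list of (process, source table) pairs
        self._inputs = defaultdict(list)

    def record(self, process, sources, target):
        """Remember that `process` read from `sources` and wrote to `target`."""
        for source in sources:
            self._inputs[target].append((process, source))

    def upstream(self, table):
        """Return every table that feeds `table`, directly or indirectly."""
        seen = []
        stack = [table]
        while stack:
            current = stack.pop()
            for _process, source in self._inputs.get(current, []):
                if source not in seen:
                    seen.append(source)
                    stack.append(source)
        return seen


graph = LineageGraph()
# Hypothetical process name standing in for the "insert into Table.." process.
graph.record("insert into TableA select ... from TableB", ["TableB"], "TableA")
print(graph.upstream("TableA"))  # ['TableB']
```

Walking the graph backwards from a target is what lets a lineage view answer "where did this data come from?" across several hops.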
Apache Atlas is a centralized governance framework that serves the Hadoop ecosystem as a metadata repository. To add metadata to Atlas, libraries called ‘hooks’ are enabled in the various systems; they automatically capture metadata events in those systems and propagate them to Atlas. (More on Atlas' Architecture).
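To make the idea of a hook more concrete, here is a rough Python sketch of the general pattern: a callback that runs after a statement and publishes a metadata event for a governance service to pick up. It does not reproduce Atlas's actual hook classes or transport; hive_hook, run_statement, the event fields and the in-memory queue are all assumptions chosen for illustration.

```python
import queue

# Stand-in for the channel between a hook and the metadata repository.
events = queue.Queue()


def hive_hook(statement, inputs, outputs):
    """Hypothetical post-execution hook: describe what ran and publish it."""
    events.put({
        "source_system": "hive",
        "statement": statement,
        "inputs": list(inputs),
        "outputs": list(outputs),
    })


def run_statement(statement, inputs, outputs, hooks):
    """Pretend to execute a statement, then notify every registered hook."""
    # A real engine would execute the statement here.
    for hook in hooks:
        hook(statement, inputs, outputs)


run_statement(
    "insert into TableA select ... from TableB",
    inputs=["TableB"],
    outputs=["TableA"],
    hooks=[hive_hook],
)

event = events.get_nowait()
print(event["inputs"], "->", event["outputs"])
```

The point of the pattern is that the user only runs the statement; the metadata event is captured as a side effect, without any extra step on the user's part.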
Once the Atlas-Ranger VM is running, you can log in through an SSH shell with user = root, password = hadoop
Atlas UI: http://localhost:21000 (use: Data Lineage), user = admin, password = admin
Ambari UI: http://localhost:8080 (use: Hive View), user = admin, password = admin
create table brancha(full_name string, ssn string, location string);
create table branchb(full_name string, ssn string, location string);
insert into brancha(full_name,ssn,location) values ('ryan', '111-222-333', 'chicago');
insert into brancha(full_name,ssn,location) values ('brad', '444-555-666', 'minneapolis');
insert into brancha(full_name,ssn,location) values ('rupert', '000-000-000', 'chicago');
insert into brancha(full_name,ssn,location) values ('john', '555-111-555', 'boston');
insert into branchb(full_name,ssn,location) values ('jane', '666-777-888', 'dallas');
insert into branchb(full_name,ssn,location) values ('andrew', '999-999-999', 'tampa');
insert into branchb(full_name,ssn,location) values ('ryan', '111-222-333', 'chicago');
insert into branchb(full_name,ssn,location) values ('brad', '444-555-666', 'minneapolis');
(Using Atlas-Ranger preview - execute one insert statement at a time)
Step 6 - In the Atlas UI, select the hyperlink under the column name "default.brancha@abc"
create table branch_intersect as select b1.full_name,b1.ssn,b1.location from brancha b1 inner join branchb b2 ON b1.ssn = b2.ssn;
(orange = current table) You can see that the source table brancha feeds a process, “create table br...”, which populates the target table branch_intersect.
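If you want to check which rows the join places in branch_intersect without a cluster at hand, the statements from this tutorial can be replayed locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive. That is an assumption made purely for convenience: SQLite is not Hive, the string columns are declared as text, and no lineage is captured.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table brancha(full_name text, ssn text, location text);
create table branchb(full_name text, ssn text, location text);
""")

# Same rows as the insert statements in the tutorial.
rows_a = [
    ("ryan", "111-222-333", "chicago"),
    ("brad", "444-555-666", "minneapolis"),
    ("rupert", "000-000-000", "chicago"),
    ("john", "555-111-555", "boston"),
]
rows_b = [
    ("jane", "666-777-888", "dallas"),
    ("andrew", "999-999-999", "tampa"),
    ("ryan", "111-222-333", "chicago"),
    ("brad", "444-555-666", "minneapolis"),
]
conn.executemany("insert into brancha values (?, ?, ?)", rows_a)
conn.executemany("insert into branchb values (?, ?, ?)", rows_b)

# Same join as the tutorial's create-table-as-select statement.
conn.execute("""
create table branch_intersect as
select b1.full_name, b1.ssn, b1.location
from brancha b1 inner join branchb b2 on b1.ssn = b2.ssn
""")

result = conn.execute(
    "select full_name, ssn, location from branch_intersect order by full_name"
).fetchall()
print(result)
# [('brad', '444-555-666', 'minneapolis'), ('ryan', '111-222-333', 'chicago')]
```

Only ryan and brad appear in both branches, so those are the two rows that land in the target table.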
Created on 05-31-2016 01:46 PM
Hi Ryan, nice demo. Some of the confusion around lineage-type questions seems to be about where lineage begins. This is a loaded question, but why would lineage not begin with the initial input of data to a table through, say, the Hive View in Ambari, a beeline script, etc.? Curious about your thoughts.
Created on 06-28-2016 12:57 AM
Hi, this demo helped me a lot. But when I executed the command:
insert into brancha(full_name,ssn,location) values ('ryan', '111-222-333', 'chicago');
It reports an error like this:
java.io.FileNotFoundException: File does not exist: hdfs://localhost:9000/usr/local/data-governance/apache-atlas-0.7-incubating-SNAPSHOT/hook/hive/atlas-client-0.7-incubating-SNAPSHOT.ja
The details of this issue are posted in another thread of mine:
Please check. I hope you can help me. Thank you very much.