Created on 05-27-2016 06:39 PM - edited on 02-11-2020 08:59 PM by VidyaSargur
Before completing this tutorial, it is important to understand data lineage.
What is Data Lineage
Data lineage describes the life cycle of data: where it originates and how it moves over time. In Apache Hive, for example, if you create a table (TableA) and then insert data into it from another table (TableB), the lineage shows TableA as the target and TableB as the source/origin. The two tables are linked by a process, "insert into Table..", which lets a user trace the data life cycle. In a Hadoop ecosystem, Apache Atlas holds the data lineage for systems such as Apache Hive, Apache Falcon, and Apache Sqoop.
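To make the source/process/target relationship concrete, here is a minimal Python sketch of lineage as a small graph. It is only an illustration of the idea, not how Atlas stores lineage: the `LineageGraph` class, its methods, and the process label `hive_insert` are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class LineageGraph:
    """Toy lineage store (hypothetical): a process links source tables to a target."""

    # Maps a target table name to a list of (process, source tables) records.
    edges: dict = field(default_factory=dict)

    def record(self, process, sources, target):
        """Remember that `process` wrote `target` using data from `sources`."""
        self.edges.setdefault(target, []).append((process, list(sources)))

    def upstream(self, table):
        """Return every table that feeds `table`, directly or indirectly."""
        seen = []
        stack = [table]
        while stack:
            current = stack.pop()
            for _process, sources in self.edges.get(current, []):
                for source in sources:
                    if source not in seen:
                        seen.append(source)
                        stack.append(source)
        return seen


graph = LineageGraph()
# "hive_insert" is an illustrative label for the insert process described above.
graph.record("hive_insert", sources=["TableB"], target="TableA")
print(graph.upstream("TableA"))
```

Asking for the upstream tables of TableA returns TableB, which mirrors the target/source relationship a lineage view would display.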
What is Apache Atlas
Apache Atlas is a centralized governance framework that serves the Hadoop ecosystem as a metadata repository. To add metadata to Atlas, libraries called ‘hooks’ are enabled in the various systems; each hook automatically captures metadata events in its system and propagates them to Atlas. (More on Atlas' Architecture).
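The hook pattern can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Atlas API: `MetadataCatalog`, `Hook`, the event fields, and the operation names are all hypothetical, and the direct method call stands in for whatever transport a real deployment uses.

```python
class MetadataCatalog:
    """Hypothetical stand-in for a central metadata store (not the Atlas API)."""

    def __init__(self):
        self.events = []

    def receive(self, event):
        # Store each incoming metadata event in arrival order.
        self.events.append(event)


class Hook:
    """Hypothetical hook: sits inside one source system and forwards its events."""

    def __init__(self, system, catalog):
        self.system = system
        self.catalog = catalog

    def on_event(self, operation, entity):
        # Tag the event with the originating system, then propagate it.
        self.catalog.receive(
            {"system": self.system, "operation": operation, "entity": entity}
        )


catalog = MetadataCatalog()
hive_hook = Hook("hive", catalog)
hive_hook.on_event("create_table", "TableA")
hive_hook.on_event("insert", "TableA")

for event in catalog.events:
    print(event)
```

The point of the design is that the source system never has to be queried for metadata: once a hook is enabled, events reach the central catalog as a side effect of normal operations.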