New to Apache Atlas.
I am planning to use Apache Atlas as a metadata repository and Apache MetaModel for querying data sources to fetch data. Apache Atlas seems promising for maintaining a metadata repository, but there is no API from Apache MetaModel for exchanging metadata with other tools.
My questions are:
1) Does Apache Atlas provide support for exchanging metadata with other Apache tools like Apache MetaModel?
2) Can we browse/query data from data sources using the Apache Atlas API?
Atlas has a robust REST API defined here:
It is very easy to integrate with external tools and/or systems. You can fetch entities, create entities, delete entities, create lineage among entities, etc. You can also define new types to model new external systems.

For example, say you want to track lineage from an Oracle database, through some ETL tool, and into Hive. First you would create types that represent Oracle and the ETL process. Then, as part of the ETL processing, you would make REST calls to Atlas that create entities representing the specific Oracle database and the ETL process; the ETL process making the calls should have all of the data necessary to create the new entities. If the ETL process also creates a new Hive table or drops the data into an existing table, you would capture that in the same REST request as well. If the data lands in HDFS first, the native Hive hook would capture the lineage when you move it from the external table into the managed Hive table. If done correctly, you should now be able to see all of the lineage from the Oracle database through to the final Hive table.

Exporting and importing metadata works about the same way: you script a process that gathers the data and interacts with Atlas via the REST API.
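To make the REST interaction above concrete, here is a minimal sketch of the JSON payload you would POST to the Atlas v2 entity endpoint. The `oracle_table` type name, the `qualifiedName` scheme, the host, and the credentials are all assumptions for illustration, not part of the thread:

```python
import json

# Payload for POST /api/atlas/v2/entity -- creates (or updates) one entity.
# "oracle_table" is a hypothetical custom type; qualifiedName is the unique key.
entity = {
    "entity": {
        "typeName": "oracle_table",
        "attributes": {
            "qualifiedName": "ORCL.SALES.CUSTOMERS@prod",
            "name": "CUSTOMERS",
            "owner": "etl_user",
        },
    }
}

payload = json.dumps(entity)

# To actually submit it (requires a running Atlas instance):
# import requests
# requests.post("http://localhost:21000/api/atlas/v2/entity",
#               data=payload,
#               headers={"Content-Type": "application/json"},
#               auth=("admin", "admin"))
```

The same shape, wrapped in a list, can be sent to the bulk endpoint when the ETL run creates several entities at once.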
Atlas also provides a robust search interface. Under the covers, Atlas uses Solr to index the entities and their fields. This allows you to search using full text or a domain-specific language (DSL) via the Web UI, giving you the ability to govern and provide a data discovery mechanism for data both inside and outside of HDP.
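The same DSL search is also exposed over REST, so external tools can query the catalog without the Web UI. A small sketch of building such a request URL (the host/port and the table name are assumptions):

```python
import urllib.parse

# Atlas exposes DSL search at GET /api/atlas/v2/search/dsl.
# The host/port below are placeholders for illustration.
base = "http://localhost:21000/api/atlas/v2/search/dsl"
dsl = 'hive_table where name = "customers"'
url = base + "?" + urllib.parse.urlencode({"query": dsl, "limit": 10})
# Fetch the resulting URL with any HTTP client (plus authentication)
# to get the matching entities back as JSON.
```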
@Vadim, thanks for the detailed answer; it is a helpful explanation. My question is slightly different: I need to import metadata from external metadata sources (like Apache MetaModel) or directly from external data sources (like MySQL/Oracle).
Alternatively, can I connect to data sources from Apache Atlas and populate the metadata repository without populating the Hive data store?
I touched on this below. First you create the new types in Atlas: for example, in the case of Oracle, an Oracle table type, column type, etc. You would then create a script or process that pulls the metadata from the source metadata store. Once you have the metadata you want to store in Atlas, your process would create the associated Atlas entities, based on the new types, using the Java API or JSON representations through the REST API directly. If you wanted to, you could add lineage as you store the new entities. That's it... now the metadata from the external metadata store is in Atlas.
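The "create the new types" step can be sketched as a typedef payload for the Atlas v2 types endpoint. The `oracle_table` name and its attributes are hypothetical; extending the built-in `DataSet` supertype is what lets entities of the new type participate in lineage:

```python
import json

# Payload for POST /api/atlas/v2/types/typedefs -- registers new entity types.
# "oracle_table" is a hypothetical name; the attribute list is illustrative.
typedefs = {
    "entityDefs": [
        {
            "name": "oracle_table",
            "superTypes": ["DataSet"],  # DataSet membership enables lineage
            "attributeDefs": [
                {"name": "owner", "typeName": "string",
                 "isOptional": True, "cardinality": "SINGLE"},
                {"name": "tablespace", "typeName": "string",
                 "isOptional": True, "cardinality": "SINGLE"},
            ],
        }
    ]
}
payload = json.dumps(typedefs)
```

Once the type is registered, every entity you submit with `"typeName": "oracle_table"` is validated against this definition.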
Thanks @Vadim Vaks for your replies and detailing the process.
Importing metadata/data from external sources (MySQL/Oracle) to Apache Atlas is a common scenario. Are you aware of any such scripts/tools that facilitate data/metadata import into Apache Atlas, or do I need to write the ETL pipeline from scratch?
There is a tool that automatically imports metadata from the Hive Metastore, but not from Oracle or MySQL.
You would need to write the MySQL/Oracle bridge yourself. You can use the Hive bridge as your example.
We are only talking about 400-500 lines of code using the Atlas Client Java API. You just query the metadata from Oracle/MySQL, create the Atlas types (DB/table/column), instantiate entities using the Atlas Java API, and then submit them to Atlas. Once the bridge is built, you can schedule it to run every couple of hours. If you need real-time updates of metadata, you could build an event-based process driven by a trigger.
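A bridge of the kind described boils down to: read the source catalog, map each object to an Atlas entity, submit. A minimal sketch, using an in-memory SQLite database as a stand-in for the MySQL/Oracle catalog (the `rdbms_table`/`rdbms_column` type names and the `qualifiedName` scheme are assumptions):

```python
import sqlite3

# Stand-in for the source system's metadata catalog.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

def extract_entities(conn, db_name):
    """Map each table and column in the catalog to an Atlas-style entity dict."""
    entities = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    for (table,) in tables:
        entities.append({
            "typeName": "rdbms_table",  # hypothetical custom type
            "attributes": {
                "qualifiedName": f"{db_name}.{table}",
                "name": table,
            },
        })
        # PRAGMA table_info yields (cid, name, type, notnull, default, pk).
        for _, col, col_type, *_ in conn.execute(f"PRAGMA table_info({table})"):
            entities.append({
                "typeName": "rdbms_column",  # hypothetical custom type
                "attributes": {
                    "qualifiedName": f"{db_name}.{table}.{col}",
                    "name": col,
                    "data_type": col_type,
                },
            })
    return entities

entities = extract_entities(conn, "prod_db")
# Each dict would then be wrapped as {"entity": ...} and POSTed to Atlas
# (individually, or in bulk) as described above.
```

Swapping the SQLite catalog queries for the equivalent `information_schema` (MySQL) or `ALL_TABLES`/`ALL_TAB_COLUMNS` (Oracle) queries is most of the remaining work.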
Do we have any example/article/tutorial for displaying lineage and impact in Atlas for data flowing from SAP through ETL into Hadoop?
We can import the metadata of the DB and the ETL processes into Hadoop, but how do we link this with the Hive table, or show it as a single lineage graph that covers the source of the data (the SAP DB), the ETL process, and the landing in Hadoop?
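One way to get the single lineage graph asked about is to model the ETL job as an Atlas `Process` entity whose inputs reference the SAP table and whose outputs reference the Hive table; Atlas draws the lineage graph from those references. A hedged sketch of the payload (the `sap_table` type and all names/qualifiedNames are hypothetical; `hive_table` is the standard Hive type):

```python
import json

# A Process entity links inputs to outputs; Atlas builds the lineage graph
# from these references. Referenced entities are identified by type plus
# unique attributes (here, qualifiedName), so they must already exist.
etl_process = {
    "entity": {
        "typeName": "Process",
        "attributes": {
            "qualifiedName": "etl.sap_to_hive.daily@cluster",
            "name": "sap_to_hive_daily",
            "inputs": [{
                "typeName": "sap_table",  # hypothetical custom type
                "uniqueAttributes": {"qualifiedName": "SAP.ERP.VBAK@prod"},
            }],
            "outputs": [{
                "typeName": "hive_table",
                "uniqueAttributes": {
                    "qualifiedName": "default.sales_orders@cluster"},
            }],
        },
    }
}
payload = json.dumps(etl_process)
# After POSTing this to the entity endpoint, the lineage view of either
# table shows SAP -> ETL process -> Hive in one graph.
```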