How does Hive notify Atlas about DML/DDL operations, from which Atlas generates lineage? What is the whole flow, and what information does Hive send to Atlas?
Hi Saba, this Hive architecture doc from Apache gives a good breakdown; if it does not answer your question, please follow up and I can dive a little deeper. http://atlas.incubator.apache.org/Bridge-Hive.html
This diagram may also help, and the most recent announcement of a deeper IBM partnership will accelerate this space; much more to come.
I want to know: if a change occurs in the Hive database, e.g. we create table B from table A, does Hive notify the Atlas server on its own, or does the HiveHook constantly check the Hive DB and pull the changes?
These are good questions. I hope I am able to do justice to them with my answers.
To elaborate a little more on what @Sarath Subramanian said: Kafka does the work of relaying the notifications from Hive to Atlas. Hive publishes to a topic, Atlas subscribes to that topic, and thus receives the notifications.
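The publish/subscribe relay can be sketched as below. This is only a simulation of the idea: an in-memory Python list stands in for the Kafka topic, and only the topic name ATLAS_HOOK comes from the real integration; everything else is illustrative.

```python
import json

# Stand-in for the Kafka topic ATLAS_HOOK; in a real deployment this is
# a Kafka topic, not a Python list.
atlas_hook_topic = []

def hive_hook_publish(entity):
    """The Hive-side hook serializes the entity and publishes it."""
    atlas_hook_topic.append(json.dumps(entity))

def atlas_consume():
    """Atlas subscribes to the topic and ingests each message."""
    ingested = []
    while atlas_hook_topic:
        message = json.loads(atlas_hook_topic.pop(0))
        ingested.append(message["name"])  # Atlas would ingest the entity here
    return ingested

# A Hive metadata event (e.g. CREATE TABLE) fires the hook:
hive_hook_publish({"type": "hive_table", "name": "emp"})
print(atlas_consume())  # → ['emp']
```

The key point is that Hive pushes events out through the hook; Atlas never has to poll Hive.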
There has been some discussion on using Atlas for MySQL and Oracle. I have not seen an implementation yet, but it is possible, provided these two products have notification mechanisms. From what I know, both have database change triggers that can be used to call a REST API, push a message onto a queue, or publish to Kafka.
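As a concrete illustration of the trigger idea (using SQLite here only for portability; MySQL and Oracle trigger syntax differs, and all table names are made up): a trigger writes each change into an "outbox" table, and a separate relay process could then read that table and publish to Kafka or call a REST API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT);
    -- Outbox table: a relay process would read rows from here,
    -- publish them to a queue/Kafka, then mark them processed.
    CREATE TABLE change_outbox (event TEXT, payload TEXT);
    CREATE TRIGGER notify_insert AFTER INSERT ON employees
    BEGIN
        INSERT INTO change_outbox VALUES ('INSERT', NEW.name);
    END;
""")

conn.execute("INSERT INTO employees VALUES (1, 'alice')")
print(conn.execute("SELECT * FROM change_outbox").fetchall())
# → [('INSERT', 'alice')]
```

This "transactional outbox" pattern keeps the notification atomic with the data change, which is roughly the guarantee you want from a metadata hook.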
For Oracle, this is what I found.
Hope this helps.
>> How does Hive notify Atlas about DML/DDL operations, from which Atlas generates lineage?
Whenever a metadata change event occurs in Hive, the HiveHook captures it and publishes the details of the created/updated Hive entity to a Kafka topic called ATLAS_HOOK. Atlas is the consumer of ATLAS_HOOK, so Atlas receives the message from there.
>> What information does Hive send to Atlas?
hive > create table emp(id int,name string);
1. The HiveHook composes a JSON message containing the table name, database, columns, and other table properties, and sends it to ATLAS_HOOK.
2. ATLAS_HOOK queues the messages from the HiveHook; Atlas consumes the JSON message about table emp from it and ingests it.
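To give a feel for the shape of such a message, here is a simplified sketch for the emp table above. This is not the exact Atlas hook message format, which carries many more attributes; the field names are illustrative.

```python
import json

# Hypothetical, simplified notification for
# "create table emp(id int, name string)"; the real HiveHook message
# is much richer than this.
message = {
    "type": "hive_table",
    "attributes": {
        "name": "emp",
        "db": "default",
        "columns": [
            {"name": "id", "type": "int"},
            {"name": "name", "type": "string"},
        ],
    },
}

print(json.dumps(message, indent=2))
```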
hive > create table t_emp as select * from emp;
1. The HiveHook composes a JSON message that contains the t_emp details as well as the source table name (emp), and sends it to ATLAS_HOOK.
2. From the JSON message consumed from ATLAS_HOOK, Atlas understands that this is a CTAS table with a source table; it ingests t_emp and constructs lineage between emp and t_emp.
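The CTAS case adds input/output information that lets Atlas draw the lineage edge. Again, a simplified sketch of the idea, not the exact wire format; the attribute names are illustrative.

```python
# Hypothetical, simplified notification for
# "create table t_emp as select * from emp". Atlas models the query as
# a process entity whose inputs and outputs define the lineage edge.
ctas_message = {
    "type": "hive_process",
    "attributes": {
        "queryText": "create table t_emp as select * from emp",
        "inputs": ["default.emp"],     # source table  -> upstream node
        "outputs": ["default.t_emp"],  # created table -> downstream node
    },
}

# Atlas reads inputs/outputs and connects emp -> t_emp:
for src in ctas_message["attributes"]["inputs"]:
    for dst in ctas_message["attributes"]["outputs"]:
        print(f"lineage edge: {src} -> {dst}")
# → lineage edge: default.emp -> default.t_emp
```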
>> Is Hive going to notify the Atlas server on its own, or is the HiveHook going to constantly check the Hive DB and pull the changes?
The HiveHook does not check Hive constantly. Whenever a metadata change event occurs (e.g. when a user fires a Hive query that creates, updates, or drops an entity), the HiveHook notifies ATLAS_HOOK.
If you want to see the exact JSON content sent by the HiveHook, you can create a table in Hive and inspect the message that lands in ATLAS_HOOK for that table.
Thanks @Sharmadha Sainath for the quick and concise response :) Is there a hook available for MySQL or Oracle?
If not, how can we build a hook for MySQL or Oracle? What would the process of creating one involve?
There is a Sqoop Atlas hook; if that is how you are pulling data from MySQL/Oracle, it will track the lineage. You also have a full set of APIs into Atlas (here is a Swagger page for perspective: http://atlas.incubator.apache.org/api/v2/ui/index.html ). There is also third-party governance tooling that integrates with Atlas, growing every quarter, and with the latest extended partnership with IBM we will see more around Atlas in the near future; a little info here: http://www.kmworld.com/Articles/News/News/IBM-and-Hortonworks-Expand-Partnership-118808.aspx
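For a custom MySQL/Oracle hook, one option is the REST API route: your trigger or relay process composes an entity payload and POSTs it to Atlas. Below is a rough stdlib-only sketch of building such a request; the endpoint URL, typeName, and attribute names are all assumptions for illustration, so check the v2 API docs linked above for the real entity model. The request is built but deliberately not sent.

```python
import json
from urllib import request

# Hypothetical endpoint; the real path depends on your Atlas deployment
# and the v2 REST API (see the Swagger UI linked above).
ATLAS_URL = "http://localhost:21000/api/v2/entity"

def build_entity_request(db, table, columns):
    """Compose a POST request carrying a table entity for Atlas.

    typeName and attribute names here are illustrative, not the
    official Atlas entity model.
    """
    payload = {
        "entity": {
            "typeName": "rdbms_table",
            "attributes": {
                "qualifiedName": f"{db}.{table}",
                "name": table,
                "columns": columns,
            },
        }
    }
    return request.Request(
        ATLAS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_entity_request("sales", "orders", ["id", "amount"])
print(req.get_method(), req.get_full_url())
# → POST http://localhost:21000/api/v2/entity
```

A relay process would call something like `urllib.request.urlopen(req)` (with authentication added) for each row it drains from the change outbox.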