Created 11-25-2016 12:39 PM
Hi Everyone,
I have read many blogs and documents online about Apache Atlas and Apache Falcon, and I have also done some POCs with these tools. But I still don't understand what the actual difference between them is.
As I understand it, both tools aim to provide data lifecycle management and data governance features, so I am a bit confused: they seem to offer similar features.
I don't understand which tool I should use for data governance in my use case, since both provide lineage.
So my (general) question is: where does each of these tools fit in a use case like mine?
Thanks in advance.
Created 11-25-2016 01:54 PM
Hi,
Atlas and Falcon serve very different purposes, but there are some areas where they overlap. Maybe that is where your confusion comes from.
Atlas:
-It really is like an 'atlas' of almost all the metadata in HDP: the Hive metastore, the Falcon repo, Kafka topics, HBase tables, etc. This single view of the metadata enables powerful search capabilities on top of it, including full-text search (based on Solr).
-Since Atlas has this comprehensive view of the metadata, it can also provide insight into lineage: by combining Hive DDL statements it can tell which table was the source for another table.
-Another core feature is that you can assign tags to all metadata entities in Atlas. So you can say that column B in Hive table Y holds sensitive data by assigning a 'PII' tag to it. An HDFS folder or an HBase column family (CF) can be assigned a 'PII' tag as well. From there you can create tag-based policies in Ranger to manage access to anything tagged 'PII' in Atlas.
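To make the lineage and tagging ideas concrete, here is a minimal conceptual sketch in plain Python. It is not the Atlas or Ranger API: the `Catalog` class, the `is_allowed` function, and all table, path, and user names are hypothetical and only illustrate how a single metadata view can answer "where did this table come from?" and "who may touch anything tagged 'PII'?".

```python
from collections import defaultdict


class Catalog:
    """Hypothetical in-memory stand-in for a metadata catalog (not the Atlas API)."""

    def __init__(self):
        self.tags = defaultdict(set)    # entity name -> tags assigned to it
        self.inputs = defaultdict(set)  # entity name -> its direct source entities

    def tag(self, entity, tag):
        self.tags[entity].add(tag)

    def record_process(self, sources, target):
        # e.g. a CREATE TABLE ... AS SELECT that reads `sources` and writes `target`
        self.inputs[target].update(sources)

    def upstream(self, entity):
        # Collect all direct and indirect sources of `entity`.
        seen, stack = set(), [entity]
        while stack:
            for src in self.inputs[stack.pop()]:
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen


def is_allowed(catalog, tag_policies, user, entity):
    # Deny access if any tag on the entity has a policy that excludes the user.
    for t in catalog.tags[entity]:
        if t in tag_policies and user not in tag_policies[t]:
            return False
    return True


catalog = Catalog()
catalog.record_process(["raw_customers"], "customers_clean")
catalog.record_process(["customers_clean", "orders"], "customer_orders")

# The same tag can sit on a Hive column and on an HDFS folder.
catalog.tag("customers_clean.email", "PII")
catalog.tag("/data/raw/customers", "PII")

# One tag-based policy covers every entity carrying the tag.
policies = {"PII": {"alice"}}

print(sorted(catalog.upstream("customer_orders")))
print(is_allowed(catalog, policies, "bob", "/data/raw/customers"))
```

The point of the design is that the policy is attached to the tag, not to each table or folder, so tagging a new entity 'PII' is enough to bring it under the same access rule.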
Falcon:
-It is more like a scheduling and execution engine for HDP components such as Hive, Spark, HDFS distcp, and Sqoop, used to move data around and/or process data along the way. In a way Falcon is a much improved Oozie.
-The metadata of Falcon dataflows is actually published to Atlas through Kafka topics, so Atlas knows about Falcon metadata too and can include Falcon processes and their resulting meta objects (tables, HDFS folders, flows) in its lineage graphs.
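The hand-off described above (a scheduler runs a dataflow, then publishes its metadata to a topic that the metadata store consumes) can be sketched roughly as follows. This is only an illustration: a `queue.Queue` stands in for the Kafka topic, and the function names, event fields, and paths are assumptions, not Falcon's or Atlas's actual message format.

```python
import json
from queue import Queue

# Stand-in for a Kafka topic; hypothetical, no real broker involved.
topic = Queue()


def run_dataflow(name, inputs, outputs):
    # A scheduler such as Falcon would launch the real job here (distcp, Hive, ...).
    # Afterwards it publishes a description of what was read and written.
    event = {"process": name, "inputs": inputs, "outputs": outputs}
    topic.put(json.dumps(event))


def consume_lineage(topic):
    # Metadata-side consumer: turn each event into (source, process, target) edges.
    edges = []
    while not topic.empty():
        event = json.loads(topic.get())
        for src in event["inputs"]:
            for dst in event["outputs"]:
                edges.append((src, event["process"], dst))
    return edges


run_dataflow("archive_clicks", ["/analysis/clicks"], ["/archive/clicks"])
edges = consume_lineage(topic)
print(edges)
```

The scheduler never writes into the metadata store directly; it only emits events, which is why the metadata side can build lineage across tools that know nothing about each other.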
I know that in the docs both tools claim the term 'data governance', but I feel Atlas is more about that than Falcon is. It is not that clear what data governance actually is. With Atlas you can really apply governance by collecting, querying, and tagging all metadata, while Falcon can execute processes that revolve around that by moving data from one place to another (and yes, Falcon moving a dataset from an analysis cluster to an archiving cluster is also a form of data governance/management).
Hope that helps