Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

What is the difference between Apache atlas and Apache falcon?

Solved Go to solution
Highlighted

What is the difference between Apache atlas and Apache falcon?

Expert Contributor

Hi Everyone,

I have read so many blogs and document over internet regarding Apache atlas and Apache falcon and have done some POC also using these tools.but here,I don't understand what is the actual difference between these tool?

As per my understanding both the tools are committing to provide data management life cycle and data governance featuresalso.so I am little bit confused here and feeling that both are providing similar features.

I don't understand which tool I should use in my use case for data governance as both are giving lineage?.

Here i am confused that where these above tool will fit in my use case(general questionj)?.

Thanks in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
Highlighted

Re: What is the difference between Apache atlas and Apache falcon?

Super Collaborator

@Manoj Dhake

Hi,

Atlas and Falcon serve very different purposes, but there are some areas where they touch base. Maybe that is where your confusion comes from.

Atlas:

-really like an 'atlas' to almost all of the metadata that is around in HDP like Hive metastore, Falcon repo, Kafka topics, Hbase table etc. This single view on metadata makes for some powerfull searching capabilities on top of that with full text search (based on solr)

-Since Atlas has this comprehensive view on metadata it is also capable of providing insight in lineage, so it can tell by combining Hive DDL's what table was the source for another table.

-Another core feature is that you assign tags to all metadata entities on Atlas. So you can say that column B in Hive table Y holds sensitive data by assigning a 'PII' tag to it. But a hdfs folder can also be assigned a 'PII' tag or a CF from Hbase. From there you can create tag based policies from Ranger to manage access to anything 'PII' tagged in Atlas.

Falcon:

-more like a scheduling and execution engine for HDP components like Hive, Spark, hdfs distcp, Sqoop to move data around and/or process data along the way. In a way Falcon is a much improved Oozie.

-metadata of Falcon dataflows is actually sinked to Atlas through Kafka topics so Atlas knows about Falcon metadata too and Atlas can include Falcon processes and its resulting meta objects (tables, hdfs folders, flows) into its lineage graphs.

I know that in the docs both tools claim the term 'data governance', but I feel Atlas is more about that then Falcon is. It is not that clear what Data Governance actually is. With Atlas you can really apply governance by collecting all metadata querying and tagging it and Falcon can maybe execute processes that evolve around that by moving data from one place to another (and yes, Falcon moving a dataset from an analysis cluster to an archiving cluster is also about data governance/management)

Hope that helps

View solution in original post

1 REPLY 1
Highlighted

Re: What is the difference between Apache atlas and Apache falcon?

Super Collaborator

@Manoj Dhake

Hi,

Atlas and Falcon serve very different purposes, but there are some areas where they touch base. Maybe that is where your confusion comes from.

Atlas:

-really like an 'atlas' to almost all of the metadata that is around in HDP like Hive metastore, Falcon repo, Kafka topics, Hbase table etc. This single view on metadata makes for some powerfull searching capabilities on top of that with full text search (based on solr)

-Since Atlas has this comprehensive view on metadata it is also capable of providing insight in lineage, so it can tell by combining Hive DDL's what table was the source for another table.

-Another core feature is that you assign tags to all metadata entities on Atlas. So you can say that column B in Hive table Y holds sensitive data by assigning a 'PII' tag to it. But a hdfs folder can also be assigned a 'PII' tag or a CF from Hbase. From there you can create tag based policies from Ranger to manage access to anything 'PII' tagged in Atlas.

Falcon:

-more like a scheduling and execution engine for HDP components like Hive, Spark, hdfs distcp, Sqoop to move data around and/or process data along the way. In a way Falcon is a much improved Oozie.

-metadata of Falcon dataflows is actually sinked to Atlas through Kafka topics so Atlas knows about Falcon metadata too and Atlas can include Falcon processes and its resulting meta objects (tables, hdfs folders, flows) into its lineage graphs.

I know that in the docs both tools claim the term 'data governance', but I feel Atlas is more about that then Falcon is. It is not that clear what Data Governance actually is. With Atlas you can really apply governance by collecting all metadata querying and tagging it and Falcon can maybe execute processes that evolve around that by moving data from one place to another (and yes, Falcon moving a dataset from an analysis cluster to an archiving cluster is also about data governance/management)

Hope that helps

View solution in original post

Don't have an account?
Coming from Hortonworks? Activate your account here