How to create data lineage in Atlas for data that is copied from local to HDFS and transformed/processed by Spark SQL

New Contributor

I have two datasets that were copied from the local file system to HDFS, then joined and transformed using Spark SQL and stored as a single dataset in HDFS. I was able to capture the metadata and push it to Atlas through the Atlas REST API, which provides POST methods for submitting JSON. For data lineage, however, I could only find a GET method. How do I create data lineage in this scenario?

1 ACCEPTED SOLUTION

Expert Contributor

Lineage is generated from the type definitions called Process and DataSet. When you create entities of these types with sufficient information, depicting the "Process" of copying a "DataSet" from local to HDFS and similarly for what's happening in the Spark realm, Atlas should be able to generate the lineage info for you.

All you need is to create the Process and DataSet entities for the above scenario. HTH
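
To make that a bit more concrete, here is a rough sketch of what the entities for the Spark SQL join step might look like as a JSON payload. It assumes the Atlas v2 REST API and the built-in `hdfs_path` type (a DataSet subtype) and the base `Process` type. All paths, names, and `qualifiedName` values are made up for illustration, and the helper functions are hypothetical, not part of any Atlas client. The copy from local to HDFS could be modeled the same way with a second Process entity.

```python
import json


def hdfs_dataset(qualified_name, name, path):
    """Build an hdfs_path entity (assumed here to be a DataSet subtype)."""
    return {
        "typeName": "hdfs_path",
        "attributes": {
            "qualifiedName": qualified_name,
            "name": name,
            "path": path,
        },
    }


def reference(entity):
    """Refer to an entity by its unique qualifiedName rather than a GUID."""
    return {
        "typeName": entity["typeName"],
        "uniqueAttributes": {
            "qualifiedName": entity["attributes"]["qualifiedName"]
        },
    }


def process(qualified_name, name, inputs, outputs):
    """Build a Process entity; its inputs/outputs are what link the lineage."""
    return {
        "typeName": "Process",
        "attributes": {
            "qualifiedName": qualified_name,
            "name": name,
            "inputs": [reference(e) for e in inputs],
            "outputs": [reference(e) for e in outputs],
        },
    }


def build_payload():
    # Hypothetical datasets: two inputs already in HDFS, one joined output.
    a = hdfs_dataset("/data/raw/a.csv@demo", "a.csv", "hdfs://demo/data/raw/a.csv")
    b = hdfs_dataset("/data/raw/b.csv@demo", "b.csv", "hdfs://demo/data/raw/b.csv")
    joined = hdfs_dataset("/data/out/joined@demo", "joined", "hdfs://demo/data/out/joined")
    # Hypothetical Process describing the Spark SQL join.
    job = process("spark_sql_join@demo", "spark_sql_join", [a, b], [joined])
    # Datasets are listed before the process that references them.
    return {"entities": [a, b, joined, job]}


if __name__ == "__main__":
    print(json.dumps(build_payload(), indent=2))
```

The printed JSON would then be sent with a POST to the bulk entity endpoint (`/api/atlas/v2/entity/bulk` in the v2 API). Once the entities exist, the lineage GET endpoint for the output dataset should return the graph, assuming the references resolve on your Atlas version.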

2 REPLIES

New Contributor

Thanks for your response. I found the technical details of the approach you mentioned in this link.