Created 03-02-2017 12:58 AM
Hi all,
Does Apache Atlas support data lineage for Spark?
If not, when is that expected?
Created 03-02-2017 07:46 AM
No, it's not supported out of the box. The only way to integrate Spark with Atlas right now is to call the Atlas API from your Spark application, using either the REST API or the Java API. In this document there is an example of how to integrate HBase using the REST API, and here you can find the Maven artifacts for the Java API. Either way it will be an intricate project, but I think the Java API is easier. As for when support is expected: I don't know. A general approach is not easy, so probably not soon.
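To make the REST route a bit more concrete, here is a minimal sketch of what the Spark driver could send to Atlas once a job has finished writing. It assumes the Atlas v2 entity endpoint (`/api/atlas/v2/entity`), the built-in `Process` type with its `inputs` and `outputs` attributes, and that the datasets are already registered in Atlas as `hdfs_path` entities. The host name, the `@mycluster` suffix, and the helper names are hypothetical, so adjust them to your setup and Atlas version.

```python
import base64
import json
import urllib.request

# Hypothetical host; 21000 is the usual Atlas HTTP port.
ATLAS_URL = "http://atlas-host:21000"


def dataset_ref(type_name, qualified_name):
    """Reference an entity already in Atlas by its unique qualifiedName."""
    return {"typeName": type_name, "uniqueAttributes": {"qualifiedName": qualified_name}}


def build_process_entity(job_name, inputs, outputs):
    """Build the JSON body describing one job: inputs -> job -> outputs."""
    return {
        "entity": {
            "typeName": "Process",  # assumed: built-in base type is used directly
            "attributes": {
                "qualifiedName": f"{job_name}@mycluster",  # hypothetical naming scheme
                "name": job_name,
                "inputs": [dataset_ref(t, q) for t, q in inputs],
                "outputs": [dataset_ref(t, q) for t, q in outputs],
            },
        }
    }


def post_entity(payload, user, password):
    """POST the entity to Atlas using HTTP basic auth (not called in this sketch)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"{ATLAS_URL}/api/atlas/v2/entity",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_process_entity(
    "daily_agg",
    inputs=[("hdfs_path", "hdfs://nn/data/in@mycluster")],
    outputs=[("hdfs_path", "hdfs://nn/data/out@mycluster")],
)
```

You would call `post_entity(payload, user, password)` from the driver after the write succeeds. Note that the application has to know its own inputs and outputs; nothing here discovers them automatically, which is the hard part of a general solution.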
Created 04-05-2017 11:13 AM
Quite unbelievable that Spark is so poorly supported in Atlas!
I did some research and could not find a good data lineage (data management) solution that integrates well with Spark.
Very happy to hear other ideas!
The best tools I could find that have at least some Spark integration (though none seem usable in a standalone fashion):
* Cloudera Navigator, which is both closed-source and can't be used standalone (only in a Cloudera cluster deployment)
* cask.co's CDAP: http://cask.co/products/cdap/ shows nice features in section "Metadata & Lineage"
* Talend's platform similarly: https://www.talend.com/blog/2016/10/10/five-pillars-for-succeeding-in-big-data-governance-and-metada...
* LinkedIn WhereHows: only 2 mentions when searching for 'spark' in their GitHub project, which does not look promising: https://github.com/linkedin/WhereHows/issues/238
Created 01-09-2018 10:38 PM
Hi, only a remark:
Cloudera Navigator only supports Spark SQL lineage (at the DataFrame level); RDD lineage is not supported. Maybe a good starting point would be to capture lineage from the requests that Spark's HiveContext makes to the Hive metastore?
Created 04-25-2018 09:03 AM
Is there any plan for Spark support in an upcoming HDP release?
Created 12-11-2018 12:15 AM
This is a Tech Preview feature in HDP 3.0.1
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/release-notes/content/tech_previews.html
Please see https://github.com/hortonworks-spark/spark-atlas-connector
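For orientation, a rough sketch of how the connector is wired into a job: as far as I recall from its README, it is attached through Spark listener settings rather than code changes. The listener class names below are taken from that README as I remember it and should be checked against the connector version you use. The jar path, the application file name, and the `spark_submit_args` helper are hypothetical.

```python
# Listener settings as described in the spark-atlas-connector README
# (assumption: verify the class names against your connector version).
LISTENER_CONF = {
    "spark.extraListeners": "com.hortonworks.spark.atlas.SparkAtlasEventTracker",
    "spark.sql.queryExecutionListeners": "com.hortonworks.spark.atlas.SparkAtlasEventTracker",
    "spark.sql.streaming.streamingQueryListeners":
        "com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker",
}


def spark_submit_args(connector_jar, app):
    """Assemble a spark-submit command line that enables the connector."""
    args = ["spark-submit", "--jars", connector_jar]
    for key, value in LISTENER_CONF.items():
        args += ["--conf", f"{key}={value}"]
    return args + [app]


# Hypothetical jar location and application file.
cmd = spark_submit_args("/opt/jars/spark-atlas-connector-assembly.jar", "my_job.py")
print(" ".join(cmd))
```

The connector also needs to find Atlas itself, which the README handles through an `atlas-application.properties` file on the driver's classpath; that part is not shown here.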
Created 08-06-2021 03:54 AM
Good news: CDP supports Spark integration with Atlas.
Note: in HDP it is experimental only.