Hi all,
Does Apache Atlas support data lineage for Spark?
If not, when is that expected?
No, it's not supported out of the box. For now, the only way to integrate Spark with Atlas is to call the Atlas API from your Spark application, using either the REST API or the Java API. In this document there is an example of how to integrate HBase using the REST API, and here you can find the Maven artifacts for the Java API. Either way it will be an intricate project, but I think the Java API is the easier route. As for when support is expected: I don't know. A general approach is not easy, so probably not soon.
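To give a rough idea of the REST route, here is a minimal sketch, not a tested integration. It assumes the Atlas v2 entity endpoint (`/api/atlas/v2/entity`), basic authentication, the generic `Process` type, and that the input and output datasets already exist in Atlas as `hive_table` entities. The helper names, the job name, the table names and the cluster name `prod` are all made up for illustration.

```python
import base64
import json
import urllib.request


def dataset_ref(type_name, qualified_name):
    """Reference an entity that already exists in Atlas by its qualifiedName."""
    return {"typeName": type_name,
            "uniqueAttributes": {"qualifiedName": qualified_name}}


def build_process_entity(job_name, cluster, inputs, outputs):
    """Build the JSON body describing one Spark job as a generic Atlas Process.

    inputs/outputs are lists of (type_name, qualified_name) pairs.
    """
    return {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "qualifiedName": f"{job_name}@{cluster}",
                "name": job_name,
                "inputs": [dataset_ref(t, q) for t, q in inputs],
                "outputs": [dataset_ref(t, q) for t, q in outputs],
            },
        }
    }


def post_entity(atlas_url, user, password, payload):
    """POST the payload to Atlas (assumed v2 endpoint, basic auth)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"{atlas_url}/api/atlas/v2/entity",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical job and tables; only prints the payload, no network call.
    payload = build_process_entity(
        "daily_sales_agg", "prod",
        inputs=[("hive_table", "sales.raw_orders@prod")],
        outputs=[("hive_table", "sales.daily_totals@prod")],
    )
    print(json.dumps(payload, indent=2))
```

You would call `post_entity(...)` from the driver at the end of the Spark job, once you know which tables were read and written. The hard part, and the reason a general solution is not easy, is working out those inputs and outputs automatically rather than hard-coding them per job.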
Quite unbelievable that Spark is so poorly supported in Atlas!
I did some research and could not find a good data lineage (data management) solution that integrates well with Spark.
I'd be very happy to hear other ideas!
The best tools I could find that have at least some Spark integration (though none seem usable in a standalone fashion):
* Cloudera Navigator, which is both closed-source and can't be used standalone (only in a Cloudera cluster deployment)
* cask.co's CDAP: http://cask.co/products/cdap/ shows nice features in section "Metadata & Lineage"
* Talend's platform similarly: https://www.talend.com/blog/2016/10/10/five-pillars-for-succeeding-in-big-data-governance-and-metada...
* LinkedIn WhereHows: searching for 'spark' in their GitHub project returns only 2 mentions, which does not look promising: https://github.com/linkedin/WhereHows/issues/238
Hi, only a remark:
Cloudera Navigator only supports Spark SQL lineage (at the DataFrame level); RDD lineage is not supported. Maybe a good starting point would be to capture lineage from the requests that Spark's HiveContext makes to the Hive metastore?
Spark lineage in Atlas is a Tech Preview feature in HDP 3.0.1.