Apache Atlas - support for Spark

Expert Contributor

Hi all,

Does Apache Atlas support data lineage for Spark?

If not, when is that expected?

Re: Apache Atlas - support for Spark

No, it's not supported out of the box. The only way to integrate Spark with Atlas right now is to call the Atlas API from your Spark application, using either the REST API or the Java API. In this document there is an example of how to integrate HBase using the REST API, and here you can find the Maven artifacts for the Java API. Either way it will be an intricate project, but I think the Java API is easier. When is support expected? I don't know. A general approach is not easy, so probably not soon.
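To make the REST route a bit more concrete, here is a minimal sketch of what a Spark job's driver could send to Atlas once it has finished. It assumes the Atlas v2 REST API (`POST /api/atlas/v2/entity`), which may not match the Atlas version you are running, and it assumes the input and output datasets are already registered in Atlas (for example by the Hive hook). The helper names, the job name, and the qualified names are all hypothetical; this is not an official integration.

```python
import base64
import json
import urllib.request


def dataset_ref(type_name, qualified_name):
    """Reference an entity that already exists in Atlas by its unique qualifiedName."""
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def build_process_entity(name, qualified_name, inputs, outputs):
    """Build the JSON body for a generic Atlas 'Process' entity.

    Atlas derives lineage from the 'inputs' and 'outputs' of Process entities,
    so one entity per Spark job run (or per logical job) is enough for a start.
    """
    return {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "name": name,
                "qualifiedName": qualified_name,
                "inputs": inputs,
                "outputs": outputs,
            },
        }
    }


def post_entity(atlas_url, user, password, payload):
    """Send the entity to Atlas using HTTP basic auth (assumed v2 endpoint)."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        url=atlas_url.rstrip("/") + "/api/atlas/v2/entity",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

From the driver you would build the payload with something like `build_process_entity("daily_agg", "spark.daily_agg@cluster1", [dataset_ref("hive_table", "db.src@cluster1")], [dataset_ref("hive_table", "db.dst@cluster1")])` and pass it to `post_entity` together with your Atlas URL and credentials. The hard part, which this sketch does not attempt, is working out the inputs and outputs automatically from the Spark job.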

Re: Apache Atlas - support for Spark

Quite unbelievable that Spark is so poorly supported in Atlas!

I did some research and could not find a good data lineage (data management) solution that integrates well with Spark.

I would be very happy to hear other ideas!

These are the best solutions/tools I could find that at least have Spark integration (though none seem usable standalone):

* Cloudera Navigator, which is both closed-source and can't be used standalone (only in a Cloudera cluster deployment)

* cask.co's CDAP: http://cask.co/products/cdap/ shows nice features in section "Metadata & Lineage"

* Talend's platform similarly: https://www.talend.com/blog/2016/10/10/five-pillars-for-succeeding-in-big-data-governance-and-metada...

* LinkedIn WhereHows: searching for 'spark' in their GitHub project returns only 2 mentions, which does not look promising: https://github.com/linkedin/WhereHows/issues/238

Re: Apache Atlas - support for Spark

New Contributor

Hi, only a remark:

Cloudera Navigator only supports Spark SQL lineage (at the DataFrame level); RDD lineage is not supported. Maybe a good starting point would be to capture lineage from the requests that Spark's HiveContext makes to the Hive metastore?

Re: Apache Atlas - support for Spark

New Contributor

Is there any plan for Spark support in an upcoming HDP release?

Re: Apache Atlas - support for Spark

Super Collaborator

Hi,

You could try this:

https://absaoss.github.io/spline/
