
97470-hcc-cust-atlas-3-draft.png

Introduction

In Customizing Atlas (Part1): Model governance, traceability and registry I provided a brief overview of Atlas types and entities and showed how to customize them to fit your needs. I showed the specific example of a Model type used to govern your deployed data science models and complex Spark code.

In Customizing Atlas (Part2): Deep source metadata and embedded entities I showed how to customize Atlas to hold knowledge of ingested data that goes well beyond the data itself, e.g. details of the device that generated the data. I also showed how to implement the pattern of embedding an entity (not a string) as an attribute value in your custom type. The result is a clickable hyperlink in the UI that opens that entity and its metadata.

In this post I will:

  • show how to extend your Atlas lineage to include processing and outputs on non-Hadoop systems
  • represent the above as a single lineage that connects data in Hadoop to a reporting system which generates a report and sends an email with the report attached
  • emphasize a key principle about Atlas: because its REST API and customized types allow metadata to be sent and represented from any system, Atlas can centralize metadata from your entire data landscape.

Concepts and Example

Main Concept

The main concept is summarized below.

97471-hcc-cust-atlas-3-page-9.png

The Atlas lineage includes processing and outputs on systems beyond Hadoop.

Example: Reporting system

The reporting system:

  • inputs HDFS data
  • outputs a report and archives it
  • attaches the report in an email and sends the email

This is shown in the diagrams below.

97472-hcc-cust-atlas-3-page-2-3.png

97473-hcc-cust-atlas-3-page-10.png

Implementation

ReportGenerator type is implemented as follows (one-time operation):

curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
  "enumDefs": [],
  "structDefs": [],
  "classificationDefs": [],
  "entityDefs": [
     {
      "superTypes": ["Process"],
      "name": "reportGenerator",
      "typeVersion": "1.0",
      "attributeDefs": [
         {
         "name": "qualifiedName",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "name",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
          "name": "inputs",
          "typeName": "array<DataSet>",
          "isOptional": true,
          "cardinality": "SET",
          "valuesMinCount": 0,
          "valuesMaxCount": 2147483647,
          "isUnique": false,
          "isIndexable": false,
          "includeInNotification": false
         },
         {
          "name": "outputs",
          "typeName": "array<DataSet>",
          "isOptional": true,
          "cardinality": "SET",
          "valuesMinCount": 0,
          "valuesMaxCount": 2147483647,
          "isUnique": false,
          "isIndexable": false,
          "includeInNotification": false
         },
         {
         "name": "reportGenRegistryUrl",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "reportGenVersion",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "reportGenType",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "reportGenHost",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
        }
      ]
      }
   ]
}'
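
To confirm that the type was registered, you can read it back by name. The following is a minimal sketch, assuming the Atlas v2 types/typedef/name/{name} endpoint is reachable on the same host and port as above; atlas_typedef_url is a hypothetical helper name, not part of Atlas.

```shell
# Hypothetical helper (not part of Atlas): build the v2 typedef lookup URL.
# $1 = Atlas host, $2 = type name
atlas_typedef_url() {
  printf 'http://%s:21000/api/atlas/v2/types/typedef/name/%s\n' "$1" "$2"
}

# Example against a live server (commented out because it needs network access):
# curl -u "${ATLAS_UU_PWD}" -sk "$(atlas_typedef_url "${ATLAS_HOST}" reportGenerator)"
```

The same helper should work for the report and email types defined later by changing the second argument.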

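ReportGenerator entity is implemented as follows. Its inputs and outputs reference the hdfs_path, report and email entities by qualifiedName, and Atlas must be able to resolve those references, so create the report and email entities (shown further below) before running this call: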

curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
  "entities": [
    {
      "typeName": "reportGenerator",
      "attributes": {
        "qualifiedName": "reportProcessor-v2.4@reportserver.genomiccompany.com",
        "name": "disease-risk-report-v1.3",
        "inputs": [{"uniqueAttributes": {"qualifiedName": "/data/genomics/variants/sample-AB15423@prodCluster"}, "typeName": "hdfs_path"}],
        "outputs": [
            {"uniqueAttributes": {"qualifiedName": "disease-risk-gen-variance@AB15423.pdf"}, "typeName": "report"},
            {"uniqueAttributes": {"qualifiedName": "joesmith@company.com@2018-11-12_09:54:12.432"}, "typeName": "email"}
            ],
        "reportGenRegistryUrl": "https://git@github.com/reportengines/genomics/predictive-general",
        "reportGenVersion": "2.4",
        "reportGenType": "variant-disease-risk",
        "reportGenHost": "reportserver.genomiccompany.com"
      }
    }
  ]
}'

Report type is implemented as follows (one-time operation):

curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
  "enumDefs": [],
  "structDefs": [],
  "classificationDefs": [],
  "entityDefs": [
     {
      "superTypes": ["DataSet"],
      "name": "report",
      "typeVersion": "1.0",
      "attributeDefs": [
         {
         "name": "qualifiedName",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": true,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "owner",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "name",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "reportName",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "reportVersion",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "reportFilename",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "reportStorageURL",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "reportStartTime",
         "typeName": "date",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "reportEndTime",
         "typeName": "date",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         }
     ]
     }
  ]
}'

Report entity is implemented as follows:

curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
  "entities": [
    {
      "typeName": "report",
      "attributes": {
        "qualifiedName": "disease-risk-gen-variance@AB15423.pdf",
        "owner": "jobscheduler",
        "name": "disease-risk-gen-variance",
        "reportName": "genomics disease risk report - sample AB15423",
        "reportVersion": "1.1",
        "reportFilename": "genomics-disease-AB15423.pdf",
        "reportStorageURL": "s3://genomics-disease/AB15423.pdf",
        "reportStartTime": "2018-11-12T09:54:12.432Z",
        "reportEndTime": "2018-11-12T09:54:14.341Z"
      }
    }
  ]
}'

Email type is implemented as follows (one-time operation):

curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
  "enumDefs": [],
  "structDefs": [],
  "classificationDefs": [],
  "entityDefs": [
     {
      "superTypes": ["DataSet"],
      "name": "email",
      "typeVersion": "1.0",
      "attributeDefs": [
         {
         "name": "qualifiedName",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": true,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "owner",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "name",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "emailTo",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "emailFrom",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "emailCc",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "emailBcc",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "emailSubject",
         "typeName": "string",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "emailAttachments",
         "typeName": "array<DataSet>",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         },
         {
         "name": "emailDate",
         "typeName": "date",
         "cardinality": "SINGLE",
         "isUnique": false,
         "isOptional": false,
         "isIndexable": true
         }
     ]
     }
  ]
}'

Email entity is implemented as follows:

curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
  "entities": [
    {
      "typeName": "email",
      "attributes": {
        "qualifiedName": "joesmith@company.com@2018-11-12_09:54:12.432",
        "owner": "jobscheduler",
        "name": "email",
        "emailTo": "drsmith@thehospital.com",
        "emailFrom": "me@genomicscompany.com",
        "emailCc": "archives@thehospital.com",
        "emailBcc": "",
        "emailSubject": "genomics disease risk report - patient AB15423",
        "emailAttachments": [{"uniqueAttributes": {"qualifiedName": "disease-risk-gen-variance@AB15423.pdf"}, "typeName": "report"}],
        "emailDate": "2018-11-12T09:54:14.000Z"
      }
    }
  ]
}'
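
After creating entities, a lookup by qualifiedName can confirm that each one exists and return its GUID. The sketch below assumes the Atlas v2 entity/uniqueAttribute/type/{typeName} endpoint with an attr:qualifiedName query parameter; urlencode and atlas_entity_url are hypothetical helper names, and the encoder handles ASCII input only.

```shell
# Percent-encode a string for use in a URL query (ASCII input assumed).
# The body runs in a subshell so its variables do not leak into the caller.
urlencode() (
  rest=$1
  encoded=
  while [ -n "$rest" ]; do
    tail=${rest#?}           # everything after the first character
    char=${rest%"$tail"}     # the first character itself
    rest=$tail
    case $char in
      [A-Za-z0-9._~-]) encoded="$encoded$char" ;;
      *) encoded="$encoded$(printf '%%%02X' "'$char")" ;;
    esac
  done
  printf '%s\n' "$encoded"
)

# Build the lookup URL. $1 = Atlas host, $2 = type name, $3 = qualifiedName
atlas_entity_url() {
  printf 'http://%s:21000/api/atlas/v2/entity/uniqueAttribute/type/%s?attr:qualifiedName=%s\n' \
    "$1" "$2" "$(urlencode "$3")"
}

# Example against a live server (commented out because it needs network access):
# curl -u "${ATLAS_UU_PWD}" -sk \
#   "$(atlas_entity_url "${ATLAS_HOST}" report 'disease-risk-gen-variance@AB15423.pdf')"
```

Encoding matters here because the qualifiedName values in this example contain @ and : characters.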

Results in Atlas UI

Lineage

Lineage shows the HDFS input file and the processing and output on the reporting system.

97474-screen-shot-2018-12-27-at-30608-pm.png
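
The same lineage can also be read over REST, which is useful for scripted checks. This is a sketch only, assuming the Atlas v2 lineage/{guid} endpoint with direction and depth query parameters; atlas_lineage_url is a hypothetical helper name, and the GUID shown is a placeholder for the value Atlas assigns to your entity.

```shell
# Hypothetical helper: build the v2 lineage URL for an entity GUID.
# $1 = Atlas host, $2 = entity GUID,
# $3 = direction (INPUT, OUTPUT or BOTH; defaults to BOTH), $4 = depth (defaults to 3)
atlas_lineage_url() {
  printf 'http://%s:21000/api/atlas/v2/lineage/%s?direction=%s&depth=%s\n' \
    "$1" "$2" "${3:-BOTH}" "${4:-3}"
}

# Example against a live server (commented out because it needs network access;
# REPORT_GENERATOR_GUID is a placeholder you would set from an entity lookup):
# curl -u "${ATLAS_UU_PWD}" -sk "$(atlas_lineage_url "${ATLAS_HOST}" "${REPORT_GENERATOR_GUID}")"
```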

Report Generator

When we click the blue gear, we see full metadata on the reporting engine, including the host machine and the URL to its artifacts (e.g. deployed binary, code). This is shown in the screenshot below.

97475-screen-shot-2018-12-27-at-31025-pm.png

Report

When we click the report output in the lineage, we see full metadata on the generated report, including filename, archive location and creation time. This is shown in the screenshot below.

97476-screen-shot-2018-12-27-at-31933-pm.png

Email

When we click the email output in the lineage, we see familiar information about an email, including to, from, cc, subject and date. This is shown in the screenshot below.

105404-screen-shot-2018-12-27-at-32238-pm.png

Note that the attachment field shows a clickable link to the attached report entity. Clicking this link leads to the same report screen as shown above.

Summary: What have we accomplished?

We can:

  • show a single lineage of data processing extending from Hadoop and continuing on non-Hadoop systems
  • represent metadata of processing and outputs on non-Hadoop systems, including in the example here: report engines, emails, and reports

The ideas here can be generalized to represent lineage and metadata of processing on any non-Hadoop system. Keep in mind also that you can continue the lineage across multiple systems both upstream and downstream from Hadoop, e.g. external -> Hadoop -> external -> external.
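
As one way to generalize, the entity payload for any Process-derived type has the same shape: a qualifiedName, a name, and inputs and outputs that reference other entities. The sketch below prints such a payload for a single input and output. The function name process_payload and the example types externalLoader and sftp_file are hypothetical; you would need to define matching types first, and the helper assumes values contain no double quotes or backslashes.

```shell
# Hypothetical helper: print an entity/bulk payload for a Process-derived entity
# with one input and one output, each referenced by qualifiedName.
# Arguments: process type, process qualifiedName, process name,
#            input type, input qualifiedName, output type, output qualifiedName.
# Values are inserted verbatim, so they must not contain double quotes or backslashes.
process_payload() {
  printf '{"entities": [{"typeName": "%s", "attributes": {' "$1"
  printf '"qualifiedName": "%s", "name": "%s", ' "$2" "$3"
  printf '"inputs": [{"typeName": "%s", "uniqueAttributes": {"qualifiedName": "%s"}}], ' "$4" "$5"
  printf '"outputs": [{"typeName": "%s", "uniqueAttributes": {"qualifiedName": "%s"}}]' "$6" "$7"
  printf '}}]}\n'
}

# Example: an upstream external loader that writes into HDFS (all names illustrative).
# Piping to a live server is commented out because it needs network access:
# process_payload externalLoader loader@host loader sftp_file in@host hdfs_path /data/x@prodCluster |
#   curl -u "${ATLAS_UU_PWD}" -sk -H "Content-Type: application/json" -X POST \
#     "http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk" -d @-
```

A type with additional mandatory attributes, such as reportGenerator above, would need those attributes added to the payload.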

Key points:

  • Atlas is hosted on Hadoop, but its REST API allows you to send processing knowledge from systems beyond Hadoop
  • Customized Atlas types let you integrate external system knowledge natively in Atlas
  • As a result, you can represent knowledge of your full data ecosystem in Atlas, including search and lineage.

So ... go out and tame your data landscape with centralized metadata in Atlas that reaches well beyond Hadoop!

Last update: 08-17-2019 05:15 AM