Created on 12-27-2018 09:02 PM - edited 08-17-2019 05:15 AM
In Customizing Atlas (Part1): Model governance, traceability and registry I provided a brief overview of Atlas types and entities and showed how to customize them to fit your needs. I showed the specific example of a Model type used to govern your deployed data science models and complex Spark code.
In Customizing Atlas (Part2): Deep source metadata and embedded entities I showed how to customize Atlas to hold knowledge of ingested data that goes well beyond the data itself, e.g. details of the device that generated the data. I also showed how to implement the pattern of embedding an entity (rather than a string) as an attribute value in your custom type. The result is a clickable hyperlink in the UI that opens that entity and its metadata.
In this post I will:
The main concepts are summarized below.
The Atlas lineage includes processing and outputs on systems beyond Hadoop.
The reporting system:
This is shown in the diagram below.
ReportGenerator type is implemented as follows (one-time operation):
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{ "enumDefs": [], "structDefs": [], "classificationDefs": [], "entityDefs": [ { "superTypes": ["Process"], "name": "reportGenerator", "typeVersion": "1.0", "attributeDefs": [ { "name": "qualifiedName", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "name", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "inputs", "typeName": "array<DataSet>", "isOptional": true, "cardinality": "SET", "valuesMinCount": 0, "valuesMaxCount": 2147483647, "isUnique": false, "isIndexable": false, "includeInNotification": false }, { "name": "outputs", "typeName": "array<DataSet>", "isOptional": true, "cardinality": "SET", "valuesMinCount": 0, "valuesMaxCount": 2147483647, "isUnique": false, "isIndexable": false, "includeInNotification": false }, { "name": "reportGenRegistryUrl", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "reportGenVersion", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "reportGenType", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "reportGenHost", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true } ] } ] }'
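The typedef above repeats the same six-field block for every string attribute. As a minimal sketch, assuming a POSIX shell, that block can be generated rather than typed by hand. The names `string_attr` and `ATTRS` are hypothetical helpers for illustration and are not part of Atlas; only the four custom reportGenerator attributes are generated here.

```shell
# Hypothetical helper (not part of Atlas): print one attributeDef for a
# required, indexable, non-unique string attribute, matching the pattern
# used in the reportGenerator typedef.
string_attr() {
  # $1 = attribute name
  printf '{ "name": "%s", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }' "$1"
}

# Join the custom reportGenerator attributes with commas.
ATTRS=""
for attr in reportGenRegistryUrl reportGenVersion reportGenType reportGenHost; do
  ATTRS="${ATTRS:+$ATTRS, }$(string_attr "$attr")"
done

# Print the fragment so it can be placed inside the "attributeDefs" array
# of the typedef payload.
printf '%s\n' "$ATTRS"
```

The printed fragment covers only the custom attributes; `qualifiedName`, `name`, `inputs` and `outputs` still need to be written out as in the full command.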
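ReportGenerator entity is implemented as follows. Its inputs and outputs reference other entities by qualifiedName, so the HDFS path, report and email entities need to exist in Atlas before this call is made (the report and email entities are created further down):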
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{ "entities": [ { "typeName": "reportGenerator", "attributes": { "qualifiedName": "reportProcessor-v2.4@reportserver.genomiccompany.com", "name": "disease-risk-report-v1.3", "inputs": [{"uniqueAttributes": {"qualifiedName": "/data/genomics/variants/sample-AB15423@prodCluster"}, "typeName": "hdfs_path"}], "outputs": [ {"uniqueAttributes": {"qualifiedName": "disease-risk-gen-variance@AB15423.pdf"}, "typeName": "report"}, {"uniqueAttributes": {"qualifiedName": "joesmith@company.com@2018-11-12_09:54:12.432"}, "typeName": "email"} ], "reportGenRegistryUrl": "https://git@github.com/reportengines/genomics/predictive-general", "reportGenVersion": "2.4", "reportGenType": "variant-disease-risk", "reportGenHost": "reportserver.genomiccompany.com" } } ] }'
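Note that `inputs` and `outputs` in the payload above do not carry data; each item is a reference that identifies another entity by its type and qualifiedName. A small sketch of building those references, assuming a POSIX shell (`entity_ref`, `INPUTS` and `OUTPUTS` are hypothetical names used only for illustration):

```shell
# Hypothetical helper: print an Atlas object reference that identifies an
# entity by type and qualifiedName, in the form used by "inputs"/"outputs".
# Values containing double quotes or backslashes would need JSON escaping,
# which this sketch does not do.
entity_ref() {
  # $1 = typeName, $2 = qualifiedName
  printf '{"uniqueAttributes": {"qualifiedName": "%s"}, "typeName": "%s"}' "$2" "$1"
}

# One HDFS input, two outputs (the report and the email), as in the post.
INPUTS="[$(entity_ref hdfs_path '/data/genomics/variants/sample-AB15423@prodCluster')]"
OUTPUTS="[$(entity_ref report 'disease-risk-gen-variance@AB15423.pdf'), $(entity_ref email 'joesmith@company.com@2018-11-12_09:54:12.432')]"

# Print the two attribute fragments for the entity payload.
printf '"inputs": %s,\n"outputs": %s\n' "$INPUTS" "$OUTPUTS"
```

The qualifiedName passed to each reference has to match the target entity's qualifiedName exactly, otherwise the reference cannot be resolved.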
Report type is implemented as follows (one-time operation):
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{ "enumDefs": [], "structDefs": [], "classificationDefs": [], "entityDefs": [ { "superTypes": ["DataSet"], "name": "report", "typeVersion": "1.0", "attributeDefs": [ { "name": "qualifiedName", "typeName": "string", "cardinality": "SINGLE", "isUnique": true, "isOptional": false, "isIndexable": true }, { "name": "owner", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "name", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "reportName", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "reportVersion", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "reportFilename", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "reportStorageURL", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "reportStartTime", "typeName": "date", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "reportEndTime", "typeName": "date", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true } ] } ] }'
Report entity is implemented as follows:
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{ "entities": [ { "typeName": "report", "attributes": { "qualifiedName": "disease-risk-gen-variance@AB15423.pdf", "owner": "jobscheduler", "name": "disease-risk-gen-variance", "reportName": "genomics disease risk report - sample AB15423", "reportVersion": "1.1", "reportFilename": "genomics-disease-AB15423.pdf", "reportStorageURL": "s3://genomics-disease/AB15423.pdf", "reportStartTime": "2018-11-12T09:54:12.432Z", "reportEndTime": "2018-11-12T09:54:14.341Z" } } ] }'
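After creating an entity it can be useful to confirm that Atlas stored it under the expected qualifiedName. The following is a sketch, assuming the Atlas v2 lookup-by-unique-attribute endpoint and the same `ATLAS_HOST` and `ATLAS_UU_PWD` variables used in this post; `lookup_url` and `get_entity` are hypothetical helper names.

```shell
# Build the lookup URL for an entity type. The qualifiedName is passed
# separately so that curl can URL-encode it.
lookup_url() {
  # $1 = Atlas host, $2 = typeName
  printf 'http://%s:21000/api/atlas/v2/entity/uniqueAttribute/type/%s' "$1" "$2"
}

# Fetch an entity by type and qualifiedName. Defined but not called here,
# because it needs a reachable Atlas server.
get_entity() {
  # $1 = typeName, $2 = qualifiedName
  curl -u "${ATLAS_UU_PWD}" -sk -G "$(lookup_url "${ATLAS_HOST}" "$1")" \
    --data-urlencode "attr:qualifiedName=$2"
}
```

For example, `get_entity report 'disease-risk-gen-variance@AB15423.pdf'` should return the report entity created above, including the GUID that Atlas assigned to it.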
Email type is implemented as follows (one-time operation):
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{ "enumDefs": [], "structDefs": [], "classificationDefs": [], "entityDefs": [ { "superTypes": ["DataSet"], "name": "email", "typeVersion": "1.0", "attributeDefs": [ { "name": "qualifiedName", "typeName": "string", "cardinality": "SINGLE", "isUnique": true, "isOptional": false, "isIndexable": true }, { "name": "owner", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "name", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "emailTo", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "emailFrom", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "emailCc", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "emailBcc", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "emailSubject", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "emailAttachments", "typeName": "array<DataSet>", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }, { "name": "emailDate", "typeName": "date", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true } ] } ] }'
Email entity is implemented as follows:
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{ "entities": [ { "typeName": "email", "attributes": { "qualifiedName": "joesmith@company.com@2018-11-12_09:54:12.432", "owner": "jobscheduler", "name": "email", "emailTo": "drsmith@thehospital.com", "emailFrom": "me@genomicscompany.com", "emailCc": "archives@thehospital.com", "emailBcc": "", "emailSubject": "genomics disease risk report - patient AB15423", "emailAttachments": [{"uniqueAttributes": {"qualifiedName": "disease-risk-gen-variance@AB15423.pdf"}, "typeName": "report"}], "emailDate": "2018-11-12T09:54:14.000Z" } } ] }'
Lineage shows the HDFS input file and the processing and output on the reporting system.
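The same lineage that the UI draws can also be retrieved as JSON. Below is a minimal sketch, assuming the Atlas v2 lineage endpoint and the `ATLAS_HOST` and `ATLAS_UU_PWD` variables used throughout this post; `lineage_url` and `get_lineage` are hypothetical helper names, and the GUID is whatever Atlas assigned to the entity of interest.

```shell
# Build the lineage URL for an entity GUID.
lineage_url() {
  # $1 = Atlas host, $2 = entity GUID,
  # $3 = direction (INPUT, OUTPUT or BOTH; defaults to BOTH),
  # $4 = depth in hops (defaults to 3)
  printf 'http://%s:21000/api/atlas/v2/lineage/%s?direction=%s&depth=%s' \
    "$1" "$2" "${3:-BOTH}" "${4:-3}"
}

# Fetch lineage as JSON. Defined but not called here, because it needs a
# reachable Atlas server. Arguments: GUID, then optional direction and depth.
get_lineage() {
  curl -u "${ATLAS_UU_PWD}" -sk "$(lineage_url "${ATLAS_HOST}" "$@")"
}
```

Calling `get_lineage <guid>` with the GUID of the report entity should return the upstream HDFS path and the reportGenerator process that connects them.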
When we click the blue gear we see full metadata on the reporting engine, including the host machine and the URL to its artifacts (e.g. deployed binary, code, etc). This is shown in the screenshot below.
From the lineage, when we click the report output we see full metadata on the generated report, including filename, archive location and creation time. This is shown in the screenshot below.
From the lineage, when we click the email output we see familiar information about an email, including to, from, cc, subject and date. This is shown in the screenshot below.
Note that the attachment field shows a clickable link to the report entity that is attached. Clicking this link leads to the same report screen as shown above.
The ideas here can be generalized to represent lineage and metadata of processing on any non-Hadoop system. Keep in mind also that you can extend the lineage across multiple systems both upstream and downstream from Hadoop, e.g. external -> Hadoop -> external -> external.
So ... go out and tame your data landscape with centralized metadata in Atlas that reaches well beyond Hadoop!