Created on 01-30-2019 07:21 PM - edited 08-17-2019 04:52 AM
Article | Key Points | Customized Type(s) Developed |
Part 1: Model governance, traceability and registry |
Part 2: Deep source metadata & embedded entities |
Part 3: Lineage beyond Hadoop, including reports & emails |
Goals of this article are to:
A full genomic analytics pipeline is shown in the diagram below.
Steps in the pipeline briefly are:
This will be represented in Atlas search and lineage as below (which is elaborated in the rest of the article).
The demokit repo provides instructions, which are quite simple: 1) set up a cluster (or sandbox), 2) on your local machine, export two environment variables and then run one script with no input params.
Running the demokit generates 5 such pipeline/lineage instances. If we do an unfiltered search on the gene_sequence type, for example, we get the results below. Clicking on the name of any search result allows a view of a single lineage as shown above.
The diagram below shows how customized types are represented in the pipeline/lineage.
The table that follows elaborates on each customized type.
Customized Type / Entity | Entity represents: [platform] | Searchable Attributes | Article # |
device | gene sequencing device [lab] | deviceDecomDate, deviceId, deviceImplemDate, deviceMake, deviceModel, deviceType, name | 2 |
gene_sequence | raw sequence data ingested from device output [hadoop] | device (embedded, device), deviceQualifiedName, name, path, runEndTime, runReads, runSampleId, runStartTime, runTechnician | 2 |
model | models used in primary, secondary, tertiary analytics [hadoop] | deployDate, deployHostDetail, deployHostType, deployObjSource, modelDescription, modelEndTime, modelName, modelOwnerLob, modelRegistryUrl, modelStartTime, modelVersion, name | 1 |
report_engine | engine that generates report [reporting platform] | name, reportEngHost, reportEngRegistryUrl, reportEngType, reportEngVersion | 3 |
report | generated report [reporting platform] | name, reportEndTime, reportFilename, reportName, reportStartTime, reportStorageUrl, reportVersion | 3 |
email | email sent to doctor, with report attachment [reporting platform] | emailAttachment (embedded, report), emailBcc, emailCc, emailDate, emailFrom, emailSubject, emailTo, name | 3 |
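To make the table more concrete, here is a minimal sketch of how one of these customized types might be expressed as an Atlas v2 type definition. The attribute names are taken from the device row above. Everything else is an assumption for illustration and may differ from what the demokit actually does: the data types ("string", "date"), the DataSet supertype (from which `name` would be inherited), and the helper names `attribute_def` and `device_typedef`.

```python
import json

# Attribute names come from the device row of the table above.
# The data types and the DataSet supertype are assumptions, not demokit facts.
DEVICE_ATTRIBUTES = {
    "deviceId": "string",
    "deviceType": "string",
    "deviceMake": "string",
    "deviceModel": "string",
    "deviceImplemDate": "date",
    "deviceDecomDate": "date",
}


def attribute_def(name, type_name):
    """Build one attribute definition in the Atlas v2 typedef shape."""
    return {
        "name": name,
        "typeName": type_name,
        "isOptional": True,
        "cardinality": "SINGLE",
        "isUnique": False,
        "isIndexable": True,
    }


def device_typedef():
    """Build a typedef payload describing a custom `device` entity type."""
    return {
        "entityDefs": [
            {
                "name": "device",
                # Assumed supertype; `name` is inherited rather than declared here.
                "superTypes": ["DataSet"],
                "attributeDefs": [
                    attribute_def(n, t) for n, t in DEVICE_ATTRIBUTES.items()
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(device_typedef(), indent=2))
```

A payload of this shape would typically be sent as a POST to the `/api/atlas/v2/types/typedefs` endpoint of your Atlas server. The other types in the table would follow the same pattern with their own attribute lists.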
The following are examples of searches you can run against the pipelines (pseudocode here). Run the demokit and try the examples yourself.
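As a rough sketch of what such searches could look like against the Atlas REST API, the snippet below builds a DSL search URL and a basic search request body for the customized types. It only constructs the requests and does not send them. The host and port in `ATLAS_URL`, the helper names, and the attribute values ("Jane Doe", "1.1") are hypothetical and should be replaced with values from your own cluster.

```python
from urllib.parse import urlencode

ATLAS_URL = "http://localhost:21000"  # assumed host and port; adjust to your cluster


def dsl_search_url(query, limit=25):
    """Return the GET URL for an Atlas DSL search."""
    params = urlencode({"query": query, "limit": limit})
    return f"{ATLAS_URL}/api/atlas/v2/search/dsl?{params}"


def basic_search_body(type_name, attribute, operator, value, limit=25):
    """Return the JSON body for a POST to /api/atlas/v2/search/basic."""
    return {
        "typeName": type_name,
        "excludeDeletedEntities": True,
        "limit": limit,
        "entityFilters": {
            "attributeName": attribute,
            "operator": operator,
            "attributeValue": value,
        },
    }


# Unfiltered search on a customized type.
all_sequences = dsl_search_url("gene_sequence")

# Filter on a searchable attribute (the technician name is a made-up value).
by_technician = dsl_search_url('gene_sequence where runTechnician = "Jane Doe"')

# The same idea as a basic search body (the version value is a made-up value).
models_v1_1 = basic_search_body("model", "modelVersion", "eq", "1.1")

if __name__ == "__main__":
    print(all_sequences)
    print(by_technician)
    print(models_v1_1)
```

The URLs can be fetched with any HTTP client using your Atlas credentials, and the body can be posted as JSON to the basic search endpoint.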
Keep in mind that Atlas search can involve multiple constructs and can become quite complex. Search can be conducted from:
I hope these articles have given you an appreciation for how easily customizable Atlas is to represent metadata and lineage across your data and infrastructure landscape, and how powerful it is to search against it.
Keep in mind that we have not even covered classification (tags), tag-based Ranger policies and business glossary. These additional capabilities cement Atlas as a powerful tool to understand and manage the growing and complex world of data you live in.
Atlas is an outstanding governance tool for understanding and managing your data landscape at scale ... and for easily customizing governance to your specific needs while seamlessly integrating Atlas' out-of-the-box search, lineage, classification and business glossary capabilities.
The only thing holding you back is your imagination 🙂