
gkeys · 01-30-2019

Customizing Atlas: Summary of Work to Date

Article	Key Points	Customized Type(s) Developed
Part 1: Model governance, traceability and registry	Quick primer on Atlas types, entities, attributes, lineage and search; quick primer on customizing Atlas; use the Atlas REST API to customize any type, entity or attribute you wish; customizations integrate seamlessly with out-of-the-box Atlas lineage and search capabilities; notes on operationalizing	model: represents your deployed data science and complex Spark ETL models (what was deployed, which version, when, what its concrete artifacts are, etc.)
Part 2: Deep source metadata & embedded entities	Use the Atlas REST API to customize any type, entity or attribute you wish; you can use a hyperlinked entity (vs. text) as the value of an attribute (embedded entity pattern); HDFS entities can hold deep metadata from the source	device: represents a hardware device (in this case, a gene sequencing device); gene_sequence: represents gene sequencing data landed in HDFS, as well as its source device and sequencing run back in the lab
Part 3: Lineage beyond Hadoop, including reports & emails	Atlas REST API calls can be sent from any networked system; this allows metadata from that system to be pushed to Atlas; this in turn allows entities beyond Hadoop to be represented natively in Atlas; therefore, Atlas metadata, search and lineage can span the data and infrastructure landscape	report_engine: represents a report-generating software deployment; report: represents a report generated by the report engine; email: represents an email that has been sent, including a hyperlink to the report entity as an email attachment

Goals of this Article

The goals of this article are to:

Summarize: combine the customizations and topics from all of the previous articles into one complex data pipeline/lineage example: a genomic analytics pipeline that runs from gene sequencing in the lab, through multi-step genomic analytics on Hadoop, to a report emailed to a clinician.
Demokit: provide a single-command shell script that builds 5 such pipelines in Atlas, so that you can explore Atlas' powerful customization, search, lineage and general governance capabilities. The demokit is available at this GitHub repo.

Background: Genomic Analytics Pipeline

A full genomic analytics pipeline is shown in the diagram below.

Steps in the pipeline briefly are:

[Lab] Device sequences blood sample and outputs sequence data to structured file of base pair sequences (often FASTQ format) and metadata file describing sequencing run. Sequence data ingested to HDFS.
[HDP/Hadoop] Primary analysis: sequence data at this point is structured as short segments that need to be aligned into chromosomal segments based on a reference genome. This is performed by a Spark-BWA model. Output is in BAM file format, saved to HDFS.
[HDP/Hadoop] Secondary analysis: base pairs that vary from the norm are identified and structured as location and variant in a VCF formatted file saved to HDFS. This is performed by a Spark GATK model.
[HDP/Hadoop] Tertiary analysis: predictions are made based on variants identified in previous step. Example here is disease risk. Input is VCF file and file with annotations that provide features (e.g. environmental exposure) for predictive model. Output is risk prediction represented as risk and probability, typically in simple csv format saved to HDFS.
[reporting] Simple csv is converted to consumable report by reporting engine.
[reporting] Report is archived and attached to email which is sent to clinician to advise on next steps for patient who provided sample in step 1.
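Atlas expresses lineage as processes that connect input datasets to output datasets. The steps above can be sketched in that shape as a small, standalone Python structure. This is only an illustration of the idea: every process and dataset label below is a hypothetical name chosen for readability, not a name used by the demokit or by Atlas itself.

```python
from collections import deque

# Each stage mirrors how lineage is modeled: a process with inputs and outputs.
# All labels are hypothetical and exist only for this illustration.
STAGES = [
    {"process": "sequencing_run", "inputs": ["blood_sample"], "outputs": ["fastq"]},
    {"process": "spark_bwa", "inputs": ["fastq", "reference_genome"], "outputs": ["bam"]},
    {"process": "spark_gatk", "inputs": ["bam"], "outputs": ["vcf"]},
    {"process": "risk_model", "inputs": ["vcf", "annotations"], "outputs": ["risk_csv"]},
    {"process": "report_engine", "inputs": ["risk_csv"], "outputs": ["report"]},
    {"process": "send_email", "inputs": ["report"], "outputs": ["email"]},
]


def upstream(dataset, stages=STAGES):
    """Return every dataset the given dataset was derived from, nearest first."""
    # Map each output dataset to the stage that produced it.
    producers = {out: stage for stage in stages for out in stage["outputs"]}
    seen, queue = [], deque([dataset])
    while queue:
        current = queue.popleft()
        stage = producers.get(current)
        if stage is None:
            continue  # a source dataset with no recorded producer
        for parent in stage["inputs"]:
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen
```

Calling upstream("email") walks back through the report, the risk csv, the VCF and BAM files to the original sequence data, which is the same backward trace that a lineage view gives you for an emailed report.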

This will be represented in Atlas search and lineage as below (which is elaborated in the rest of the article).

Demokit

The demokit repo provides instructions, which are quite simple: 1) set up a cluster (or sandbox), 2) on your local machine, export two environment variables and then run one script with no input params.

Running the demokit generates 5 such pipeline/lineage instances. If we do an unfiltered search on the gene_sequence type, for example, we get the results below. Clicking on the name of any search result allows a view of a single lineage as shown above.

Customized Atlas Entities in Genomic Analytics Pipeline/Lineage

The diagram below shows how customized types are represented in the pipeline/lineage.

The table that follows elaborates on each customized type.

Customized Type/Entity	Entity Represents [Platform]	Searchable Attributes	Article #
device	gene sequencing device [lab]	deviceDecomDate, deviceId, deviceImplemDate, deviceMake, deviceModel, deviceType, name	2
gene_sequence	raw sequence data ingested from device output [hadoop]	device (embedded, device), deviceQualifiedName, name, path, runEndTime, runReads, runSampleId, runStartTime, runTechnician	2
model	models used in primary, secondary, tertiary analytics [hadoop]	deployDate, deployHostDetail, deployHostType, deployObjSource, modelDescription, modelEndTime, modelName, modelOwnerLob, modelRegistryUrl, modelStartTime, modelVersion, name	1
report_engine	engine that generates report [reporting platform]	name, reportEngHost, reportEngRegistryUrl, reportEngType, reportEngVersion	3
report	generated report [reporting platform]	name, reportEndTime, reportFilename, reportName, reportStartTime, reportStorageUrl, reportVersion	3
email	email sent to doctor, with report attachment [reporting platform]	emailAttachment (embedded, report), emailBcc, emailCc, emailDate, emailFrom, emailSubject, emailTo, name	3
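To make the table more concrete, here is a minimal sketch of what a type definition payload for the device type might look like when built in Python. It is not the demokit's actual definition. The super type (DataSet), the choice of string for every attribute, and the attribute options are all assumptions made for illustration, and the earlier articles remain the reference for the real definitions. The name attribute is left out because it is assumed to be inherited from the super type.

```python
import json


def entity_def(name, attributes, super_type="DataSet"):
    """Build one entity type definition whose attributes are optional strings."""
    return {
        "category": "ENTITY",
        "name": name,
        "superTypes": [super_type],  # assumption: the real super type may differ
        "attributeDefs": [
            {
                "name": attr,
                "typeName": "string",  # assumption: real attribute types may differ
                "isOptional": True,
                "cardinality": "SINGLE",
                "isUnique": False,
                "isIndexable": True,
            }
            for attr in attributes
        ],
    }


# Attribute names taken from the table above ("name" is assumed to be inherited).
DEVICE_ATTRS = [
    "deviceDecomDate", "deviceId", "deviceImplemDate",
    "deviceMake", "deviceModel", "deviceType",
]

payload = {"entityDefs": [entity_def("device", DEVICE_ATTRS)]}
body = json.dumps(payload, indent=2)  # JSON text ready to send as a request body
```

A body like this would typically be sent with a POST to the /api/atlas/v2/types/typedefs endpoint of your Atlas server. For the embedded entity pattern, an attribute such as gene_sequence.device would use the referenced type name (device) as its typeName instead of string.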

Atlas Search Examples

The following are examples of searches you can run against pipelines (pseudocode here). Run the demokit and try the examples yourself.

all pipelines where gene_sequence.runTechnician=Wenwan_Jiao
all pipelines where email.emailTo=DrSmith@thehospital.com
all pipelines where gene_sequence.deviceQualifiedName contains 'iSeq100' (model of device)
all pipelines where model.modelName=genomics-HAIL and model.modelStartTime >= '01/14/2019 12:00 AM' and model.modelStartTime <= '01/21/2019 12:00 AM'
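As a rough sketch, the pseudocode above could be translated into request bodies for Atlas basic search over REST (a POST to /api/atlas/v2/search/basic). The helper functions are hypothetical, and the attribute names come from the table of customized types (for example, runTechnician on gene_sequence). How the date values must be formatted depends on how modelStartTime is typed, so the date strings below are placeholders copied from the pseudocode rather than a verified format.

```python
def criterion(attribute, operator, value):
    """One attribute filter for a basic search request."""
    return {"attributeName": attribute, "operator": operator, "attributeValue": value}


def basic_search(type_name, criteria, condition="AND", limit=25):
    """Assemble a basic search request body for one entity type."""
    return {
        "typeName": type_name,
        "excludeDeletedEntities": True,
        "entityFilters": {"condition": condition, "criterion": criteria},
        "limit": limit,
        "offset": 0,
    }


by_technician = basic_search(
    "gene_sequence", [criterion("runTechnician", "eq", "Wenwan_Jiao")]
)

by_recipient = basic_search(
    "email", [criterion("emailTo", "eq", "DrSmith@thehospital.com")]
)

by_device_model = basic_search(
    "gene_sequence", [criterion("deviceQualifiedName", "contains", "iSeq100")]
)

# Assumption: if modelStartTime is a date attribute, Atlas may expect
# epoch milliseconds here instead of these display strings.
by_model_and_window = basic_search(
    "model",
    [
        criterion("modelName", "eq", "genomics-HAIL"),
        criterion("modelStartTime", "gte", "01/14/2019 12:00 AM"),
        criterion("modelStartTime", "lte", "01/21/2019 12:00 AM"),
    ],
)
```

Each search returns entities of the named type. Opening any result then shows the pipeline it belongs to in the lineage view.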

Keep in mind that Atlas search can involve multiple constructs and can become quite complex. Search can be conducted from:

the UI as basic search (using the funnel icon is the most powerful)
the UI as advanced search (DSL)
the REST API
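The advanced search (DSL) option is also reachable over REST. Below is a minimal sketch that only builds the request URL for a DSL query. The host name is hypothetical, 21000 is assumed to be the Atlas port, and the query is one plausible DSL rendering of the first search example rather than a query taken from the demokit.

```python
from urllib.parse import urlencode

ATLAS_URL = "http://atlas.example.com:21000"  # hypothetical host, assumed port


def dsl_search_url(query, limit=25, base_url=ATLAS_URL):
    """Build the GET URL for a DSL search, URL-encoding the query text."""
    params = urlencode({"query": query, "limit": limit})
    return f"{base_url}/api/atlas/v2/search/dsl?{params}"


url = dsl_search_url('gene_sequence where runTechnician = "Wenwan_Jiao"')
```

The resulting URL can be requested with any HTTP client using your Atlas credentials. The same query text can be pasted into the advanced search box in the UI.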

Conclusion

I hope these articles have given you an appreciation for how easily customizable Atlas is to represent metadata and lineage across your data and infrastructure landscape, and how powerful it is to search against it.

Keep in mind that we have not even covered classification (tags), tag-based Ranger policies and business glossary. These additional capabilities cement Atlas as a powerful tool to understand and manage the growing and complex world of data you live in.

Atlas is an outstanding governance tool to understand and manage your data landscape at scale ... and to easily customize governance specifically to your needs while seamlessly integrating Atlas' out-of-the-box search, lineage, classification and business glossary capabilities.

The only thing holding you back is your imagination 🙂

Cloudera Community