Atlas REST API calls can be sent from any networked system
This allows metadata from that system to be pushed to Atlas
This allows entities beyond Hadoop to be represented natively in Atlas
Therefore, Atlas metadata, search and lineage can span across the data and infrastructure landscape
report_engine to represent a report generating software deployment
report to represent a report generated by the report engine
email to represent an email that has been sent, including hyperlink to report entity as an email attachment
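As a minimal sketch of what "sent from any networked system" can look like, the snippet below builds (but does not send) an HTTP request that would create one entity through the Atlas v2 entity endpoint. The host name, credentials, and attribute values are placeholders assumed for illustration; they do not come from the demokit.

```python
import base64
import json
import urllib.request


def build_entity_request(atlas_url, user, password, type_name, attributes):
    """Build (but do not send) a request that creates one Atlas entity."""
    body = {"entity": {"typeName": type_name, "attributes": attributes}}
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{atlas_url.rstrip('/')}/api/atlas/v2/entity",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder host, credentials and attribute values (assumptions).
request = build_entity_request(
    "http://atlas.example.com:21000",
    "admin",
    "admin",
    "report_engine",
    {"qualifiedName": "report_engine@example", "name": "example report engine"},
)

# To actually push the metadata: urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
```

Because the request is plain HTTP with a JSON body, the same call can be made from a lab system, a reporting server, or any other host that can reach Atlas.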
Goals of this Article
Goals of this article are to:
Summarize: combine all of the previous article customizations and topics into a complex data pipeline/lineage example: genomic analytics pipeline from gene sequencing in the lab, multi-step genomic analytics on Hadoop, to report emailed to clinician
Demokit: provide a single-command shell script that builds 5 such pipelines in Atlas, which then allows you to explore Atlas' powerful customization, search, lineage and general governance capabilities. The demokit is available at this GitHub repo.
Background: Genomic Analytics Pipeline
A full genomic analytics pipeline is shown in the diagram below.
Steps in the pipeline briefly are:
[Lab] Device sequences blood sample and outputs sequence data to structured file of base pair sequences (often FASTQ format) and metadata file describing sequencing run. Sequence data ingested to HDFS.
[HDP/Hadoop] Primary analysis: sequence data at this point is structured as short segments that need to be aligned into chromosomal segments based on a reference genome. This is performed by a Spark-BWA model. Output is BAM file format saved to HDFS.
[HDP/Hadoop] Secondary analysis: base pairs that vary from the norm are identified and structured as location and variant in a VCF formatted file saved to HDFS. This is performed by a Spark GATK model.
[HDP/Hadoop] Tertiary analysis: predictions are made based on variants identified in previous step. Example here is disease risk. Input is VCF file and file with annotations that provide features (e.g. environmental exposure) for predictive model. Output is risk prediction represented as risk and probability, typically in simple csv format saved to HDFS.
[reporting] Simple csv is converted to consumable report by reporting engine.
[reporting] Report is archived and attached to email which is sent to clinician to advise on next steps for patient who provided sample in step 1.
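The six steps above can be sketched as a chain of processes, each reading inputs and writing outputs, which is the shape lineage takes. This is only an illustrative model: every process and dataset name below is a hypothetical label, not a name used by Atlas or the demokit.

```python
# Each stage is a process that reads inputs and writes outputs.
# All names are hypothetical labels chosen for this sketch.
PIPELINE = [
    {"process": "sequencing_run",
     "inputs": ["blood_sample"], "outputs": ["sequence.fastq"]},
    {"process": "spark_bwa_alignment",
     "inputs": ["sequence.fastq", "reference_genome"], "outputs": ["aligned.bam"]},
    {"process": "spark_gatk_variant_calling",
     "inputs": ["aligned.bam"], "outputs": ["variants.vcf"]},
    {"process": "risk_prediction",
     "inputs": ["variants.vcf", "annotations"], "outputs": ["risk.csv"]},
    {"process": "report_generation",
     "inputs": ["risk.csv"], "outputs": ["report"]},
    {"process": "email_send",
     "inputs": ["report"], "outputs": ["email"]},
]


def upstream(target, pipeline=PIPELINE):
    """Return everything the target was derived from, nearest first."""
    producers = {out: stage for stage in pipeline for out in stage["outputs"]}
    seen, queue = [], [target]
    while queue:
        current = queue.pop(0)
        stage = producers.get(current)
        if stage is None:
            continue  # a source with no producing process
        for source in stage["inputs"]:
            if source not in seen:
                seen.append(source)
                queue.append(source)
    return seen


print(upstream("email"))
```

Walking upstream from the email reaches the report, the risk csv, the VCF and its annotations, the BAM file, the sequence data and reference genome, and finally the blood sample.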
The pipeline will be represented in Atlas search and lineage as shown below (and elaborated in the rest of the article).
The demokit repo provides instructions, which are quite simple: 1) set up a cluster (or sandbox), 2) on your local machine, export two environment variables and then run one script with no input params.
Running the demokit generates 5 such pipeline/lineage instances. If we do an unfiltered search on the gene_sequence type, for example, we get the results below. Clicking on the name of any search result allows a view of a single lineage as shown above.
Customized Atlas Entities in Genomic Analytics Pipeline/Lineage
The diagram below shows how customized types are represented in the pipeline/lineage.
The table that follows elaborates on each customized type.
Customized Type/Entity
Entity represents: [platform]
gene sequencing device [lab]
deviceDecomDate, deviceId, deviceImplemDate, deviceMake, deviceModel, deviceType, name
raw sequence data ingested from device output [hadoop]
name, reportEngHost, reportEngRegistryUrl, reportEngType, reportEngVersion
generated report [reporting platform]
name, reportEndTime, reportFilename, reportName, reportStartTime, reportStorageUrl, reportVersion
email sent to doctor, with report attachment [reporting platform]
emailAttachment (embedded, report), emailBcc, emailCc, emailDate, emailFrom, emailSubject, emailTo, name
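To make the idea of a customized type concrete, here is a hedged sketch of a type definition payload for the email type, in the shape accepted by the Atlas v2 typedefs endpoint (POST to /api/atlas/v2/types/typedefs). The supertype, the attribute data types, and the optional/indexable flags are assumptions made for illustration; the actual definitions used by the demokit may differ.

```python
import json

# Attribute names taken from the table above; "name" is assumed to be
# inherited from the supertype, so it is not redefined here.
EMAIL_ATTRIBUTES = [
    "emailBcc", "emailCc", "emailDate", "emailFrom", "emailSubject", "emailTo",
]


def attribute_def(name, type_name="string"):
    """One attribute definition; all flags here are assumed defaults."""
    return {
        "name": name,
        "typeName": type_name,
        "isOptional": True,
        "cardinality": "SINGLE",
        "isUnique": False,
        "isIndexable": True,
    }


def email_typedef():
    """Build a typedefs payload that defines the email entity type."""
    attrs = [attribute_def(n) for n in EMAIL_ATTRIBUTES]
    # The attachment references a report entity (assumed type name: report).
    attrs.append(attribute_def("emailAttachment", "report"))
    return {
        "entityDefs": [
            {
                "name": "email",
                "superTypes": ["DataSet"],  # assumption
                "attributeDefs": attrs,
            }
        ]
    }


payload = email_typedef()
print(json.dumps(payload, indent=2))
```

Typing emailAttachment as a report rather than a string is what would let the email entity link to the report entity instead of merely naming it.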
Atlas Search Examples
The following are examples of searches you can do against pipelines (pseudocode here). Run the demokit and try the examples yourself.
all pipelines where gene_sequence.technician=Wenwan_Jiao
all pipelines where email.emailTo=DrSmith@thehospital.com
all pipelines where gene_sequence.deviceQualifiedName contains 'iSeq100' (model of device)
all pipelines where model.modelName=genomics-HAIL and model.modelStartTime >= '01/14/2019 12:00 AM' and model.modelStartTime <= '01/21/2019 12:00 AM'
Keep in mind that Atlas search can involve multiple constructs and can become quite complex. Search can be conducted from:
the UI as basic search (using the funnel icon is the most powerful)
the UI as advanced search (DSL)
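The pseudocode searches above can also be expressed as a request body for the Atlas v2 basic search endpoint (POST to /api/atlas/v2/search/basic). The sketch below builds the body for the model search with a date range. It assumes that modelStartTime is stored as a date attribute filtered by epoch milliseconds and that the dates are in UTC; neither is confirmed by the article.

```python
import json
from datetime import datetime, timezone


def to_epoch_millis(text):
    """Convert 'MM/DD/YYYY HH:MM AM' to epoch milliseconds (UTC assumed)."""
    parsed = datetime.strptime(text, "%m/%d/%Y %I:%M %p")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)


def basic_search(type_name, criteria):
    """Build a basic search body that ANDs (attribute, operator, value) filters."""
    return {
        "typeName": type_name,
        "excludeDeletedEntities": True,
        "entityFilters": {
            "condition": "AND",
            "criterion": [
                {"attributeName": a, "operator": op, "attributeValue": str(v)}
                for a, op, v in criteria
            ],
        },
    }


query = basic_search("model", [
    ("modelName", "eq", "genomics-HAIL"),
    ("modelStartTime", "gte", to_epoch_millis("01/14/2019 12:00 AM")),
    ("modelStartTime", "lte", to_epoch_millis("01/21/2019 12:00 AM")),
])
print(json.dumps(query, indent=2))
```

In advanced (DSL) search, a simpler case such as the email example would be written along the lines of `email where emailTo = "DrSmith@thehospital.com"`.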
I hope these articles have given you an appreciation for how easily customizable Atlas is to represent metadata and lineage across your data and infrastructure landscape, and how powerful it is to search against it.
Keep in mind that we have not even covered classification (tags), tag-based Ranger policies and business glossary. These additional capabilities cement Atlas as a powerful tool to understand and manage the growing and complex world of data you live in.
Atlas is an outstanding governance tool to understand and manage your data landscape at scale ... and to easily customize governance specifically to your needs while seamlessly integrating Atlas' out-of-the-box search, lineage, classification and business glossary capabilities.
The only thing holding you back is your imagination.