Customizing Atlas: Summary of Work to Date

Part 1: Model governance, traceability and registry
  Key Points
  • Quick primer on Atlas types, entities, attributes, lineage, search
  • Quick primer on customizing Atlas
  • Use the Atlas REST API to customize any type, entity, or attribute you wish
  • Customizations integrate seamlessly with out of the box Atlas lineage and search capabilities
  • Notes on operationalizing
  Customized Type(s) Developed
  • model to represent your deployed data science and complex Spark ETL models (what was deployed, which version, when, what are its concrete artifacts, etc.)

Part 2: Deep source metadata & embedded entities
  Key Points
  • Use the Atlas REST API to customize any type, entity, or attribute you wish
  • You can use a hyperlinked entity (vs. text) as the value of an attribute (embedded entity pattern)
  • HDFS entities can hold deep metadata from source
  Customized Type(s) Developed
  • device to represent a hardware device (in this case a gene sequencing device)
  • gene_sequence to represent gene sequencing data landed in HDFS, as well as its source device and sequence run back in the lab

Part 3: Lineage beyond Hadoop, including reports & emails
  Key Points
  • Atlas REST API calls can be sent from any networked system
  • This allows metadata from that system to be pushed to Atlas
  • This allows entities beyond Hadoop to be represented natively in Atlas
  • Therefore, Atlas metadata, search and lineage can span the data and infrastructure landscape
  Customized Type(s) Developed
  • report_engine to represent a report generating software deployment
  • report to represent a report generated by the report engine
  • email to represent an email that has been sent, including a hyperlink to the report entity as an email attachment

Goals of this Article

Goals of this article are to:

  • Summarize: combine all of the previous article customizations and topics into a complex data pipeline/lineage example: genomic analytics pipeline from gene sequencing in the lab, multi-step genomic analytics on Hadoop, to report emailed to clinician
  • Demokit: provide a single-command shell script that builds 5 such pipelines in Atlas, which then allows you to explore Atlas' powerful customization, search, lineage and general governance capabilities. The demokit is available at this GitHub repo.

Background: Genomic Analytics Pipeline

A full genomic analytics pipeline is shown in the diagram below.

[Image: 100393-hcc4-pipeline.png]

Steps in the pipeline briefly are:

  1. [Lab] Device sequences blood sample and outputs sequence data to structured file of base pair sequences (often FASTQ format) and metadata file describing sequencing run. Sequence data ingested to HDFS.
  2. [HDP/Hadoop] Primary analysis: sequence data at this point is structured as short segments that need to be aligned into chromosomal segments based on a reference genome. This is performed by a Spark-BWA model. Output is in BAM file format, saved to HDFS.
  3. [HDP/Hadoop] Secondary analysis: base pairs that vary from the norm are identified and structured as location and variant in a VCF formatted file saved to HDFS. This is performed by a Spark GATK model.
  4. [HDP/Hadoop] Tertiary analysis: predictions are made based on variants identified in previous step. Example here is disease risk. Input is VCF file and file with annotations that provide features (e.g. environmental exposure) for predictive model. Output is risk prediction represented as risk and probability, typically in simple csv format saved to HDFS.
  5. [reporting] Simple csv is converted to consumable report by reporting engine.
  6. [reporting] Report is archived and attached to email which is sent to clinician to advise on next steps for patient who provided sample in step 1.
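
The steps above can be sketched as a chain of processes, each consuming inputs and producing outputs, which is also how lineage is walked backwards from the email to the lab device. This is a minimal in-memory sketch only: the labels in `STAGES` are illustrative names chosen here, not the entity names the demokit creates, and nothing in it talks to Atlas.

```python
from collections import defaultdict

# Each stage is (process label, input labels, output labels).
# All labels are illustrative assumptions, not demokit entity names.
STAGES = [
    ("sequencing run", ["device"], ["gene_sequence (FASTQ)"]),
    ("primary analysis: Spark-BWA", ["gene_sequence (FASTQ)"], ["aligned reads (BAM)"]),
    ("secondary analysis: Spark GATK", ["aligned reads (BAM)"], ["variants (VCF)"]),
    ("tertiary analysis: risk model", ["variants (VCF)", "annotations"], ["risk prediction (csv)"]),
    ("report_engine", ["risk prediction (csv)"], ["report"]),
    ("send email", ["report"], ["email"]),
]


def upstream(target, stages=STAGES):
    """Return every label the target is derived from, nearest first."""
    # Map each output to the inputs of the process that produced it.
    producers = defaultdict(list)
    for _process, inputs, outputs in stages:
        for out in outputs:
            producers[out].extend(inputs)

    # Breadth-first walk back through the producers.
    seen, queue = [], list(producers[target])
    while queue:
        node = queue.pop(0)
        if node not in seen:
            seen.append(node)
            queue.extend(producers[node])
    return seen


if __name__ == "__main__":
    for label in upstream("email"):
        print(label)
```

Calling `upstream("email")` walks from the email back through the report, the csv, the VCF and annotation files, the BAM file and the raw sequence data, ending at the sequencing device.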

This will be represented in Atlas search and lineage as below (which is elaborated in the rest of the article).

[Image: 100395-hcc4-lineage.png]

Demokit

The demokit repo provides instructions, which are quite simple: 1) set up a cluster (or sandbox), 2) on your local machine, export two environment variables and then run one script with no input params.

Running the demokit generates 5 such pipeline/lineage instances. If we do an unfiltered search on the gene_sequence type, for example, we get the results below. Clicking on the name of any search result allows a view of a single lineage as shown above.

[Image: 100397-demokit-search-geneseq.png]

Customized Atlas Entities in Genomic Analytics Pipeline/Lineage

The diagram below shows how customized types are represented in the pipeline/lineage.

[Image: 99472-hcc4-lineage-annotated-2.png]

The table that follows elaborates on each customized type.

Customized Type/Entity | Entity represents: [platform] | Searchable Attributes | Article #
device | gene sequencing device [lab] | deviceDecomDate, deviceId, deviceImplemDate, deviceMake, deviceModel, deviceType, name | 2
gene_sequence | raw sequence data ingested from device output [hadoop] | device (embedded, device), deviceQualifiedName, name, path, runEndTime, runReads, runSampleId, runStartTime, runTechnician | 2
model | models used in primary, secondary, tertiary analytics [hadoop] | deployDate, deployHostDetail, deployHostType, deployObjSource, modelDescription, modelEndTime, modelName, modelOwnerLob, modelRegistryUrl, modelStartTime, modelVersion, name | 1
report_engine | engine that generates report [reporting platform] | name, reportEngHost, reportEngRegistryUrl, reportEngType, reportEngVersion | 3
report | generated report [reporting platform] | name, reportEndTime, reportFilename, reportName, reportStartTime, reportStorageUrl, reportVersion | 3
email | email sent to doctor, with report attachment [reporting platform] | emailAttachment (embedded, report), emailBcc, emailCc, emailDate, emailFrom, emailSubject, emailTo, name | 3
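
To make the table more concrete, here is a rough sketch of the kind of JSON bodies that could define the device type and create a gene_sequence entity with an embedded device reference. The endpoint paths in the docstrings are the Atlas v2 REST paths. Everything else is an assumption for illustration: the `DataSet` supertype, the `string` attribute types, the helper names, and the qualified names in the usage note. The actual definitions are in the earlier articles and the demokit repo.

```python
import json

# Attribute names are taken from the table above. "name" is omitted because it
# is inherited from the supertype rather than defined on the custom type.
DEVICE_ATTRIBUTES = [
    "deviceDecomDate", "deviceId", "deviceImplemDate",
    "deviceMake", "deviceModel", "deviceType",
]


def attribute_def(name, type_name="string"):
    """One attribute definition. Typing everything as string is an assumption."""
    return {
        "name": name,
        "typeName": type_name,
        "isOptional": True,
        "cardinality": "SINGLE",
        "isUnique": False,
        "isIndexable": True,  # indexable so the attribute is searchable
    }


def device_typedef():
    """Body for POST /api/atlas/v2/types/typedefs (supertype is assumed)."""
    return {
        "entityDefs": [{
            "name": "device",
            "superTypes": ["DataSet"],  # assumption, not confirmed by the article
            "attributeDefs": [attribute_def(a) for a in DEVICE_ATTRIBUTES],
        }]
    }


def gene_sequence_entity(qualified_name, name, path, device_qualified_name):
    """Body for POST /api/atlas/v2/entity using the embedded entity pattern."""
    return {
        "entity": {
            "typeName": "gene_sequence",
            "attributes": {
                "qualifiedName": qualified_name,
                "name": name,
                "path": path,
                "deviceQualifiedName": device_qualified_name,
                # Reference to an existing device entity instead of plain text.
                "device": {
                    "typeName": "device",
                    "uniqueAttributes": {"qualifiedName": device_qualified_name},
                },
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(device_typedef(), indent=2))
```

For example, `gene_sequence_entity("/data/seq/run1.fastq@demo", "run1.fastq", "/data/seq/run1.fastq", "iSeq100-0001@lab")` returns a body whose `device` attribute points at the device entity by its qualified name (all four values are made-up placeholders). The functions only build dictionaries; sending them requires an HTTP client and credentials for your own Atlas host.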

Atlas Search Examples

The following are examples of searches you can run against the pipelines (shown here as pseudocode). Run the demokit and try the examples yourself.

  • all pipelines where gene_sequence.runTechnician=Wenwan_Jiao
  • all pipelines where email.emailTo=DrSmith@thehospital.com
  • all pipelines where gene_sequence.deviceQualifiedName contains 'iSeq100' (model of device)
  • all pipelines where model.modelName=genomics-HAIL and model.modelStartTime >= '01/14/2019 12:00 AM' and model.modelStartTime <= '01/21/2019 12:00 AM'
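
As a hedged illustration, the pseudocode searches above could be expressed as request bodies for the Atlas basic search endpoint (POST /api/atlas/v2/search/basic). The helper names `criterion` and `basic_search` are invented here. The date strings are copied from the pseudocode as placeholders; how a date must be encoded depends on how the attribute was typed in your type definition.

```python
import json


def criterion(attribute, operator, value):
    """One attribute filter for a basic search."""
    return {"attributeName": attribute, "operator": operator, "attributeValue": value}


def basic_search(type_name, *criteria, condition="AND", limit=25):
    """Build a basic search body that combines the criteria with AND or OR."""
    body = {"typeName": type_name, "excludeDeletedEntities": True, "limit": limit}
    if criteria:
        body["entityFilters"] = {"condition": condition, "criterion": list(criteria)}
    return body


# The four pseudocode searches, in order.
by_technician = basic_search(
    "gene_sequence", criterion("runTechnician", "eq", "Wenwan_Jiao"))
by_recipient = basic_search(
    "email", criterion("emailTo", "eq", "DrSmith@thehospital.com"))
by_device_model = basic_search(
    "gene_sequence", criterion("deviceQualifiedName", "contains", "iSeq100"))
by_model_window = basic_search(
    "model",
    criterion("modelName", "eq", "genomics-HAIL"),
    # Placeholder date strings; the required encoding is an open assumption.
    criterion("modelStartTime", "gte", "01/14/2019 12:00 AM"),
    criterion("modelStartTime", "lte", "01/21/2019 12:00 AM"),
)

if __name__ == "__main__":
    print(json.dumps(by_model_window, indent=2))
```

Each search returns entities of one type; from any result you can open its lineage to see the rest of the pipeline it belongs to.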

Keep in mind that Atlas search can involve multiple constructs and can become quite complex. Search can be conducted from:

  • the UI as basic search (using the funnel icon is the most powerful)
  • the UI as advanced search (DSL)
  • the REST API
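
For the advanced search (DSL) route over the REST API, a minimal sketch of building the request URL follows. The host below is a placeholder, the helper name `dsl_search_url` is invented here, and the query string is one plausible DSL rendering of the first pseudocode example, not something taken from the demokit.

```python
from urllib.parse import urlencode


def dsl_search_url(base_url, query, limit=25, offset=0):
    """Build a GET URL for the Atlas DSL search endpoint."""
    params = urlencode({"query": query, "limit": limit, "offset": offset})
    return f"{base_url.rstrip('/')}/api/atlas/v2/search/dsl?{params}"


# Placeholder host; 21000 is the usual default Atlas port.
ATLAS_URL = "http://atlas.example.com:21000"

url = dsl_search_url(ATLAS_URL, 'gene_sequence where runTechnician = "Wenwan_Jiao"')

if __name__ == "__main__":
    print(url)
```

The same query text can be pasted into the advanced search box in the UI, which makes it easy to test a query interactively before scripting it.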

Conclusion

I hope these articles have given you an appreciation for how easily customizable Atlas is to represent metadata and lineage across your data and infrastructure landscape, and how powerful it is to search against it.

Keep in mind that we have not even covered classification (tags), tag-based Ranger policies and business glossary. These additional capabilities cement Atlas as a powerful tool to understand and manage the growing and complex world of data you live in.

Atlas is an outstanding governance tool to understand and manage your data landscape at scale ... and to easily customize governance specifically to your needs while seamlessly integrating with Atlas' out of the box search, lineage, classification and business glossary capabilities.

The only thing holding you back is your imagination.


Last update: 08-17-2019 04:52 AM