Member since: 06-20-2016
Posts: 488
Kudos Received: 432
Solutions: 118
02-26-2019
04:34 PM
Hi Jim, use the log4j library; it has appender configurations that define how logs rotate. Log4j is pretty standard in the Java world. Here is a good tutorial: https://www.journaldev.com/10689/log4j-tutorial
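For illustration, a minimal log4j 1.x properties configuration using a RollingFileAppender that rotates by size might look like the sketch below (my example, not from the tutorial; adjust the file path, max size and backup count to your application):
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
# hypothetical log path; point this at your application's log directory
log4j.appender.file.File=/var/log/myapp/app.log
# rotate once the file reaches 10MB, keeping 5 rotated backups
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=5
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p %c{1} - %m%n
Log4j also ships a DailyRollingFileAppender if you would rather rotate by time than by size.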
01-30-2019
07:21 PM
4 Kudos
Customizing Atlas: Summary of Work to Date

Article | Key Points | Customized Type(s) Developed
---|---|---
Part 1: Model governance, traceability and registry | Quick primer on Atlas types, entities, attributes, lineage, search. Quick primer on customizing Atlas. Use the Atlas REST API to customize any type, entity, or attribute you wish. Customizations integrate seamlessly with out-of-the-box Atlas lineage and search capabilities. Notes on operationalizing. | model: represents your deployed data science and complex Spark ETL models (what was deployed, which version, when, what are its concrete artifacts, etc.)
Part 2: Deep source metadata & embedded entities | Use the Atlas REST API to customize any type/entity/attributes you wish. You can use a hyperlinked entity (vs text) as the value of an attribute (embedded entity pattern). HDFS entities can hold deep metadata from the source. | device: represents a hardware device (in this case a gene sequencing device). gene_sequence: represents gene sequencing data landed in HDFS, as well as its source device and sequence run back in the lab
Part 3: Lineage beyond Hadoop, including reports & emails | The Atlas REST API can be called from any networked system, so metadata from that system can be pushed to Atlas and entities beyond Hadoop can be represented natively in Atlas. Therefore Atlas metadata, search and lineage can span the data and infrastructure landscape. | report_engine: represents a report-generating software deployment. report: represents a report generated by the report engine. email: represents an email that has been sent, including a hyperlink to the report entity as an email attachment

Goals of this Article

The goals of this article are to:
Summarize: combine all of the previous articles' customizations and topics into a single complex data pipeline/lineage example: a genomic analytics pipeline running from gene sequencing in the lab, through multi-step genomic analytics on Hadoop, to a report emailed to a clinician.
Demokit: provide a single-command shell script that builds 5 such pipelines in Atlas, which then lets you explore Atlas' powerful customization, search, lineage and general governance capabilities. The demokit is available at this github repo.

Background: Genomic Analytics Pipeline

A full genomic analytics pipeline is shown in the diagram below. Briefly, the steps in the pipeline are:
[Lab] A device sequences a blood sample and outputs sequence data to a structured file of base pair sequences (often FASTQ format) plus a metadata file describing the sequencing run. The sequence data is ingested to HDFS.
[HDP/Hadoop] Primary analysis: at this point the sequence data is structured as short segments that need to be aligned into chromosomal segments against a reference genome. This is performed by a Spark-BWA model. Output is a BAM file saved to HDFS.
[HDP/Hadoop] Secondary analysis: base pairs that vary from the norm are identified and structured as location and variant in a VCF-formatted file saved to HDFS. This is performed by a Spark GATK model.
[HDP/Hadoop] Tertiary analysis: predictions are made based on the variants identified in the previous step; the example here is disease risk. Inputs are the VCF file and an annotation file that provides features (e.g. environmental exposure) for the predictive model. Output is a risk prediction represented as risk and probability, typically a simple CSV saved to HDFS.
[Reporting] The simple CSV is converted into a consumable report by the reporting engine.
[Reporting] The report is archived and attached to an email which is sent to the clinician to advise on next steps for the patient who provided the sample in step 1.
This will be represented in Atlas search and lineage as shown below (and elaborated in the rest of the article).

Demokit

The demokit repo provides instructions, which are quite simple: 1) set up a cluster (or sandbox), 2) on your local machine, export two environment variables and then run one script with no input parameters (a hedged sketch is shown below). Running the demokit generates 5 such pipeline/lineage instances. If we do an unfiltered search on the gene_sequence type, for example, we get the results below. Clicking the name of any search result opens a view of a single lineage as shown above.
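For orientation, a demokit run looks roughly like this. The two environment variables match the ones used by the curl commands throughout these articles; the script name is hypothetical, so check the repo README for the actual file name.
# hedged sketch of a demokit run (script name is hypothetical; see the repo README)
export ATLAS_HOST=<your-atlas-or-sandbox-host>
export ATLAS_UU_PWD=<atlas-user>:<atlas-password>
./build-demo.sh    # builds 5 pipeline/lineage instances in Atlas, no input params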
Customized Atlas Entities in Genomic Analytics Pipeline/Lineage

The diagram below shows how the customized types are represented in the pipeline/lineage. The table that follows elaborates on each customized type.

Customized Type/Entity | Entity represents [platform] | Searchable Attributes | Article #
---|---|---|---
device | gene sequencing device [lab] | deviceDecomDate, deviceId, deviceImplemDate, deviceMake, deviceModel, deviceType, name | 2
gene_sequence | raw sequence data ingested from device output [hadoop] | device (embedded, device), deviceQualifiedName, name, path, runEndTime, runReads, runSampleId, runStartTime, runTechnician | 2
model | models used in primary, secondary, tertiary analytics [hadoop] | deployDate, deployHostDetail, deployHostType, deployObjSource, modelDescription, modelEndTime, modelName, modelOwnerLob, modelRegistryUrl, modelStartTime, modelVersion, name | 1
report_engine | engine that generates report [reporting platform] | name, reportEngHost, reportEngRegistryUrl, reportEngType, reportEngVersion | 3
report | generated report [reporting platform] | name, reportEndTime, reportFilename, reportName, reportStartTime, reportStorageUrl, reportVersion | 3
email | email sent to doctor, with report attachment [reporting platform] | emailAttachment (embedded, report), emailBcc, emailCc, emailDate, emailFrom, emailSubject, emailTo, name | 3

Atlas Search Examples

The following are examples of searches you can run against the pipelines (pseudocode here). Run the demokit and try the examples yourself.
all pipelines where gene_sequence.runTechnician = Wenwan_Jiao
all pipelines where email.emailTo = DrSmith@thehospital.com
all pipelines where gene_sequence.deviceQualifiedName contains 'iSeq100' (model of device)
all pipelines where model.modelName = genomics-HAIL and model.modelStartTime >= '01/14/2019 12:00 AM' and model.modelStartTime <= '01/21/2019 12:00 AM'
Keep in mind that Atlas search can combine multiple constructs and can become quite complex. A hedged sketch of running such a search through the REST API is shown below.
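As an illustration (not part of the demokit), a simplified version of the model search above, filtering on modelName only, could be run against the Atlas v2 DSL search endpoint; the exact DSL syntax may vary slightly by Atlas version:
curl -u ${ATLAS_UU_PWD} -G "http://${ATLAS_HOST}:21000/api/atlas/v2/search/dsl" \
  --data-urlencode "query=from model where modelName = 'genomics-HAIL'"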
Search can be conducted from:
the UI as basic search (the funnel icon is the most powerful)
the UI as advanced search (DSL)
the REST API

Conclusion

I hope these articles have given you an appreciation for how easily Atlas can be customized to represent metadata and lineage across your data and infrastructure landscape, and how powerful it is to search against it. Keep in mind that we have not even covered classification (tags), tag-based Ranger policies and the business glossary. These additional capabilities cement Atlas as a powerful tool to understand and manage the growing and complex world of data you live in. Atlas is an outstanding governance tool to understand and manage your data landscape at scale ... and to easily customize governance specifically to your needs while seamlessly integrating Atlas' out-of-the-box search, lineage, classification and business glossary capabilities. The only thing holding you back is your imagination 🙂
12-27-2018
09:02 PM
4 Kudos
Introduction

In Customizing Atlas (Part 1): Model governance, traceability and registry I provided a brief overview of Atlas types and entities and showed how to customize them to fit your needs, using the specific example of a model type for governing your deployed data science models and complex Spark code. In Customizing Atlas (Part 2): Deep source metadata and embedded entities I showed how to customize Atlas to hold knowledge of ingested data that goes deeper than the data itself, e.g. details of the device that generated the data. I also showed how to implement the pattern of embedding an entity (not a string) as an attribute value in your custom type; the result is a clickable hyperlink in the UI that opens that entity and its metadata.

In this post I will:
show how to extend your Atlas lineage to include processing and outputs on non-Hadoop systems
represent the above as a single lineage that connects data in Hadoop to a reporting system which generates a report and sends an email with the report attached
emphasize a key principle about Atlas: because its REST API and customized types allow metadata to be sent and represented from any system, Atlas can centralize metadata from your entire data landscape

Concepts and Example

Main concept: the Atlas lineage includes processing and outputs on systems beyond Hadoop.

Example (reporting system): the reporting system inputs HDFS data, outputs a report and archives it, then attaches the report to an email and sends the email. This is shown in the diagram below.

Implementation

The reportGenerator type is implemented as follows (one-time operation):
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
"enumDefs": [],
"structDefs": [],
"classificationDefs": [],
"entityDefs": [
{
"superTypes": ["Process"],
"name": "reportGenerator",
"typeVersion": "1.0",
"attributeDefs": [
{
"name": "qualifiedName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "inputs",
"typeName": "array<DataSet>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "outputs",
"typeName": "array<DataSet>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "reportGenRegistryUrl",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportGenVersion",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportGenType",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportGenHost",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
}
]
}
]
}'
The reportGenerator entity is implemented as follows:
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "reportGenerator",
"attributes": {
"qualifiedName": "reportProcessor-v2.4@reportserver.genomiccompany.com",
"name": "disease-risk-report-v1.3",
"inputs": [{"uniqueAttributes": {"qualifiedName": "/data/genomics/variants/sample-AB15423@prodCluster"}, "typeName": "hdfs_path"}],
"outputs": [
{"uniqueAttributes": {"qualifiedName": "disease-risk-gen-variance@AB15423.pdf"}, "typeName": "report"},
{"uniqueAttributes": {"qualifiedName": "joesmith@company.com@2018-11-12_09:54:12.432"}, "typeName": "email"}
],
"reportGenRegistryUrl": "https://git@github.com/reportengines/genomics/predictive-general",
"reportGenVersion": "2.4",
"reportGenType": "variant-disease-risk",
"reportGenHost": "reportserver.genomiccompany.com"
}
}
]
}'
The report type is implemented as follows (one-time operation):
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
"enumDefs": [],
"structDefs": [],
"classificationDefs": [],
"entityDefs": [
{
"superTypes": ["DataSet"],
"name": "report",
"typeVersion": "1.0",
"attributeDefs": [
{
"name": "qualifiedName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": true,
"isOptional": false,
"isIndexable": true
},
{
"name": "owner",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportVersion",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportFilename",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportStorageURL",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportStartTime",
"typeName": "date",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportEndTime",
"typeName": "date",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
}
]
}
]
}'
The report entity is implemented as follows:
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "report",
"attributes": {
"qualifiedName": "disease-risk-gen-variance@AB15423.pdf",
"owner": "jobscheduler",
"name": "disease-risk-gen-variance",
"reportName": "genomics disease risk report - sample AB15423",
"reportVersion": "1.1",
"reportFilename": "genomics-disease-AB15423.pdf",
"reportStorageURL": "s3://genomics-disease/AB15423.pdf",
"reportStartTime": "2018-11-12T09:54:12.432Z",
"reportEndTime": "2018-11-12T09:54:14.341Z"
}
}
]
}'
The email type is implemented as follows (one-time operation):
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
"enumDefs": [],
"structDefs": [],
"classificationDefs": [],
"entityDefs": [
{
"superTypes": ["DataSet"],
"name": "email",
"typeVersion": "1.0",
"attributeDefs": [
{
"name": "qualifiedName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": true,
"isOptional": false,
"isIndexable": true
},
{
"name": "owner",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "emailTo",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "emailFrom",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "emailCc",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "emailBcc",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "emailSubject",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "emailAttachments",
"typeName": "array<DataSet>",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "emailDate",
"typeName": "date",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
}
]
}
]
}'
The email entity is implemented as follows:
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "email",
"attributes": {
"qualifiedName": "joesmith@company.com@2018-11-12_09:54:12.432",
"owner": "jobscheduler",
"name": "email ",
"emailTo": "drsmith@thehospital.com",
"emailFrom": "me@genomicscompany.com",
"emailCc": "archives@thehospital.com",
"emailBcc": "",
"emailSubject": "genomics disease risk report - patient AB15423",
"emailAttachments": [{"uniqueAttributes": {"qualifiedName": "disease-risk-gen-variance@AB15423_r-AA345744.pdf"}, "typeName": "report"}],
"emailDate": "2018-11-12T09:54:14.000Z"
}
}
]
}'
Results in Atlas UI

Lineage: the lineage shows the HDFS input file and the processing and output on the reporting system.

Report generator: when we click the blue gear we see the full metadata on the reporting engine, including the host machine and the URL to its artifacts (e.g. deployed binary, code, etc.). This is shown in the screenshot below.

Report: from the lineage, when we click the report output we see the full metadata on the generated report, including filename, archive location and creation time. This is shown in the screenshot below.

Email: from the lineage, when we click the email output we see familiar information about an email, including to, from, cc, subject and date. This is shown in the screenshot below. Note that the attachment field shows a clickable link to the attached report entity; clicking it leads to the same report screen shown above.

Summary: What have we accomplished?

We can:
show a single lineage of data processing extending from Hadoop and continuing on non-Hadoop systems
represent metadata of processing and outputs on non-Hadoop systems, in this example report engines, reports and emails
The ideas here can be generalized to represent lineage and metadata of processing on any non-Hadoop system. Keep in mind also that you can continue the lineage across multiple systems both upstream and downstream from Hadoop, e.g. external -> Hadoop -> external -> external.

Key points:
Atlas is hosted on Hadoop, but its REST API allows you to send processing knowledge from systems beyond Hadoop
Customized Atlas types let you integrate external system knowledge natively in Atlas
As a result, you can represent knowledge of your full data ecosystem in Atlas, including search and lineage
So ... go out and tame your data landscape with centralized metadata in Atlas that reaches well beyond Hadoop!

References
Atlas Overview
Using Apache Atlas on HDP 3.0
Apache Atlas
Atlas Type System
Atlas REST API
12-24-2018
02:57 PM
6 Kudos
Introduction

In the previous post, Customizing Atlas (Part 1): Model governance, traceability and registry, we:
provided a brief overview of Atlas types and entities
showed how to customize Atlas types and entities to fit your own needs and appear in Atlas search and lineage
customized a special type called model, which inherited from Process and empowered Atlas to govern the deployment of data science models
commented on operationalizing custom entities

In this post we will:
customize an Atlas type to represent deep source metadata (metadata beyond the source data itself)
customize an Atlas type to represent devices (metadata about the actual device that generates data)
embed the device entity in the deep source entity (make the device entity an attribute value in the deep source metadata)
show how device as an attribute value is a clickable link in the Atlas UI that opens the full device entity

Concepts and Example

Main Concepts: the main concepts are summarized in the table below.

Example: Gene sequence data ingest. We will use the following scenario to represent these ideas.
a genomics company has multiple gene sequencing devices
a technician conducts a run on the device, which outputs a blood sample's gene sequence, which in turn is ingested to HDFS
in Atlas, metadata for each device is instantiated as a device entity (the device type inherits from Infrastructure)
in Atlas, metadata for each gene sequence is instantiated as a gene_sequence entity (the gene_sequence type inherits from hdfs_path)
each gene_sequence entity holds deep source metadata (in addition to metadata about the file on HDFS, also metadata about the device that generated the sequence and the specific run on the device, e.g. the technician's name)
gene_sequence has a metadata attribute called device, which holds the actual device entity (not a string)

Implementation

The device type is implemented as follows (one-time operation):
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
"enumDefs": [],
"structDefs": [],
"classificationDefs": [],
"entityDefs": [
{
"superTypes": ["Infrastructure"],
"name": "device",
"typeVersion": "1.0",
"attributeDefs": [
{
"name": "qualifiedName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": true,
"isOptional": false,
"isIndexable": true
},
{
"name": "owner",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": true,
"isOptional": false,
"isIndexable": true
},
{
"name": "deviceId",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deviceType",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deviceModel",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deviceMake",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deviceImplemDate",
"typeName": "date",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deviceDecomDate",
"typeName": "date",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": true,
"isIndexable": true
}
]
}
]
}'
The device entity is instantiated as follows:
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "device",
"attributes": {
"qualifiedName": "Illumina-iSeq100-1092454@gene_sequencer",
"owner": "infra-group",
"name": "Illumina-iSeq100-1092454",
"deviceId": "1092454",
"deviceType": "gene_sequencer",
"deviceModel": "iSeq100",
"deviceMake": "Illumina",
"deviceImplemDate": "2018-08-21T19:49:24.000Z",
"deviceDecomDate": ""
}
}
]
}'
The gene_sequence type is implemented as follows (one-time operation):
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
"enumDefs": [],
"structDefs": [],
"classificationDefs": [],
"entityDefs": [
{
"superTypes": ["hdfs_path"],
"name": "gene_sequence",
"typeVersion": "1.0",
"attributeDefs": [
{
"name": "qualifiedName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": true,
"isOptional": false,
"isIndexable": true
},
{
"name": "owner",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": true,
"isOptional": false,
"isIndexable": true
},
{
"name": "device",
"typeName": "device",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deviceQualifiedName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "runSampleId",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "runReads",
"typeName": "int",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "runStartTime",
"typeName": "date",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "runEndTime",
"typeName": "date",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "runTechnician",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
}
]
}
]
}'
The gene_sequence entity is instantiated as follows:
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "gene_sequence",
"attributes": {
"clusterName": "prod.genomicsanalytics.com",
"isFile": "true",
"fileSize": "2793872046",
"createdBy": "nifi",
"createTime": "2018-11-12T15:10:03.235Z",
"qualifiedName": "/data/sequence-pipeline/device-output/AB12357@prod.genomicsanalytics.com",
"owner": "jobscheduler",
"name": "/data/sequence-pipeline/device-output/AB12357",
"path": "hdfs://data/sequence-pipeline/device-output/AB12357",
"device": {"uniqueAttributes": {"qualifiedName": "Illumina-iSeq100-1092454@gene_sequencer"}, "typeName": "device"},
"deviceQualifiedName": "Illumina-iSeq100-1092454@gene_sequencer",
"runSampleId": "AB12357",
"runReads": "9",
"runTechnician": "Neeraj Gupta",
"runStartTime": "2018-11-12T09:54:12.432Z",
"runEndTime": "2018-11-12T15:09:59.351Z"
}
}
]
}'
Results in Atlas UI

Search for gene_sequence entities: we can now search the gene_sequence type and see results (only one result in this example; I used the 'Columns' dropdown to customize the result columns). Notice the device shown as a hyperlink.

Drill down to all metadata of a single entity: let's first click the name to get the full list of metadata, both inherited from hdfs_path and customized for gene_sequence. Note that we see the standard hdfs_path properties (like fileSize, path, etc.; for ease of development I did not fill in all the values here, as this would be done on ingest to the cluster). We also see the device metadata and the run metadata.

Drill down to device metadata: if we click the link to 'device' (from either of the two places in the screenshots above) we see the following.

A note on search and embedded entities: you'll notice that our embedded customized entity 'device' does not show up in search: we cannot directly search by the attributes of an embedded entity (though we can search by our customized attributes like runStartTime that use native Atlas types). This is the reason I have used the device qualified name (a string) as an attribute. Notice how it is constructed: <make>-<model>-<id>@<type>. This allows us to use the search construct 'contains' to find all gene_sequence entities (i.e. data in HDFS) that match a device's make, model, id or type, or a combination of these. A hedged sketch of such a search through the REST API is shown below.
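As a sketch (my payload, assuming the Atlas v2 basic-search endpoint and its 'contains' operator), such a search could also be issued through the REST API:
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" \
  -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/search/basic -d '{
  "typeName": "gene_sequence",
  "entityFilters": {
    "attributeName": "deviceQualifiedName",
    "operator": "contains",
    "attributeValue": "iSeq100"
  }
}'
This would return every gene_sequence entity whose deviceQualifiedName mentions the iSeq100 model, mirroring the 'contains' search described above.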
Summary: What have we accomplished?

This example: we now have deep knowledge of any gene sequence landed in HDFS. We know:
the gene sequence HDFS path, file size, ingest time, etc.
which device generated the gene sequence data back in the lab
details of the sample run that was sequenced on the device: technician's name, sample id, how long it took to run the sample, etc.

Generalizability: the ideas here can be generalized to capture any source metadata that goes deeper than the data itself, and to embed any custom entity as a clickable attribute value in another entity. Use your imagination ... or rather, govern your data deeply.

References
Atlas Overview
Using Apache Atlas on HDP 3.0
Apache Atlas
Atlas Type System
Atlas REST API

Related Previous Posts
Customizing Atlas (Part 1): Model governance, traceability and registry
Generalized Framework to Deploy Models and Integrate Apache Atlas for Model Governance

Acknowledgements
Appreciation to @eorgad and @Hari rongali for awesome collaboration in the 2018 NE Hackathon, which generated many ideas, including those in this post (and for taking first place!).
12-14-2018
02:40 PM
6 Kudos
Problem Statement: Deploying and Governing Models

Machine Learning and Artificial Intelligence are exploding in importance and prevalence in the enterprise. With this explosive growth come fundamental challenges in governing model deployments ... and doing so at scale. These challenges revolve around answering the following fundamental questions:
Which models were deployed, when, and to where? Was the deployment to a microservice, a Spark context on Hadoop, or something else?
What was the serialized object deployed? How can I find it?
What version was deployed? Who is the owner? What is the larger context around the project?
How do I know the details of the model, i.e. how do I trace the model in production to its actual code, training data, owner, etc.?

Previous article: why and how you should use Atlas to govern your models
Article: Customizing Atlas (Part 1): Model governance, traceability and registry. In that article I showed how Atlas is a powerful and natural fit for storing and searching model and deployment metadata. The main features of the Atlas model metadata developed there are:
searchable metadata of model deployments
searchable metadata of the models that were deployed
traceability of deployed models to a model registry that holds concrete model artifacts (code, training data, serialized model used in deployment, project README.md file, etc.)
data lineage for deployed models that transform data in data pipelines
no lineage generated for models deployed in a request-response context like microservices, which output predictions and have high throughput of data inputs

This article: Generalized Framework to Deploy Models with Apache Atlas for Model Governance. In this article I present an overarching deployment framework that implements this Atlas governance of models and thus allows stakeholders to answer the above questions as the number of deployed models proliferates. Think of the prevalence of ML and AI one, two, five years from now.

The Framework

Personas: the personas involved in the model deployment-governance framework are shown below with their actions.
Model owner: stages model artifacts in a defined structure and provides an overview of the model and project in a Read.me file.
Operations: launches automation that deploys the model, copies artifacts from staging to the model registry and creates a model entity in Atlas for this deployment.
Multiple stakeholders (data scientist, data steward, compliance, production issue troubleshooters, etc.): use Atlas to answer fundamental questions about deployed models and to access concrete artifacts of those models.

Deployment-Governance Framework: details of the deployment-governance framework and the persona interactions with it are shown below.

Step 1: Model owner stages the model artifacts. This includes:
code and training data
README.md file describing the project
metadata.txt with key-value pairs (model.name=<value>, model.type=<>, model.version=<>, model.description=<>, ...)
serialized model for deployment (PMML, MLeap bundle, other)

Step 2: Operations deploys the model via an orchestrator automation. This automation:
2a: retrieves model artifacts from staging
2b: deploys the serialized model
2c: copies artifacts to the model repository (the orchestrator has been aggregating metadata from the previous steps)
2d: creates a new model entity in Atlas using the aggregated metadata

Step 3: Use Atlas to understand deployed models.
The result of a deployment is a model entity created in Atlas (see Customizing Atlas (Part 1): Model governance, traceability and registry for details).
The key capability is Atlas' powerful search against the metadata of deployed models, as shown in the diagram above.
Drilling down into a model entity in the search results provides an understanding of the deployment and the model owner/project, and provides traceability to the concrete model artifacts in the model registry.

Deployment-Governance Framework: Simple Implementation

I show below how to implement the deployment framework. Important point: I have chosen the technologies shown below for a simple demonstration of the framework. Except for Atlas, the technology implementations are your choice. For example, you could deploy your model to Spark on Hadoop instead of to a microservice, or you could use PMML instead of MLeap to serialize your model. In short: this framework is a template and, except for Atlas, the technologies are swappable.

Setting up your environment

MLeap: follow the instructions here to set up a dockerized MLeap runtime: http://mleap-docs.combust.ml/mleap-serving/
HDP: create an HDP cluster sandbox using these instructions.
Atlas model type: when your HDP cluster is running, create your Atlas model type by running:
#!/bin/bash
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
"enumDefs": [],
"structDefs": [],
"classificationDefs": [],
"entityDefs": [
{
"superTypes": ["Process"],
"name": "model",
"typeVersion": "1.0",
"attributeDefs": [
{
"name": "qualifiedName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "inputs",
"typeName": "array<DataSet>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "outputs",
"typeName": "array<DataSet>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "deploy.datetime",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deploy.host.type",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deploy.host.detail",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deploy.obj.source",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.version",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.type",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.description",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.owner",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.owner.lob",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.registry.url",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
}
]
}
]
}'
See Customizing Atlas (Part 1): Model governance, traceability and registry for details.

Running the framework

See the GitHub repo README.md for details on running: https://github.com/gregkeysquest/ModelDeployment-microservice. The main points are shown below.

Staging (GitHub)

See the repo https://github.com/gregkeysquest/Staging-ModelDeploy-v1.0 for details. Main points:
the MLeap bundle (serialized model) is in the path /executable
the file modelMetadata.txt holds metadata about the model that will be pushed to the Atlas model entity; its contents are shown below
model.owner = Greg Keys
model.owner.lob = pricing
model.name = rental pricing prediction
model.type = gradient boosting regression
model.version = 1.1
model.description = model predicts monthly price of rental if property is purchased
model.microservice.endpoint=target
Orchestrator (Groovy calling shell scripts)

The core code for the Groovy orchestrator is shown below.
//STEP 1: retrieve artifacts
println "[STEP 1: retrieve artifacts] ..... downloading repo to tmp: repo=${repo} \n"
processBuilder = new ProcessBuilder("shellScripts/fetchRepo.sh",
repo,
repoCreds,
repoRoot).inheritIO().start().waitFor()
//metadata aggregation
println "[metadata aggregation] ..... gathering model metadata from repo \n "
ModelMetadata.loadModelMetadata(repo,localRepo)
//STEP 2: deploy serialized model
def modelExecutable=new File("tmp/${repo}/executable").listFiles()[0].getName()
println "[STEP 2: deploy serialized model] ..... deploying model to microservice: modelToDeploy=${modelExecutable} \n "
processBuilder = new ProcessBuilder("shellScripts/deployModel.sh",
repo,
deployHostPort,
modelExecutable).inheritIO().start().waitFor()
//STEP 3: put artifacts to registry
def modelRegistryPath="hdfs://${hdfsHostName}:8020${hdfsRegistryRoot}/${repo}"
println "[STEP 3: put artifacts to registry] ..... copying tmp to model registry: modelRegistryPath=${modelRegistryPath} \n "
processBuilder = new ProcessBuilder("shellScripts/pushToRegistry.sh",
repo,
modelRegistryPath,
devMode.toString()).inheritIO().start().waitFor()
//metadata aggregation
println "[metadata aggregation] ..... gathering model deploy metadata \n "
ModelMetadata.loadDeployMetadata(modelRegistryPath,
modelExecutable,
deployHostPort,
deployHostType)
//STEP 4: create Atlas model entity
println "[STEP 4: create Atlas model entity] ..... deploying Atlas entity to ${atlasHost} \n "
processBuilder = new ProcessBuilder("shellScripts/createAtlasModelEntity.sh",
atlasCreds,
atlasHost,
ModelMetadata.deployQualifiedName,
ModelMetadata.deployName,
ModelMetadata.deployDateTime,
ModelMetadata.deployEndPoint,
ModelMetadata.deployHostType,
ModelMetadata.modelExecutable,
ModelMetadata.name,
ModelMetadata.type,
ModelMetadata.version,
ModelMetadata.description,
ModelMetadata.owner,
ModelMetadata.ownerLob,
ModelMetadata.registryURL
)
Notice how the steps map directly to the Deployment-Governance Framework diagram above, and how metadata is processed and aggregated in two steps: one for model metadata and the other for deployment metadata. The code for processing and aggregating metadata is shown here:
class ModelMetadata {
static metadataFileLocation = "staging/modelMetadata.txt"
static Properties props = null
static repo = ""
static owner = ""
static ownerLob = ""
static name = ""
static type = ""
static version = ""
static description = ""
static endpoint = ""
static registryURL = ""
static modelExecutable = ""
static deployEndPoint = ""
static deployHostType = ""
static deployDateTime = ""
static deployName = ""
static deployQualifiedName = ""
static void loadModelMetadata(repo, localRepo){
this.repo = repo
props = new Properties()
def input = new FileInputStream(localRepo +"/modelMetadata.txt")
props.load(input)
this.owner = props.getProperty("model.owner")
this.ownerLob = props.getProperty("model.owner.lob")
this.name = props.getProperty("model.name")
this.type = props.getProperty("model.type")
this.version = props.getProperty("model.version")
this.description = props.getProperty("model.description")
this.endpoint = props.getProperty("model.microservice.endpoint")
}
static loadDeployMetadata(modelRegistryPath, modelExecutable, deployHostPort, deployHostType) {
this.deployDateTime = new Date().format('yyyy-MM-dd_HH:mm:ss', TimeZone.getTimeZone('EST'))+"EST"
this.deployName = "${this.name} v${this.version}"
this.deployQualifiedName = "${this.deployName}@${deployHostPort}".replace(' ', '-')
this.registryURL=modelRegistryPath
this.modelExecutable=modelExecutable
this.deployEndPoint = "http://${deployHostPort}/${this.endpoint}"
this.deployHostType = deployHostType
}
}
Shell Scripts

Each shell script called by the orchestrator is shown in the code blocks below.

Step 1: fetch staging (maps to 2a in diagram)
#!/bin/bash
# script name: fetchRepo.sh
echo "calling fetchRepo.sh"
REPO=$1
REPO_CRED=$2
REPO_ROOT=$3
# create tmp directory to store staging
mkdir -p tmp
cd tmp
# fetch staging and unzip
curl -u $REPO_CRED -L -o $REPO.zip http://github.com/$REPO_ROOT/$REPO/zipball/master/
unzip $REPO.zip
# rename to simplify downstream processing
mv ${REPO_ROOT}* $REPO
# remove zip
rm $REPO.zip
echo "finished fetchRepo.sh"
Step 2: deploy model (maps to 2b in diagram)
#!/bin/bash
# script name: deployModel.sh
echo "starting deployModel.sh"
REPO=$1
HOSTPORT=$2
EXECUTABLE=$3
# copy executable to staging area used to deploy to the target
echo "copying executable to load path with command: cp tmp/${REPO}/executable/* loadModel/"
mkdir loadModel
cp tmp/$REPO/executable/* loadModel/
# simplify special string characters
Q="\""
SP="{"
EP="}"
# create json for curl string
JSON_PATH="${SP}${Q}path${Q}:${Q}/models/${EXECUTABLE}${Q}${EP}"
# create host for curl string
URL="http://$HOSTPORT/model"
# run curl string
echo "running command: curl -XPUT -H \"content-type: application/json\" -d ${JSON_PATH} ${URL}"
curl -XPUT -H "content-type: application/json" -d $JSON_PATH $URL
echo "finished deployModel.sh"
Step 3: copy staging to model repository (maps to 2c in diagram)
#!/bin/bash
# script name: pushToRegistry.sh
## Note: for ease of development there is a local mode to write to the local file system instead of HDFS
echo "calling pushToRegistry.sh"
REPO_LOCAL=$1
HDFS_TARGET=$2
DEV_MODE=$3
cd tmp
echo "copying localRepository=${REPO_LOCAL} to hdfs modelRegistryPath=${HDFS_TARGET}"
if [ "$DEV_MODE" = "true" ]; then
MOCK_REGISTRY="../mockedHDFSModelRegistry"
echo "NOTE: in dev mode .. copying from ${REPO_LOCAL} to ${MOCK_REGISTRY}"
mkdir $MOCK_REGISTRY
cp -R $REPO_LOCAL $MOCK_REGISTRY/
else
sudo hdfs dfs -put $REPO_LOCAL $HDFS_TARGET
fi
echo "finished pushToRegistry.sh"
Step 4: create Atlas model entity (maps to 2d in diagram)
#!/bin/bash
# script name: createAtlasModelEntity.sh
echo "starting createAtlasModelEntity.sh"
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
echo "running command: curl -u ${ATLAS_UU_PWD} -ik -H \"Content-Type: application/json\" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d (ommitting json)"
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "model",
"attributes": {
"qualifiedName": "'"${3}'"",
"name": "'"${4}"'",
"deploy.datetime": "'"${4}"'",
"deploy.host.type": "'"${5}"'",
"deploy.host.detail": "'"${6}"'",
"deploy.obj.source": "'"${7}"'",
"model.name": "'"${8}"'",
"model.type": "'"${9}"'",
"model.version": "1.1",
"model.description": "'"${10}"'",
"model.owner": "'"${11}"'",
"model.owner.lob": "'"${12}"'",
"model.registry.url": "'"${13}"'"
}
}
]
}'
echo "finished createAtlasModelEntity.sh"
Summary: What have we accomplished?

We have:
designed a generalized deployment framework for models that integrates and leverages Atlas as a centralized governance tool for these deployments; one key component is the orchestrator, which aggregates metadata across the process steps and then passes it to Atlas
built upon the implementation and ideas developed in the previous article
presented a simple implementation using the technologies shown above
Remember the key point that the deployment framework presented here is generalizable: except for Atlas you can plug in your choice of technologies for the orchestration, staging, model hosting and model repository, including elaborating the framework into a formal software development framework of your choice.

References
Customizing Atlas (Part 1): Model governance, traceability and registry
Atlas brief
Atlas deep
Groovy
GitHub
MLeap

Acknowledgements
Appreciation to the Hortonworks Data Science SME groups for their feedback on this idea. Particular appreciation to @Ian B and @Willie Engelbrecht for their deeper attention and interest.
12-12-2018
05:08 PM
15 Kudos
Problem Statement: Model Governance

Data science and model building are prevalent activities that bring new and innovative value to enterprises. The more prevalent this activity becomes, the more problematic model governance becomes. Model governance typically centers on these questions:
What models were deployed? When? Where?
What was the serialized object deployed?
Was the deployment to a microservice, a Spark context, or something else?
What version was deployed?
How do I trace the deployed model to its concrete details: the code, its training data, owner, Read.me overview, etc.?

Apache Atlas is the central tool for organizing, searching and accessing metadata of data assets and processes on your Hadoop platform. Its REST API can accept metadata pushed from anywhere, so Atlas can also represent metadata from off your Hadoop cluster. Atlas lets you define your own types of objects and inherit from existing out-of-the-box types. This lets you store whatever metadata you want to store, and tie it into Atlas's powerful search, classification and taxonomy framework.

In this article I show how to create a custom Model object (or more specifically 'type') to manage model deployments the same way you govern the rest of your data processes and assets using Atlas. This custom Model type lets you answer all of the above questions for any model you deploy. And it does so at scale while your data science or complex Spark transformation models explode in number and you transform your business to enter the new data era. In a subsequent article I implement the Atlas work developed here into a larger model deployment framework: https://community.hortonworks.com/articles/229515/generalized-model-deployment-framework-with-apache.html

A very brief primer on Atlas: types, entities, attributes, lineage and search

Core Idea: the diagram below represents the core concepts of Atlas: types, entities and attributes. (Let's save classification and taxonomy for another day.) A type is an abstract representation of an asset; it has a name and attributes that hold metadata on that asset. Entities are concrete instances of a type. For example, hive_table is a type that represents any Hive table in general. When you create an actual Hive table, a new hive_table entity is created in Atlas, with attributes like table name, owner, create time, columns, external vs managed, etc. Atlas comes out of the box with many types, and services like Hive have hooks to Atlas that auto-create and modify entities. You can also create your own types (via the Atlas UI or REST API). After that, you are in charge of instantiating entities, which is easy to do via the REST API called from your job scheduler, deploy script or both.

System-Specific Types and Inheritance: Atlas types are organized around the inheritance model shown below. Out-of-the-box types like hive_table inherit from it, and when you create customized types you should too. The most commonly used parent types in Atlas are DataSet (which represents any type and level of stored data) and Process (which represents a transformation of data).

Lineage: notice that Process has an attribute for an array of one or more input DataSets and another for output DataSets. This is how Process creates lineages from data processed into new data, as shown below.

Search: now that Atlas is filled with types, entities and lineages, how do you make sense of it all? Atlas has extremely powerful search constructs that let you find entities by attribute values (you can assemble AND/OR constructs among the attributes of a type, using equals, contains, etc.). And of course, anything performed in the UI can be done through the REST API.

Customizing Atlas for model governance

My approach: I first review a customized Model type and then show how to implement it.
Implementation comes in two steps: (1) create the custom Model type, and then (2) instantiate it with Model entities as they are deployed in your environment. I make a distinction between models that are (a) deployed on Hadoop in a data pipeline processing architecture (e.g. complex Spark transformation or data engineering models) and (b) deployed in a microservices or machine learning environment. In the first, data lineage makes sense (there is a clear input, transformation, output pipeline), whereas in the second it does not (it is more of a request-response model with a high throughput of requests). I also show the implementation first as hard-coded examples and then as an operational example where values are dynamic at deploy time. In a subsequent article I implement the customized Model type in a fully automated model deployment and governance framework.

Customized Atlas Type: Model

The customized model type is shown in the diagram below. You can of course exclude the attributes shown or include new ones as appropriate for your needs. Key features are:
deploy.*: attributes starting with deploy describe metadata around the model deployment runtime
deploy.datetime: the date and time the model was deployed
deploy.host.type: type of hosting environment for the deployed model (e.g. microservice, hadoop)
deploy.host.detail: specifically where the model was deployed (e.g. microservice endpoint, hadoop cluster)
deploy.obj.source: location of the serialized model that was deployed
model.*: attributes describing the model that was deployed (most are self-explanatory)
model.registry.url: provides traceability to model details; points to the model registry holding model artifacts including code, training data, Read.me by owner, etc.

Step 1: Create customized model type (one-time operation)

Use the REST API by running the below curl command with its JSON payload.
#!/bin/bash
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
"enumDefs": [],
"structDefs": [],
"classificationDefs": [],
"entityDefs": [
{
"superTypes": ["Process"],
"name": "model",
"typeVersion": "1.0",
"attributeDefs": [
{
"name": "qualifiedName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "inputs",
"typeName": "array<DataSet>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "outputs",
"typeName": "array<DataSet>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "deploy.datetime",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deploy.host.type",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deploy.host.detail",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deploy.obj.source",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.version",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.type",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.description",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.owner",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.owner.lob",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.registry.url",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
}
]
}
]
}'
}'
Notice we are (a) using superType 'Process', (b) giving the type the name 'model', and (c) creating new attributes in the same attributeDefs construct as those inherited from Process.

Step 1 result: when we go to the Atlas UI we see the 'model' type listed with the other types, and we see the customized attribute fields in the Columns dropdown.

Step 2: Create model entity (each time you deploy a model or a new model version)

Example 1: With lineage (for clear input/process/output processing of data). Notice two DataSets (type 'hdfs_path') are input to the model and one is output, as identified by their Atlas guids.
#!/bin/bash
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "model",
"attributes": {
"qualifiedName": "model:disease-risk-HAIL-v2.8@ProdCluster",
"name": "disease-risk-HAIL-v2.8",
"deploy.datetime": "2018-12-05_15:26:41EST",
"deploy.host.type": "hadoop",
"deploy.host.detail": "ProdCluster",
"deploy.obj.source": "hdfs://prod0.genomicscompany.com/model-registry/genomics/disease-risk-HAIL-v2.8/Docker",
"model.name": "disease-risk-HAIL",
"model.type": "Spark HAIL",
"model.version": "2.8",
"model.description": "disease risk prediction for sequenced blood sample",
"model.owner": "Srinivas Kumar",
"model.owner.lob": "genomic analytics group",
"model.registry.url": "hdfs://prod0.genomicscompany.com/model-registry/genomics/disease-risk-HAIL-v2.8",
"inputs": [
{"guid": "cf90bb6a-c946-48c8-aaff-a3b132a36620", "typeName": "hdfs_path"},
{"guid": "70d35ffc-5c64-4ec1-8c86-110b5bade70d", "typeName": "hdfs_path"}
],
"outputs": [{"guid": "caab7a23-6b30-4c66-98f1-b2319841150e", "typeName": "hdfs_path"}]
}
}
]
}'
}'
Example 2: No lineage (for request-response models, e.g. microservices or ML scoring). Similar to the above, but with no inputs or outputs specified.
#!/bin/bash
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "model",
"attributes": {
"qualifiedName": "model:fraud-persloan-model-v1.1@https://service.bankcompany.com:6532/fraud",
"name": "fraud-persloan-model-v1.1",
"deploy.datetime": "2018-10-22_22:01:41EST",
"deploy.host.type": "microservice",
"deploy.host.detail": "https://service.bankcompany.com:6532/fraud",
"deploy.obj.source": "hdfs://prod-nn.bankcompany.com/model-registry/personal-loans/fraud-persloan-model-v1.1/fraud.persloan.lr.zip",
"model.name": "fraud-persloan-model",
"model.type": "Spark ML Bayesian learning nn",
"model.version": "1.1",
"model.description": "fraud detection for personal loan application",
"model.owner": "Beth Johnson",
"model.owner.lob": "personal loans",
"model.registry.url": "hdfs://prod-nn.bankcompany.com/model-registry/personal-loans/fraud-persloan-model-v1.1"
}
}
]
}'
Step 2 result: now we can search the 'model' type in the UI and see the results (below). When we click on 'disease-risk-HAIL-v2.8' we see the attribute values, and when we click on Relationships and then on a DataSet we see the lineage (below). After clicking Relationships we see the image on the left; after then clicking a DataSet in the relationship, we see the lineage on the right (below). For models deployed with no input and output values, the result is similar but no Relationships or Lineage is created.

A Note on Operationalizing

The above entity creation was done using hardcoded values. In an operational environment, however, these values will be created dynamically for each model deployment (entity creation). In this case the values are gathered by the orchestrator or deploy script, or both, and passed to the curl command. It will look something like this.
#!/bin/bash
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "model",
"attributes": {
"qualifiedName": "model:'"${3}"'@'"${6}"'",
"name": "'"${3}"'",
"deploy.datetime": "'"${4}"'",
"deploy.host.type": "'"${5}"'",
"deploy.host.detail": "'"${6}"'",
"deploy.obj.source": "'"${7}"'",
"model.name": "'"${8}"'",
"model.type": "'"${9}"'",
"model.version": "'"${10}"'",
"model.description": "'"${11}"'",
"model.owner": "'"${12}"'",
"model.owner.lob": "'"${13}"'",
"model.registry.url": "'"${14}"'"
}
}
]
}'
Do notice the careful use of single and double quotes around each shell script variable above: the enclosing single quotes break and then re-establish the JSON string, and the enclosing double quotes allow for spaces inside the variable values. A small illustration of the pattern is shown below.
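Here is a small, self-contained illustration of that quoting pattern (my example, not from the deployment script):
# the single quotes end the JSON literal, the double-quoted variable is expanded,
# then the single quotes resume the JSON literal
NAME="disease-risk-HAIL v2.8"
echo '{"name": "'"${NAME}"'"}'
# prints: {"name": "disease-risk-HAIL v2.8"}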
Atlas Overview: Using Apache Atlas on HDP 3.0, Apache Atlas, Atlas Type System, Atlas REST API
Misc HCC articles:
https://community.hortonworks.com/articles/136800/atlas-entitytag-attribute-based-searches.html https://community.hortonworks.com/articles/58932/understanding-taxonomy-in-apache-atlas.html https://community.hortonworks.com/articles/136784/intro-to-apache-atlas-tags-and-lineage.html https://community.hortonworks.com/articles/36121/using-apache-atlas-to-view-data-lineage.html https://community.hortonworks.com/articles/63468/atlas-rest-api-search-techniques.html (but v1 of API) https://community.hortonworks.com/articles/229220/adding-atlas-classification-tags-during-data-inges.html Implementing the ideas here into a model deployment framework: https://community.hortonworks.com/articles/229515/generalized-model-deployment-framework-with-apache.html Acknowledgements Appreciation to the Hortonworks Data Governance and Data Science SME groups for their feedback on this idea. Particular appreciation to @Ian B and @Willie Engelbrecht for their deep attention and interest.
... View more
11-21-2018
08:17 PM
10 Kudos
Introduction Naïve Bayes is a machine learning model that is simple and
computationally light yet also accurate in classifying text as compared to more
complex ML models. In this article I
will use the Python scikit-learn libraries to develop the model. It is developed on a Zeppelin notebook on top
of the Hortonworks Data Platform (HDP) and uses its %spark2.pyspark interpreter
to run python on top of Spark. We will use a news feed to train the model to classify
text. We will first build the basic
model, then explore its data and attempt to improve the model. Finally, we will compare performance accuracy
of all the models we develop. A note on code structuring: Python import statements are
introduced when required by the code and not all at once upfront. This is to
relate the packages closely to the code itself. The Zeppelin template for the full notebook can be obtained from here. A. Basic Model Get the data We will use news feed data that has been classified as
either 1 (World), 2 (Sports), 3 (Business) or 4 (Sci/Tech). The data is structured as CSV with fields:
class, title, summary. (Note that later processing converts the labels to 0,1,2,3 respectively). Engineer the
data The class, title and summary fields are appended to their
own arrays. Data cleansing is done in
the form of punctuation removal and conversion to lowercase. Convert the
data The scikit-learn packages need these arrays
to be represented as dataframes, which assign an integer index to each row of the array inside the data structure. Note: this first model will use summaries to classify text, as shown in the code. (A rough stand-in for the data preparation is sketched below.)
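The notebook code appears as screenshots in the original post; as a rough stand-in, the preparation might look like the sketch below (the file name and the exact cleansing rule are assumptions).
import re
import pandas as pd

# Hypothetical file name; layout is class,title,summary with no header row
raw = pd.read_csv("newsfeed_train.csv", names=["class", "title", "summary"])

def clean(text):
    """Remove punctuation and convert to lowercase, as described above."""
    return re.sub(r"[^\w\s]", " ", str(text)).lower()

summaries = raw["summary"].apply(clean)   # this first model classifies summaries
titles    = raw["title"].apply(clean)
labels    = raw["class"] - 1              # convert labels 1-4 to 0-3, as noted above

# scikit-learn accepts these pandas structures directly
data = pd.DataFrame({"text": summaries, "label": labels})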
Vectorize the data
Now we start using the machine learning packages. Here we convert the dataframes into a
vector. The vector is a wide n x m matrix
with n records and for each record m fields that hold a position for each word
detected among all records, and the word frequency for that record and position. This is a sparse matrix since most m
positions are not filled. We see from the output that there are 7600 records and 20027 words. The vector shown in the output is partial, showing part of the 0th record with word index positions 11624, 6794, 6996, etc. (A minimal sketch of this step follows.)
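Continuing the preparation sketch above, the vectorization step in minimal form:
from sklearn.feature_extraction.text import CountVectorizer

# Build the sparse n x m term-count matrix described above
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(data["text"])

print(X_counts.shape)               # (number of records, number of distinct words)
print(len(vectorizer.vocabulary_))  # size of the learned vocabulary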
Fit (train) the model and determine its accuracy
Let's fit the model; a sketch follows below. Note that we split the data into a training set with 80% of the records and a validation set with the remaining 20%. Wow! 87% of our tests accurately predicted the text classification. That's good.
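Continuing the sketches above (80/20 split as described):
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# 80% train / 20% validation split of the count vectors and labels
X_train, X_test, y_train, y_test = train_test_split(
    X_counts, data["label"], test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)

print("validation accuracy:", model.score(X_test, y_test))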
Deeper dive on results
We can look more deeply than the single accuracy score reported above. One way is to generate a confusion matrix, as shown below (a sketch of its computation follows). The confusion matrix shows, for any single true class, the proportion of predictions made against each predicted class. We see that Sports text almost always had its classification predicted correctly (0.96). Business and Sci/Tech were a bit more blurred: when Business text was incorrectly predicted, it was usually predicted as Sci/Tech, and the converse for Sci/Tech. This all makes sense, since Sports vocabulary is quite distinctive and Sci/Tech is often in the Business news. There are other views of model outcomes ... check the sklearn.metrics API.
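Continuing the sketches above; the row normalization is done by hand so the snippet does not depend on a newer scikit-learn version:
import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

# Normalize each row so it shows, for each true class, the proportion
# predicted as each class (rows sum to 1.0)
cm_normalized = cm / cm.sum(axis=1, keepdims=True)
print(np.round(cm_normalized, 2))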
Test the model with a single news feed summary
Now take a single news feed summary and see what the model predicts (a sketch follows below). I have run many through the model and it performs quite well. The new text shown above gets a clear Business classification. When I run news summaries on cultural items (no category in the model), the predicted probabilities are low and spread across all categories, as expected.
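Continuing the sketches above; the sample text and the class-name ordering are assumptions:
# Hypothetical new summary to classify
new_text = "Shares of the retailer jumped after quarterly earnings beat expectations"

new_counts = vectorizer.transform([clean(new_text)])

# Assumed ordering of the four classes behind labels 0-3
class_names = ["World", "Sports", "Business", "Sci/Tech"]

probs = model.predict_proba(new_counts)[0]
for name, p in zip(class_names, probs):
    print(f"{name}: {p:.3f}")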
B. Explore the data
Word and n-gram counts
Let's get the top 25 words and n-grams (here, two-word phrases) among all training set text that were used to build the
model. These are shown below. Hmm ... there are a lot of common meaningless words
involved. Most of these are known as
stopwords in natural language processing.
Let’s remove the stop words and see if the model improves. Remove
stopwords: retrieve list from file and fill array The above are stopwords from a file. The below allows you to iteratively add
stopwords to the list as you explore the data. Word and
n-gram counts with stopwords removed Now we can see the top 25 words and n-grams after the
stopwords are removed. Note how easy
this is to do: we instantiate the CountVectorizer exactly as before, but by
passing a reference to the stopword list (see the sketch below). This is a good example of how powerful the scikit-learn libraries are: you interact with the high-level APIs and the dirty work is done under the surface.
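Continuing the sketches above; the stopword file name is an assumption, and ngram_range=(1, 2) is used so two-word phrases are counted as well:
from sklearn.feature_extraction.text import CountVectorizer

# Load stopwords from a file, one word per line (hypothetical file name)
with open("stopwords.txt") as f:
    stop_words = [line.strip() for line in f if line.strip()]

# Same vectorizer as before, but with the stopword list passed in
vectorizer_ns = CountVectorizer(stop_words=stop_words, ngram_range=(1, 2))
X_ns = vectorizer_ns.fit_transform(data["text"])

# Top 25 most frequent terms across the training text
totals = X_ns.sum(axis=0).A1                    # total count per term
terms = vectorizer_ns.get_feature_names_out()   # older scikit-learn: get_feature_names()
top25 = sorted(zip(terms, totals), key=lambda t: -t[1])[:25]
print(top25)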
C. Try to improve the model
Now we train the same model on news feed summaries with no stop words (left) and on titles with no stop words. Interesting ... the model using summaries with no stop words is as accurate as the one with them included in the text. Secondly, the titles model is less accurate
than the summary model, but not by much (not bad for classifying text from
samples of only 10-20 words).
D. Comparison of model accuracies
I trained each of the below models 5 times: news feed text from summaries (no stops, with stops) and from titles (no stops, with stops). I averaged the accuracies and plotted them as shown below (a plotting sketch follows).
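A sketch of the comparison chart; the accuracy values are placeholders, not the measured averages:
import matplotlib.pyplot as plt

# Placeholder averages of 5 runs each; substitute the measured values
models = ["summary\n(with stops)", "summary\n(no stops)",
          "title\n(with stops)", "title\n(no stops)"]
avg_accuracy = [0.87, 0.87, 0.84, 0.84]

plt.bar(models, avg_accuracy)
plt.ylabel("average accuracy (5 runs)")
plt.title("Naive Bayes text classifier: model comparison")
plt.ylim(0.0, 1.0)
plt.show()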
E. Conclusions
Main conclusions are:
- Naive Bayes effectively classified text, particularly given the small text sizes (news titles, news summaries of 1-4 sentences)
- Summaries classified with more accuracy than titles, but not by much considering how few attributes (words) are in a single title
- Removing stop words had no significant effect on model accuracy. This should not be surprising, because stop words are expected to be represented equally among classifications
- Recommended model: summary with stop words retained, because it has the highest accuracy and lower code complexity and processing needs than the equally accurate summary-with-stop-words-removed model
- This model is susceptible to overfitting, because the stories represent a window of time and their text (news words) is accordingly biased toward the "current" events of that time
F. References
Zeppelin: https://hortonworks.com/apache/zeppelin/
The Zeppelin notebook for this article: https://github.com/gregkeysquest/DataScience/tree/master/projects/naiveBayesTextClassifier/newsFeeds
http://rss.cnn.com/rss/cnn_topstories.rss
https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/
https://scikit-learn.org/stable/
https://matplotlib.org/
... View more
03-02-2018
09:07 PM
26 Kudos
Introduction Transparent
Data Encryption (TDE) encrypts HDFS data on disk ("at rest"). It works through an interaction of multiple Hadoop components and security
It works thorough an interaction of multiple Hadoop components and security
keys. Although the general idea may be easy to understand, major misconceptions
can occur and the detailed mechanics and many acronyms can get confusing.
Having an accurate and detailed understanding can be valuable, especially when
working with security stakeholders who often serve as strict gatekeepers in
allowing technology in the enterprise. Transparent Data
Encryption: What it is and is not TDE
encrypts HDFS data on disk in a way that is transparent to the user, meaning the user accesses HDFS data encrypted on disk identically to accessing
non-encrypted data. No knowledge or implementation changes are needed on the
client side and the user sees the data in its unencrypted form. TDE works at
the file path or encryption zone level: all files written to designated paths
(zones) are encrypted on disk. The
goal of TDE is to prevent anyone who is inappropriately trying to access data
from doing so. It guards against the threat, for example, of someone finding a
disk in a dumpster or stealing one, or someone poking around HDFS who is not a
user of the application or use case accessing the data. The goal of TDE is not to hide sensitive data elements
from authorized users (e.g. masking social security numbers). Ranger policies for authorization, column
masking and row-filtering are for that. Do
note that Ranger policies like masking can be applied on top of TDE data
because Ranger policies are implemented post-decryption. That again is the transparent part of TDE. The below chart provides a brief comparison of TDE with
related approaches to data security.
Data Security Approach | Description | Threat category |
---|---|---|
Encrypted drives & full disk encryption | Contents of the entire disk are encrypted. Access requires the proper authentication key or password. Accessed data is unencrypted. | Unintended access to hard drive contents. |
Transparent Data Encryption on Hadoop | Individual HDFS files are encrypted. All files in designated paths (zones) are encrypted, whereas those outside these paths are not. Access requires user, group or service inclusion in a policy against the zone. Accessed data is unencrypted. Authorization includes read and/or write for the zone. | Unintended access to designated HDFS data on the hard drive. |
Ranger access policies (e.g. authorization to HDFS paths, Hive column masking, row filtering, etc) | No relevance to encryption. Access requires user or group inclusion in an authorization policy. Accessed data is either the full file, table, etc, masked columns, or filtered rows. Authorization includes read, write, create table, etc. | Viewing HDFS data or data elements that are inappropriate for the user (e.g. social security number). |
Accessing TDE Data: High
Level View In Hadoop, users access HDFS through a service (HDFS, Hive, etc). The service accesses HDFS data through its own HDFS client, which is abstracted from the user. For files written to or read from encryption zones, the service's HDFS client contacts a Key Management Service (KMS) to validate the user's and/or service's read/write permissions to the zone and, if permitted, the KMS provides the master key to encrypt (write) or decrypt (read) the file. To decrypt a Hive table, for example, the encrypted zone would represent the path to the table data and the Hive service would need read/write
permissions to that zone. The KMS master key is called the Encryption Zone Key (EZK), and encrypting/decrypting involves two additional keys and a detailed sequence of events. This sequence is described below. Before that though, let's get to know
the acronyms and technology components that are involved and then put all of
the pieces together. One note on encryption zones: EZs can be nested within
parent EZs, i.e. an EZ can hold encrypted files and an EZ that itself holds
encrypted files and so on. (More on this
later). Acronym Soup
Let's simply name the acronyms before we understand them.
TDE: Transparent Data Encryption
EZ: Encryption Zone
KMS: Key Management Server (or Service). Ranger KMS is part of the Hortonworks stack.
HSM: Hardware Security Module
EZK: Encryption Zone Key
DEK: Data Encryption Key
EDEK: Encrypted Data Encryption Key
The Puzzle Pieces
In this section let's just look at the pieces of the puzzle
before putting them all together. Let’s
first look at the three kinds of keys and how they map to the technology
components on Hadoop that are involved in TDE. Key Types Three key types are involved
in encrypting and decrypting files. The main idea is that each file in an EZ is encrypted and
decrypted by a DEK. Each file has its
own unique DEK. The DEK will be encrypted into an EDEK and decrypted back to
the DEK by the EZK for the particular EZ the file belongs to. Each zone has its own single EZK. (We will see that EZKs can be rotated, in which case the new EZK retains its name but increments to a new version number.)
Hadoop Components
Five Hadoop components are involved in TDE.
Service HDFS client: Uses a file's DEK to encrypt/decrypt the file.
Name Node: Stores each file's EDEK (encrypted DEK) in the file's metadata.
Ranger KMS: Uses the zone's EZK to encrypt the DEK to an EDEK, or decrypt the EDEK back to the DEK.
Keystore: Stores each zone's EZK.
Ranger UI: UI tool for a special admin (keyadmin) to manage EZKs (create, name, rotate) and policies against the EZK (the users, groups and services that can use it to read and/or write files to the EZ).
Notes
Ranger KMS
and UI are part of the Hortonworks Hadoop stack, but 3rd-party KMSs may be used. The keystore is a Java keystore native to Ranger and thus software based. Compliance requirements like PCI, required by financial institutions, demand that keys are managed and stored in hardware (HSM), which is deemed more secure. Ranger KMS can integrate with SafeNet Luna
HSM in this case. Implementing TDE: Create
encryption zone Below are the steps to create an encryption zone. All files written to an encryption zone will
be encrypted. First, the keyadmin user leverages the Ranger UI to create
and name an EZK for a particular zone. Then
the keyadmin applies a policy against that key and thus zone (who can read
and/or write to the zone). The who here
are actual services (e.g. Hive) and users and groups, which can be synced from
an enterprise AD/LDAP instance. Next, an hdfs superuser runs the hdfs commands shown above to instantiate the zone in HDFS (a scripted sketch of this step follows below).
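The exact commands from the screenshot are not reproduced here. As a rough sketch (assuming the standard hdfs crypto CLI, hypothetical key and zone names, and an EZK already created by keyadmin in Ranger), the step can be scripted, for example from Python:
import subprocess

def run(cmd):
    """Run a CLI command, echo it, and fail if it returns non-zero."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Hypothetical names; the EZK itself is created and named by keyadmin in the Ranger UI
key_name = "finance_ezk"
zone_path = "/data/finance"

run(["hdfs", "dfs", "-mkdir", "-p", zone_path])              # create the directory
run(["hdfs", "crypto", "-createZone",                        # make it an encryption zone
     "-keyName", key_name, "-path", zone_path])
run(["hdfs", "crypto", "-listZones"])                        # confirm the zone exists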
At this point, files can be transparently written/encrypted to the zone and read/decrypted from it. Recall the importance of the transparent aspect:
applications (e.g. SQL client accessing Hive) need no special implementation to
encrypt/decrypt to/from disk because that implementation is abstracted away
from the application client. The details
of how that works are shown next. Write Encrypted File (Part
1): generate file-level EDEK and store on Name Node When a file is written to an encryption zone an EDEK (which
encrypts a DEK unique to the file) is created and stored in the Name Node
metadata for that file. This EDEK will
be involved with both writing the encrypted file and decrypting it when reading. When the service HDFS client tells the Name Node it wants to
write a file to the EZ, the Name Node requests the KMS to return a unique DEK
encrypted as an EDEK. The KMS does this by generating a unique DEK, pulling from the keystore the EZK for the file's
intended zone, and using this to encrypt the DEK to EDEK. This EDEK is returned to the NameNode and
stored along with the rest of the file’s metadata. Write Encrypted File (Part
2): decrypt EDEK to use DEK for encrypting file The immediate next step to the above is for the service HDFS
client to use the EDEK on the NameNode to expose its DEK to then encrypt the
file to HDFS. The process is shown in the diagram above. The KMS again uses the EZK, but this time to decrypt the EDEK to the DEK. When the
service HDFS client retrieves the DEK it uses it to encrypt the file blocks to
HDFS. (These encrypted blocks are
replicated as such). Read Encrypted File:
decrypt EDEK and use DEK to decrypt file Reading the encrypted file runs the same process as writing
(sending EDEK to KMS to return DEK). Now
the DEK is used to decrypt the file. The
decrypted file is returned to the user, who (transparently) need not know or do
anything special in accessing the file in HDFS.
The process is shown explicitly below. Miscellaneous Encrypt Existing Files
in HDFS Encrypting existing files in HDFS cannot be done where they
reside. The files must be copied to an
encryption zone to do so (distcp is an effective way to do this, but other
tools like NiFi or shell scripts with HDFS commands could be used). Deleting the original directory and renaming the encryption zone to its name leaves the end state identical to the
beginning but with the files encrypted. Rolling Keys Enterprises often require keys to roll (replace) at intervals,
typically 2 years or so. Rolling EZKs does
NOT
require existing encrypted files to be re-encrypted. It works as follows:
Keyadmin updates EZK via Ranger UI. Name is retained and KMS increments key to
new version. EDEKs on NameNodes are reencrypted (KMS decrypts
EDEK with old key, resulting DEK encrypted with new key, resulting EDEK placed
back on NameNode file metadata). All new writes and reads of existing files use
new EZK. TDE performance
benchmarking TDE does effect performance but it is not extreme. In general, writes are affected more than
reads, which is good because performance SLAs are usually focused more on users
reading data than users or batch processes writing it. Below is a representative benchmarking. Encrypting
Intermediate Map-Reduce or Tez Data Map-reduce and Tez write data to disk as intermediate steps
in a multi-step job (Tez less so than M-R and thus its faster performance). Encrypting this intermediate data can be done
by setting the configurations shown in this link. Encrypting intermediate data will add greater
performance overhead and is an especially cautious measure, since the data remains on disk for short durations of time and is much less likely to be compromised compared to
HDFS data. A Side Note on HBase Apache HBase has its own encryption at rest framework that is similar to HDFS TDE (see link). This is not officially supported on HDP. Instead, encryption at rest for HBase should use HDFS TDE as described in this article and specified here. References TDE general https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_security/content/ch_hdp-security-guide-hdfs-encryption.html https://hortonworks.com/blog/new-in-hdp-2-3-enterprise-grade-hdfs-data-at-rest-encryption/
slide https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html Ranger KMS https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_security/content/ranger-kms-admin-guide.html https://hadoop.apache.org/docs/stable/hadoop-kms/index.html TDE Implementation Quick
Steps https://community.hortonworks.com/content/supportkb/49505/how-to-correctly-setup-the-hdfs-encryption-using-r.html https://community.hortonworks.com/content/kbentry/42227/using-transparent-data-encryption-in-hdfs.html TDE Best Practices https://community.hortonworks.com/questions/74758/what-are-the-best-practices-around-hdfs-transparen.html
... View more
02-13-2018
01:51 AM
Thanks @Sreekanth Munigati ... that worked! s3a://demo/ does not work; s3a://demo/folder does!
... View more
02-12-2018
01:39 AM
Issue: The issue I am having is the one described here, but setting the two configs is not working for me (it seems to work for some and not others): https://forums.aws.amazon.com/message.jspa?messageID=768332 Below is the full description. Goal: I am writing this test query with a small data size to output results to S3. INSERT OVERWRITE DIRECTORY 's3a://demo/'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
select * from demo_table;
Notes:
This query runs when output to an HDFS directory. I can create an external table locally against S3 remotely ... so my configurations are working (CREATE EXTERNAL TABLE ... LOCATION 's3a://demo/'; works). Only when outputting a query to S3 do I get a failure (below). Error: The error when attempting the query to output to S3 is: 2018-02-12 01:12:58,790 INFO [HiveServer2-Background-Pool: Thread-363]: log.PerfLogger (PerfLogger.java:PerfLogEnd(177)) - </PERFLOG method=releaseLocks start=1518397978790 end=1518397978790 duration=0 from=org.apache.hadoop.hive.ql.Driver>
2018-02-12 01:12:58,791 ERROR [HiveServer2-Background-Pool: Thread-363]: operation.Operation (SQLOperation.java:run(258)) - Error running hive query:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:324)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:199)
at org.apache.hive.service.cli.operation.SQLOperation.access$300(SQLOperation.java:76)
at org.apache.hive.service.cli.operation.SQLOperation$2$1.run(SQLOperation.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hive.service.cli.operation.SQLOperation$2.run(SQLOperation.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2018-02-12 01:12:58,794 INFO [HiveServer2-Handler-Pool: Thread-106]: session.HiveSessionImpl (HiveSessionImpl.java:acquireAfterOpLock(342)) - We are setting the hadoop caller context to 5e6f48a9-7014-4d15-b02c-579557b5fb98 for thread HiveServer2-Handler-Pool: Thread-106
Additional note: The query writes the tmp files to 's3a://demo/' but then fails with the above error. Tmp files look like [hdfs@gkeys0 centos]$ hdfs dfs -ls -R s3a://demo/ drwxrwxrwx - hdfs hdfs 0 2018-02-12 02:12 s3a://demo/.hive-staging_hive_2018-02-12_02-08-27_090_2945283769634970656-1
drwxrwxrwx - hdfs hdfs 0 2018-02-12 02:12 s3a://demo/.hive-staging_hive_2018-02-12_02-08-27_090_2945283769634970656-1/-ext-10000
-rw-rw-rw- 1 hdfs hdfs 38106 2018-02-12 02:09 s3a://demo/.hive-staging_hive_2018-02-12_02-08-27_090_2945283769634970656-1/-ext-10000/000000_0
-rw-rw-rw- 1 hdfs hdfs 6570 2018-02-12 02:09 s3a://demo/.hive-staging_hive_2018-02-12_02-08-27_090_2945283769634970656-1/-ext-10000/000001_0 Am I missing a config to set, or something like that?
... View more
Labels:
Apache Hive