Created on 12-12-2018 05:08 PM - edited 09-16-2022 01:44 AM
Data science and model building are prevalent activities that bring new and innovative value to enterprises. The more prevalent this activity becomes, the more problematic model governance becomes. Model governance typically centers on these questions:
Apache Atlas is the central tool for organizing, searching and accessing metadata of data assets and processes on your Hadoop platform. Its REST API can push metadata from anywhere, so Atlas can also represent metadata that lives off your Hadoop cluster.
Atlas lets you define your own types of objects and inherit from existing out-of-the box types. This lets you store whatever metadata you want to store, and to tie this into Atlas's powerful search, classification and taxonomy framework.
In this article I show how to create a custom Model object (or more specifically, 'type') to manage model deployments the same way you govern the rest of your data processes and assets using Atlas. This custom Model type lets you answer all of the above questions for any model you deploy. And it does so at scale, while your data science or complex Spark transformation models explode in number and you transform your business to enter the new data era.
In a subsequent article I implement the Atlas work developed here into a larger model deployment framework: https://community.hortonworks.com/articles/229515/generalized-model-deployment-framework-with-apache...
The diagram below represents the core concepts of Atlas: types, entities and attributes. (Let's save the ideas of classification and taxonomy for another day.)
A type is an abstract representation of an asset. A type has a name and attributes that hold metadata on that asset. Entities are concrete instances of a type. For example, hive_table is a type that represents any hive_table in general. When you create an actual hive table, you will create a new hive_table entity in Atlas, with attributes like table name, owner, create time, columns, external vs managed, etc.
Atlas comes out of the box with many types, and services like Hive have hooks to Atlas to auto-create and modify entities in Atlas. You can also create your own types (via the Atlas UI or REST API). After this, you are in charge of instantiating entities, which is easy to do via the REST API called from your job scheduler, deploy script or both.
Atlas types are organized around the below inheritance model of types. Out of the box types like hive_table inherit from here and when you create customized types you should also.
The most commonly used parent types in Atlas are DataSet (which represents any type and level of stored data) and Process (which represents transformation of data).
Notice that Process has an attribute for an array of one or more input DataSets and another for output DataSets. This is how Process creates lineages of data processed to new data, as shown below.
Now that Atlas is filled with types, entities and lineages ... how do you make sense of it all?
Atlas has extremely powerful search constructs that let you find entities by attribute values (you can assemble AND/OR constructs among attributes of a type, using equals, contains, etc.). And of course, anything performed on the UI can be done through the REST API.
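As an illustration of searching by attribute value through the REST API, here is a minimal sketch. It assumes the Atlas v2 basic search endpoint (POST /api/atlas/v2/search/basic) and its entityFilters payload with a single criterion. The function names build_search_payload and search_atlas are hypothetical helpers, not part of Atlas, and the credentials, host and attribute values in the example call are placeholders.

```shell
#!/bin/bash
# Build the JSON body for an Atlas basic search that filters one type by one attribute.
# build_search_payload is a hypothetical helper name, not part of Atlas.
# Note: values are inserted as-is, so they must not contain double quotes.
build_search_payload() {
  local type_name=$1 attr_name=$2 operator=$3 attr_value=$4
  printf '{"typeName":"%s","excludeDeletedEntities":true,"entityFilters":{"condition":"AND","criterion":[{"attributeName":"%s","operator":"%s","attributeValue":"%s"}]}}' \
    "$type_name" "$attr_name" "$operator" "$attr_value"
}

# Send the search to Atlas. Arguments: user:password, host, then the four
# arguments of build_search_payload. Assumes Atlas listens on port 21000.
search_atlas() {
  local atlas_uu_pwd=$1 atlas_host=$2
  shift 2
  curl -u "${atlas_uu_pwd}" -k -H "Content-Type: application/json" -X POST \
    "http://${atlas_host}:21000/api/atlas/v2/search/basic" \
    -d "$(build_search_payload "$@")"
}

# Example call (placeholder values):
#   search_atlas admin:admin atlas.example.com hive_table owner eq etl_user
```

Additional criteria can be added to the criterion array, and the condition field switched between AND and OR, to mirror the constructs available in the UI.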
My approach: I first review a customized Model type and then show how to implement it. Implementation comes in two steps: (1) create the custom Model type, and then (2) instantiate it with Model entities as they are deployed in your environment.
I make a distinction between models that are (a) deployed on Hadoop in a data pipeline processing architecture (e.g. complex Spark transformation or data engineering models) and (b) deployed in a microservices or Machine Learning environment. In the first case, data lineage makes sense (there is a clear input, transformation, output pipeline), whereas in the second it does not (it is more of a request-response model with high-throughput requests).
I also show the implementation as hard-coded examples and then as an operational example where values are dynamic at deploy-time. In a subsequent article I implement the customized Model type in a fully automated model deployment and governance framework.
The customized model type is shown in the diagram below. You can of course exclude shown attributes or include new ones as you feel appropriate for your needs.
Key features are:
Use the REST API by running the curl command below with its JSON construct.
#!/bin/bash
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
  "enumDefs": [],
  "structDefs": [],
  "classificationDefs": [],
  "entityDefs": [
    {
      "superTypes": ["Process"],
      "name": "model",
      "typeVersion": "1.0",
      "attributeDefs": [
        { "name": "qualifiedName", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true },
        { "name": "name", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true },
        { "name": "inputs", "typeName": "array<DataSet>", "isOptional": true, "cardinality": "SET", "valuesMinCount": 0, "valuesMaxCount": 2147483647, "isUnique": false, "isIndexable": false, "includeInNotification": false },
        { "name": "outputs", "typeName": "array<DataSet>", "isOptional": true, "cardinality": "SET", "valuesMinCount": 0, "valuesMaxCount": 2147483647, "isUnique": false, "isIndexable": false, "includeInNotification": false },
        { "name": "deploy.datetime", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true },
        { "name": "deploy.host.type", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true },
        { "name": "deploy.host.detail", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true },
        { "name": "deploy.obj.source", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true },
        { "name": "model.name", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true },
        { "name": "model.version", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true },
        { "name": "model.type", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true },
        { "name": "model.description", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true },
        { "name": "model.owner", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true },
        { "name": "model.owner.lob", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true },
        { "name": "model.registry.url", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }
      ]
    }
  ]
}'
Notice we are (a) using superType 'Process', (b) giving the type the name 'model', and (c) creating new attributes in the same attributeDefs construct as those inherited from Process.
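Every custom string attribute in the typedef repeats the same settings, so if you add or remove attributes often, the attributeDefs entries can be generated rather than typed by hand. The following is only a sketch of that idea: string_attr_def and string_attr_defs are hypothetical helper names, and the settings they emit are copied from the string attributes in the typedef above.

```shell
#!/bin/bash
# string_attr_def is a hypothetical helper: it prints one attributeDef with the
# same settings the custom string attributes in the typedef use.
string_attr_def() {
  printf '{ "name": "%s", "typeName": "string", "cardinality": "SINGLE", "isUnique": false, "isOptional": false, "isIndexable": true }' "$1"
}

# Print the attributeDefs for a list of attribute names, separated by commas,
# ready to be placed inside the "attributeDefs": [ ... ] array.
string_attr_defs() {
  local sep="" name
  for name in "$@"; do
    printf '%s' "$sep"
    string_attr_def "$name"
    sep=", "
  done
}

# Example: generate the entries for two of the custom attributes.
string_attr_defs model.name model.version
```

The inputs and outputs attributes use different settings (array<DataSet>, cardinality SET), so they would still be written out explicitly.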
When we go to the Atlas UI we see the 'model' type listed with the other types, and we see the customized attribute fields in the Columns drop-down.
Example 1: With lineage (for clear input/process/output processing of data)
Notice two DataSets (type 'hdfs_path') are inputted to the model and one is outputted, as identified by their Atlas guid.
#!/bin/bash
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
  "entities": [
    {
      "typeName": "model",
      "attributes": {
        "qualifiedName": "model:disease-risk-HAIL-v2.8@ProdCluster",
        "name": "disease-risk-HAIL-v2.8",
        "deploy.datetime": "2018-12-05_15:26:41EST",
        "deploy.host.type": "hadoop",
        "deploy.host.detail": "ProdCluster",
        "deploy.obj.source": "hdfs://prod0.genomicscompany.com/model-registry/genomics/disease-risk-HAIL-v2.8/Docker",
        "model.name": "disease-risk-HAIL",
        "model.type": "Spark HAIL",
        "model.version": "2.8",
        "model.description": "disease risk prediction for sequenced blood sample",
        "model.owner": "Srinivas Kumar",
        "model.owner.lob": "genomic analytics group",
        "model.registry.url": "hdfs://prod0.genomicscompany.com/model-registry/genomics/disease-risk-HAIL-v2.8",
        "inputs": [
          {"guid": "cf90bb6a-c946-48c8-aaff-a3b132a36620", "typeName": "hdfs_path"},
          {"guid": "70d35ffc-5c64-4ec1-8c86-110b5bade70d", "typeName": "hdfs_path"}
        ],
        "outputs": [{"guid": "caab7a23-6b30-4c66-98f1-b2319841150e", "typeName": "hdfs_path"}]
      }
    }
  ]
}'
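The guids in the inputs and outputs arrays have to come from somewhere. One possible approach, sketched below, is to look each DataSet up by its qualifiedName and then assemble the reference array from the returned guids. This assumes the Atlas v2 unique-attribute lookup endpoint (GET /api/atlas/v2/entity/uniqueAttribute/type/{typeName} with an attr:qualifiedName query parameter); the helper names build_refs and get_entity_by_qualified_name are hypothetical, and parsing the guid out of the response is left to whatever JSON tool you have available.

```shell
#!/bin/bash
# Build the JSON array that the "inputs" or "outputs" attribute expects
# from an entity type and zero or more guids.
# build_refs is a hypothetical helper name, not part of Atlas.
build_refs() {
  local type_name=$1 sep="" guid
  shift
  printf '['
  for guid in "$@"; do
    printf '%s{"guid":"%s","typeName":"%s"}' "$sep" "$guid" "$type_name"
    sep=","
  done
  printf ']'
}

# Fetch an entity by its qualifiedName. The guid is expected in the
# "entity.guid" field of the JSON response.
# Assumes the Atlas v2 unique-attribute lookup endpoint on port 21000.
get_entity_by_qualified_name() {
  local atlas_uu_pwd=$1 atlas_host=$2 type_name=$3 qualified_name=$4
  curl -u "${atlas_uu_pwd}" -k -G \
    "http://${atlas_host}:21000/api/atlas/v2/entity/uniqueAttribute/type/${type_name}" \
    --data-urlencode "attr:qualifiedName=${qualified_name}"
}

# Example: the "outputs" value from Example 1.
build_refs hdfs_path caab7a23-6b30-4c66-98f1-b2319841150e
```

A deploy script could call build_refs once for inputs and once for outputs and splice the results into the entity JSON.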
Example 2: No lineage (for request-response types of model, e.g. microservices or ML scoring)
Similar to above, but no inputs and outputs specified.
#!/bin/bash
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
  "entities": [
    {
      "typeName": "model",
      "attributes": {
        "qualifiedName": "model:fraud-persloan-model-v1.1@https://service.bankcompany.com:6532/fraud",
        "name": "fraud-persloan-model-v1.1",
        "deploy.datetime": "2018-10-22_22:01:41EST",
        "deploy.host.type": "microservice",
        "deploy.host.detail": "https://service.bankcompany.com:6532/fraud",
        "deploy.obj.source": "hdfs://prod-nn.bankcompany.com/model-registry/personal-loans/fraud-persloan-model-v1.1/fraud.persloan.lr.zip",
        "model.name": "fraud-persloan-model",
        "model.type": "Spark ML Bayesian learning nn",
        "model.version": "1.1",
        "model.description": "fraud detection for personal loan application",
        "model.owner": "Beth Johnson",
        "model.owner.lob": "personal loans",
        "model.registry.url": "hdfs://prod-nn.bankcompany.com/model-registry/personal-loans/fraud-persloan-model-v1.1"
      }
    }
  ]
}'
Now we can search the 'model' type in the UI and see the results (below).
When we click on 'disease-risk-HAIL-v2.8' we see the attribute values, and when we click on Relationships and then on a DataSet we see the lineage (below).
After we click Relationships we see the image on the left. After then clicking a DataSet in the relationship, we see the lineage on the right (below).
For models deployed with no input and output values, the result is similar to the above, but neither 'Relationships' nor 'Lineage' is created.
The above entity creation was done using hardcoded values. However, in an operational environment these values will be created dynamically for each model deployment (entity creation). In this case the values are gathered by the orchestrator or deploy script, or both, and passed to the curl command. It will look something like this.
#!/bin/bash
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
  "entities": [
    {
      "typeName": "model",
      "attributes": {
        "qualifiedName": "model:'"${3}"'@'"${6}"'",
        "name": "'"${3}"'",
        "deploy.datetime": "'"${4}"'",
        "deploy.host.type": "'"${5}"'",
        "deploy.host.detail": "'"${6}"'",
        "deploy.obj.source": "'"${7}"'",
        "model.name": "'"${8}"'",
        "model.type": "'"${9}"'",
        "model.version": "'"${10}"'",
        "model.description": "'"${11}"'",
        "model.owner": "'"${12}"'",
        "model.owner.lob": "'"${13}"'",
        "model.registry.url": "'"${14}"'"
      }
    }
  ]
}'
Note the careful use of single and double quotes around each shell script variable above. The enclosing single quotes break and then reestablish the JSON string, and the enclosing double quotes allow for spaces inside the variable values.
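The quoting pattern handles spaces, but a value that itself contains a double quote or a backslash would still produce invalid JSON. The sketch below shows the same pattern on a single attribute, combined with a small escaping step. The json_escape function is a hypothetical helper written for this illustration (it covers only backslashes and double quotes, not control characters), and the description value is made up.

```shell
#!/bin/bash
# json_escape is a hypothetical helper: it escapes backslashes and double quotes
# so that a value can sit safely inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# A made-up value containing both spaces and double quotes.
description='fraud detection for "personal loan" application'

# Same quoting pattern as the script above: close the single-quoted JSON,
# insert the double-quoted value, then reopen the single quotes.
payload='{ "model.description": "'"$(json_escape "$description")"'" }'

printf '%s\n' "$payload"
# prints: { "model.description": "fraud detection for \"personal loan\" application" }
```

In the operational script, each positional parameter could be passed through the same helper before being placed in the JSON string.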
We have:
Data science ... you have now been formally governed along with the rest of the data world 🙂
Crank out more models ... we'll take care of the rest!
Appreciation to the Hortonworks Data Governance and Data Science SME groups for their feedback on this idea. Particular appreciation to @Ian B and @Willie Engelbrecht for their deep attention and interest.