Member since: 06-20-2016
Posts: 488
Kudos Received: 432
Solutions: 118
02-26-2019
04:34 PM
Hi Jim, use the log4j library; it has appender configurations that define how logs rotate. Log4j is pretty standard in the Java world. Here is a good tutorial: https://www.journaldev.com/10689/log4j-tutorial
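For illustration, a minimal log4j 1.x properties configuration using a RollingFileAppender that rotates by size might look like the sketch below (my example, not from the tutorial; adjust the file path, max size and backup count to your application):
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
# hypothetical log path; point this at your application's log directory
log4j.appender.file.File=/var/log/myapp/app.log
# rotate once the file reaches 10MB, keeping 5 rotated backups
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=5
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p %c{1} - %m%n
Log4j also ships a DailyRollingFileAppender if you would rather rotate by time than by size.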
01-30-2019
07:21 PM
4 Kudos
Customizing Atlas: Summary of Work to Date

Article | Key Points | Customized Type(s) Developed
---|---|---
Part 1: Model governance, traceability and registry | Quick primer on Atlas types, entities, attributes, lineage, search. Quick primer on customizing Atlas. Use the Atlas REST API to customize any type, entity, or attribute you wish. Customizations integrate seamlessly with out-of-the-box Atlas lineage and search capabilities. Notes on operationalizing. | model: represents your deployed data science and complex Spark ETL models (what was deployed, which version, when, what are its concrete artifacts, etc.)
Part 2: Deep source metadata & embedded entities | Use the Atlas REST API to customize any type/entity/attributes you wish. You can use a hyperlinked entity (vs text) as the value of an attribute (embedded entity pattern). HDFS entities can hold deep metadata from the source. | device: represents a hardware device (in this case a gene sequencing device). gene_sequence: represents gene sequencing data landed in HDFS, as well as its source device and sequence run back in the lab
Part 3: Lineage beyond Hadoop, including reports & emails | The Atlas REST API can be called from any networked system, so metadata from that system can be pushed to Atlas and entities beyond Hadoop can be represented natively in Atlas. Therefore Atlas metadata, search and lineage can span the data and infrastructure landscape. | report_engine: represents a report-generating software deployment. report: represents a report generated by the report engine. email: represents an email that has been sent, including a hyperlink to the report entity as an email attachment

Goals of this Article

The goals of this article are to:
Summarize: combine all of the previous articles' customizations and topics into a single complex data pipeline/lineage example: a genomic analytics pipeline running from gene sequencing in the lab, through multi-step genomic analytics on Hadoop, to a report emailed to a clinician.
Demokit: provide a single-command shell script that builds 5 such pipelines in Atlas, which then lets you explore Atlas' powerful customization, search, lineage and general governance capabilities. The demokit is available at this github repo.

Background: Genomic Analytics Pipeline

A full genomic analytics pipeline is shown in the diagram below. Briefly, the steps in the pipeline are:
[Lab] A device sequences a blood sample and outputs sequence data to a structured file of base pair sequences (often FASTQ format) plus a metadata file describing the sequencing run. The sequence data is ingested to HDFS.
[HDP/Hadoop] Primary analysis: at this point the sequence data is structured as short segments that need to be aligned into chromosomal segments against a reference genome. This is performed by a Spark-BWA model. Output is a BAM file saved to HDFS.
[HDP/Hadoop] Secondary analysis: base pairs that vary from the norm are identified and structured as location and variant in a VCF-formatted file saved to HDFS. This is performed by a Spark GATK model.
[HDP/Hadoop] Tertiary analysis: predictions are made based on the variants identified in the previous step; the example here is disease risk. Inputs are the VCF file and an annotation file that provides features (e.g. environmental exposure) for the predictive model. Output is a risk prediction represented as risk and probability, typically a simple CSV saved to HDFS.
[Reporting] The simple CSV is converted into a consumable report by the reporting engine.
[Reporting] The report is archived and attached to an email which is sent to the clinician to advise on next steps for the patient who provided the sample in step 1.
This will be represented in Atlas search and lineage as shown below (and elaborated in the rest of the article).

Demokit

The demokit repo provides instructions, which are quite simple: 1) set up a cluster (or sandbox), 2) on your local machine, export two environment variables and then run one script with no input parameters (a hedged sketch is shown below). Running the demokit generates 5 such pipeline/lineage instances. If we do an unfiltered search on the gene_sequence type, for example, we get the results below. Clicking the name of any search result opens a view of a single lineage as shown above.
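For orientation, a demokit run looks roughly like this. The two environment variables match the ones used by the curl commands throughout these articles; the script name is hypothetical, so check the repo README for the actual file name.
# hedged sketch of a demokit run (script name is hypothetical; see the repo README)
export ATLAS_HOST=<your-atlas-or-sandbox-host>
export ATLAS_UU_PWD=<atlas-user>:<atlas-password>
./build-demo.sh    # builds 5 pipeline/lineage instances in Atlas, no input params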
Customized Atlas Entities in Genomic Analytics Pipeline/Lineage

The diagram below shows how the customized types are represented in the pipeline/lineage. The table that follows elaborates on each customized type.

Customized Type/Entity | Entity represents [platform] | Searchable Attributes | Article #
---|---|---|---
device | gene sequencing device [lab] | deviceDecomDate, deviceId, deviceImplemDate, deviceMake, deviceModel, deviceType, name | 2
gene_sequence | raw sequence data ingested from device output [hadoop] | device (embedded, device), deviceQualifiedName, name, path, runEndTime, runReads, runSampleId, runStartTime, runTechnician | 2
model | models used in primary, secondary, tertiary analytics [hadoop] | deployDate, deployHostDetail, deployHostType, deployObjSource, modelDescription, modelEndTime, modelName, modelOwnerLob, modelRegistryUrl, modelStartTime, modelVersion, name | 1
report_engine | engine that generates report [reporting platform] | name, reportEngHost, reportEngRegistryUrl, reportEngType, reportEngVersion | 3
report | generated report [reporting platform] | name, reportEndTime, reportFilename, reportName, reportStartTime, reportStorageUrl, reportVersion | 3
email | email sent to doctor, with report attachment [reporting platform] | emailAttachment (embedded, report), emailBcc, emailCc, emailDate, emailFrom, emailSubject, emailTo, name | 3

Atlas Search Examples

The following are examples of searches you can run against the pipelines (pseudocode here). Run the demokit and try the examples yourself.
all pipelines where gene_sequence.runTechnician = Wenwan_Jiao
all pipelines where email.emailTo = DrSmith@thehospital.com
all pipelines where gene_sequence.deviceQualifiedName contains 'iSeq100' (model of device)
all pipelines where model.modelName = genomics-HAIL and model.modelStartTime >= '01/14/2019 12:00 AM' and model.modelStartTime <= '01/21/2019 12:00 AM'
Keep in mind that Atlas search can combine multiple constructs and can become quite complex. A hedged sketch of running such a search through the REST API is shown below.
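As an illustration (not part of the demokit), a simplified version of the model search above, filtering on modelName only, could be run against the Atlas v2 DSL search endpoint; the exact DSL syntax may vary slightly by Atlas version:
curl -u ${ATLAS_UU_PWD} -G "http://${ATLAS_HOST}:21000/api/atlas/v2/search/dsl" \
  --data-urlencode "query=from model where modelName = 'genomics-HAIL'"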
Search can be conducted from:
the UI as basic search (the funnel icon is the most powerful)
the UI as advanced search (DSL)
the REST API

Conclusion

I hope these articles have given you an appreciation for how easily Atlas can be customized to represent metadata and lineage across your data and infrastructure landscape, and how powerful it is to search against it. Keep in mind that we have not even covered classification (tags), tag-based Ranger policies and the business glossary. These additional capabilities cement Atlas as a powerful tool to understand and manage the growing and complex world of data you live in. Atlas is an outstanding governance tool to understand and manage your data landscape at scale ... and to easily customize governance specifically to your needs while seamlessly integrating Atlas' out-of-the-box search, lineage, classification and business glossary capabilities. The only thing holding you back is your imagination 🙂
12-27-2018
09:02 PM
4 Kudos
Introduction

In Customizing Atlas (Part 1): Model governance, traceability and registry I provided a brief overview of Atlas types and entities and showed how to customize them to fit your needs, using the specific example of a model type for governing your deployed data science models and complex Spark code. In Customizing Atlas (Part 2): Deep source metadata and embedded entities I showed how to customize Atlas to hold knowledge of ingested data that goes deeper than the data itself, e.g. details of the device that generated the data. I also showed how to implement the pattern of embedding an entity (not a string) as an attribute value in your custom type; the result is a clickable hyperlink in the UI that opens that entity and its metadata.

In this post I will:
show how to extend your Atlas lineage to include processing and outputs on non-Hadoop systems
represent the above as a single lineage that connects data in Hadoop to a reporting system which generates a report and sends an email with the report attached
emphasize a key principle about Atlas: because its REST API and customized types allow metadata to be sent and represented from any system, Atlas can centralize metadata from your entire data landscape

Concepts and Example

Main concept: the Atlas lineage includes processing and outputs on systems beyond Hadoop.

Example (reporting system): the reporting system inputs HDFS data, outputs a report and archives it, then attaches the report to an email and sends the email. This is shown in the diagram below.

Implementation

The reportGenerator type is implemented as follows (one-time operation):
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
"enumDefs": [],
"structDefs": [],
"classificationDefs": [],
"entityDefs": [
{
"superTypes": ["Process"],
"name": "reportGenerator",
"typeVersion": "1.0",
"attributeDefs": [
{
"name": "qualifiedName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "inputs",
"typeName": "array<DataSet>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "outputs",
"typeName": "array<DataSet>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "reportGenRegistryUrl",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportGenVersion",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportGenType",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportGenHost",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
}
]
}
]
}'
The reportGenerator entity is implemented as follows:
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "reportGenerator",
"attributes": {
"qualifiedName": "reportProcessor-v2.4@reportserver.genomiccompany.com",
"name": "disease-risk-report-v1.3",
"inputs": [{"uniqueAttributes": {"qualifiedName": "/data/genomics/variants/sample-AB15423@prodCluster"}, "typeName": "hdfs_path"}],
"outputs": [
{"uniqueAttributes": {"qualifiedName": "disease-risk-gen-variance@AB15423.pdf"}, "typeName": "report"},
{"uniqueAttributes": {"qualifiedName": "joesmith@company.com@2018-11-12_09:54:12.432"}, "typeName": "email"}
],
"reportGenRegistryUrl": "https://git@github.com/reportengines/genomics/predictive-general",
"reportGenVersion": "2.4",
"reportGenType": "variant-disease-risk",
"reportGenHost": "reportserver.genomiccompany.com"
}
}
]
}'
The report type is implemented as follows (one-time operation):
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
"enumDefs": [],
"structDefs": [],
"classificationDefs": [],
"entityDefs": [
{
"superTypes": ["DataSet"],
"name": "report",
"typeVersion": "1.0",
"attributeDefs": [
{
"name": "qualifiedName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": true,
"isOptional": false,
"isIndexable": true
},
{
"name": "owner",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportVersion",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportFilename",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportStorageURL",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportStartTime",
"typeName": "date",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "reportEndTime",
"typeName": "date",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
}
]
}
]
}'
The report entity is implemented as follows:
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "report",
"attributes": {
"qualifiedName": "disease-risk-gen-variance@AB15423.pdf",
"owner": "jobscheduler",
"name": "disease-risk-gen-variance",
"reportName": "genomics disease risk report - sample AB15423",
"reportVersion": "1.1",
"reportFilename": "genomics-disease-AB15423.pdf",
"reportStorageURL": "s3://genomics-disease/AB15423.pdf",
"reportStartTime": "2018-11-12T09:54:12.432Z",
"reportEndTime": "2018-11-12T09:54:14.341Z"
}
}
]
}'
The email type is implemented as follows (one-time operation):
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
"enumDefs": [],
"structDefs": [],
"classificationDefs": [],
"entityDefs": [
{
"superTypes": ["DataSet"],
"name": "email",
"typeVersion": "1.0",
"attributeDefs": [
{
"name": "qualifiedName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": true,
"isOptional": false,
"isIndexable": true
},
{
"name": "owner",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "emailTo",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "emailFrom",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "emailCc",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "emailBcc",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "emailSubject",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "emailAttachments",
"typeName": "array<DataSet>",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "emailDate",
"typeName": "date",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
}
]
}
]
}'
The email entity is implemented as follows:
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "email",
"attributes": {
"qualifiedName": "joesmith@company.com@2018-11-12_09:54:12.432",
"owner": "jobscheduler",
"name": "email ",
"emailTo": "drsmith@thehospital.com",
"emailFrom": "me@genomicscompany.com",
"emailCc": "archives@thehospital.com",
"emailBcc": "",
"emailSubject": "genomics disease risk report - patient AB15423",
"emailAttachments": [{"uniqueAttributes": {"qualifiedName": "disease-risk-gen-variance@AB15423_r-AA345744.pdf"}, "typeName": "report"}],
"emailDate": "2018-11-12T09:54:14.000Z"
}
}
]
}'
Results in Atlas UI

Lineage: the lineage shows the HDFS input file and the processing and output on the reporting system.

Report generator: when we click the blue gear we see the full metadata on the reporting engine, including the host machine and the URL to its artifacts (e.g. deployed binary, code, etc.). This is shown in the screenshot below.

Report: from the lineage, when we click the report output we see the full metadata on the generated report, including filename, archive location and creation time. This is shown in the screenshot below.

Email: from the lineage, when we click the email output we see familiar information about an email, including to, from, cc, subject and date. This is shown in the screenshot below. Note that the attachment field shows a clickable link to the attached report entity; clicking it leads to the same report screen shown above.

Summary: What have we accomplished?

We can:
show a single lineage of data processing extending from Hadoop and continuing on non-Hadoop systems
represent metadata of processing and outputs on non-Hadoop systems, in this example report engines, reports and emails
The ideas here can be generalized to represent lineage and metadata of processing on any non-Hadoop system. Keep in mind also that you can continue the lineage across multiple systems both upstream and downstream from Hadoop, e.g. external -> Hadoop -> external -> external.

Key points:
Atlas is hosted on Hadoop, but its REST API allows you to send processing knowledge from systems beyond Hadoop
Customized Atlas types let you integrate external system knowledge natively in Atlas
As a result, you can represent knowledge of your full data ecosystem in Atlas, including search and lineage
So ... go out and tame your data landscape with centralized metadata in Atlas that reaches well beyond Hadoop!

References
Atlas Overview
Using Apache Atlas on HDP 3.0
Apache Atlas
Atlas Type System
Atlas REST API
12-24-2018
02:57 PM
6 Kudos
Introduction

In the previous post, Customizing Atlas (Part 1): Model governance, traceability and registry, we:
provided a brief overview of Atlas types and entities
showed how to customize Atlas types and entities to fit your own needs and appear in Atlas search and lineage
customized a special type called model, which inherited from Process and empowered Atlas to govern the deployment of data science models
commented on operationalizing custom entities

In this post we will:
customize an Atlas type to represent deep source metadata (metadata beyond the source data itself)
customize an Atlas type to represent devices (metadata about the actual device that generates data)
embed the device entity in the deep source entity (make the device entity an attribute value in the deep source metadata)
show how device as an attribute value is a clickable link in the Atlas UI that opens the full device entity

Concepts and Example

Main Concepts: the main concepts are summarized in the table below.

Example: Gene sequence data ingest. We will use the following scenario to represent these ideas.
a genomics company has multiple gene sequencing devices
a technician conducts a run on the device, which outputs a blood sample's gene sequence, which in turn is ingested to HDFS
in Atlas, metadata for each device is instantiated as a device entity (the device type inherits from Infrastructure)
in Atlas, metadata for each gene sequence is instantiated as a gene_sequence entity (the gene_sequence type inherits from hdfs_path)
each gene_sequence entity holds deep source metadata (in addition to metadata about the file on HDFS, also metadata about the device that generated the sequence and the specific run on the device, e.g. the technician's name)
gene_sequence has a metadata attribute called device, which holds the actual device entity (not a string)

Implementation

The device type is implemented as follows (one-time operation):
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
"enumDefs": [],
"structDefs": [],
"classificationDefs": [],
"entityDefs": [
{
"superTypes": ["Infrastructure"],
"name": "device",
"typeVersion": "1.0",
"attributeDefs": [
{
"name": "qualifiedName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": true,
"isOptional": false,
"isIndexable": true
},
{
"name": "owner",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": true,
"isOptional": false,
"isIndexable": true
},
{
"name": "deviceId",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deviceType",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deviceModel",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deviceMake",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deviceImplemDate",
"typeName": "date",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deviceDecomDate",
"typeName": "date",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": true,
"isIndexable": true
}
]
}
]
}'
The device entity is instantiated as follows:
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "device",
"attributes": {
"qualifiedName": "Illumina-iSeq100-1092454@gene_sequencer",
"owner": "infra-group",
"name": "Illumina-iSeq100-1092454",
"deviceId": "1092454",
"deviceType": "gene_sequencer",
"deviceModel": "iSeq100",
"deviceMake": "Illumina",
"deviceImplemDate": "2018-08-21T19:49:24.000Z",
"deviceDecomDate": ""
}
}
]
}'
The gene_sequence type is implemented as follows (one-time operation):
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
"enumDefs": [],
"structDefs": [],
"classificationDefs": [],
"entityDefs": [
{
"superTypes": ["hdfs_path"],
"name": "gene_sequence",
"typeVersion": "1.0",
"attributeDefs": [
{
"name": "qualifiedName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": true,
"isOptional": false,
"isIndexable": true
},
{
"name": "owner",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": true,
"isOptional": false,
"isIndexable": true
},
{
"name": "device",
"typeName": "device",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deviceQualifiedName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "runSampleId",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "runReads",
"typeName": "int",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "runStartTime",
"typeName": "date",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "runEndTime",
"typeName": "date",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "runTechnician",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
}
]
}
]
}'
The gene_sequence entity is instantiated as follows:
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "gene_sequence",
"attributes": {
"clusterName": "prod.genomicsanalytics.com",
"isFile": "true",
"fileSize": "2793872046",
"createdBy": "nifi",
"createTime": "2018-11-12T15:10:03.235Z",
"qualifiedName": "/data/sequence-pipeline/device-output/AB12357@prod.genomicsanalytics.com",
"owner": "jobscheduler",
"name": "/data/sequence-pipeline/device-output/AB12357",
"path": "hdfs://data/sequence-pipeline/device-output/AB12357",
"device": {"uniqueAttributes": {"qualifiedName": "Illumina-iSeq100-1092454@gene_sequencer"}, "typeName": "device"},
"deviceQualifiedName": "Illumina-iSeq100-1092454@gene_sequencer",
"runSampleId": "AB12357",
"runReads": "9",
"runTechnician": "Neeraj Gupta",
"runStartTime": "2018-11-12T09:54:12.432Z",
"runEndTime": "2018-11-12T15:09:59.351Z"
}
}
]
}'
Results in Atlas UI

Search for gene_sequence entities: we can now search the gene_sequence type and see results (only one result in this example; I used the 'Columns' dropdown to customize the result columns). Notice the device shown as a hyperlink.

Drill down to all metadata of a single entity: let's first click the name to get the full list of metadata, both inherited from hdfs_path and customized for gene_sequence. Note that we see the standard hdfs_path properties (like fileSize, path, etc.; for ease of development I did not fill in all the values here, as this would be done on ingest to the cluster). We also see the device metadata and the run metadata.

Drill down to device metadata: if we click the link to 'device' (from either of the two places in the screenshots above) we see the following.

A note on search and embedded entities: you'll notice that our embedded customized entity 'device' does not show up in search: we cannot directly search by the attributes of an embedded entity (though we can search by our customized attributes like runStartTime that use native Atlas types). This is the reason I have used the device qualified name (a string) as an attribute. Notice how it is constructed: <make>-<model>-<id>@<type>. This allows us to use the search construct 'contains' to find all gene_sequence entities (i.e. data in HDFS) that match a device's make, model, id or type, or a combination of these. A hedged sketch of such a search through the REST API is shown below.
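As a sketch (my payload, assuming the Atlas v2 basic-search endpoint and its 'contains' operator), such a search could also be issued through the REST API:
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" \
  -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/search/basic -d '{
  "typeName": "gene_sequence",
  "entityFilters": {
    "attributeName": "deviceQualifiedName",
    "operator": "contains",
    "attributeValue": "iSeq100"
  }
}'
This would return every gene_sequence entity whose deviceQualifiedName mentions the iSeq100 model, mirroring the 'contains' search described above.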
Summary: What have we accomplished?

This example: we now have deep knowledge of any gene sequence landed in HDFS. We know:
the gene sequence HDFS path, file size, ingest time, etc.
which device generated the gene sequence data back in the lab
details of the sample run that was sequenced on the device: technician's name, sample id, how long it took to run the sample, etc.

Generalizability: the ideas here can be generalized to capture any source metadata that goes deeper than the data itself, and to embed any custom entity as a clickable attribute value in another entity. Use your imagination ... or rather, govern your data deeply.

References
Atlas Overview
Using Apache Atlas on HDP 3.0
Apache Atlas
Atlas Type System
Atlas REST API

Related Previous Posts
Customizing Atlas (Part 1): Model governance, traceability and registry
Generalized Framework to Deploy Models and Integrate Apache Atlas for Model Governance

Acknowledgements
Appreciation to @eorgad and @Hari rongali for awesome collaboration in the 2018 NE Hackathon, which generated many ideas, including those in this post (and for taking first place!).
12-14-2018
02:40 PM
6 Kudos
Problem Statement: Deploying and Governing Models

Machine Learning and Artificial Intelligence are exploding in importance and prevalence in the enterprise. With this explosive growth come fundamental challenges in governing model deployments ... and doing so at scale. These challenges revolve around answering the following fundamental questions:
Which models were deployed, when, and to where? Was the deployment to a microservice, a Spark context on Hadoop, or something else?
What was the serialized object deployed? How can I find it?
What version was deployed? Who is the owner? What is the larger context around the project?
How do I know the details of the model, i.e. how do I trace the model in production to its actual code, training data, owner, etc.?

Previous article: why and how you should use Atlas to govern your models
Article: Customizing Atlas (Part 1): Model governance, traceability and registry. In that article I showed how Atlas is a powerful and natural fit for storing and searching model and deployment metadata. The main features of the Atlas model metadata developed there are:
searchable metadata of model deployments
searchable metadata of the models that were deployed
traceability of deployed models to a model registry that holds concrete model artifacts (code, training data, serialized model used in deployment, project README.md file, etc.)
data lineage for deployed models that transform data in data pipelines
no lineage generated for models deployed in a request-response context like microservices, which output predictions and have high throughput of data inputs

This article: Generalized Framework to Deploy Models with Apache Atlas for Model Governance. In this article I present an overarching deployment framework that implements this Atlas governance of models and thus allows stakeholders to answer the above questions as the number of deployed models proliferates. Think of the prevalence of ML and AI one, two, five years from now.

The Framework

Personas: the personas involved in the model deployment-governance framework are shown below with their actions.
Model owner: stages model artifacts in a defined structure and provides an overview of the model and project in a Read.me file.
Operations: launches automation that deploys the model, copies artifacts from staging to the model registry and creates a model entity in Atlas for this deployment.
Multiple stakeholders (data scientist, data steward, compliance, production issue troubleshooters, etc.): use Atlas to answer fundamental questions about deployed models and to access concrete artifacts of those models.

Deployment-Governance Framework: details of the deployment-governance framework and the persona interactions with it are shown below.

Step 1: Model owner stages the model artifacts. This includes:
code and training data
README.md file describing the project
metadata.txt with key-value pairs (model.name=<value>, model.type=<>, model.version=<>, model.description=<>, ...)
serialized model for deployment (PMML, MLeap bundle, other)

Step 2: Operations deploys the model via an orchestrator automation. This automation:
2a: retrieves model artifacts from staging
2b: deploys the serialized model
2c: copies artifacts to the model repository (the orchestrator has been aggregating metadata from the previous steps)
2d: creates a new model entity in Atlas using the aggregated metadata

Step 3: Use Atlas to understand deployed models.
The result of a deployment is a model entity created in Atlas (see Customizing Atlas (Part 1): Model governance, traceability and registry for details).
The key capability is Atlas' powerful search against the metadata of deployed models, as shown in the diagram above.
Drilling down into a model entity in the search results provides an understanding of the deployment and the model owner/project, and provides traceability to the concrete model artifacts in the model registry.

Deployment-Governance Framework: Simple Implementation

I show below how to implement the deployment framework. Important point: I have chosen the technologies shown below for a simple demonstration of the framework. Except for Atlas, the technology implementations are your choice. For example, you could deploy your model to Spark on Hadoop instead of to a microservice, or you could use PMML instead of MLeap to serialize your model. In short: this framework is a template and, except for Atlas, the technologies are swappable.

Setting up your environment

MLeap: follow the instructions here to set up a dockerized MLeap runtime: http://mleap-docs.combust.ml/mleap-serving/
HDP: create an HDP cluster sandbox using these instructions.
Atlas model type: when your HDP cluster is running, create your Atlas model type by running:
#!/bin/bash
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
"enumDefs": [],
"structDefs": [],
"classificationDefs": [],
"entityDefs": [
{
"superTypes": ["Process"],
"name": "model",
"typeVersion": "1.0",
"attributeDefs": [
{
"name": "qualifiedName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "inputs",
"typeName": "array<DataSet>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "outputs",
"typeName": "array<DataSet>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "deploy.datetime",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deploy.host.type",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deploy.host.detail",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deploy.obj.source",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.version",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.type",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.description",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.owner",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.owner.lob",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.registry.url",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
}
]
}
]
}'
See Customizing Atlas (Part 1): Model governance, traceability and registry for details.

Running the framework

See the GitHub repo README.md for details on running: https://github.com/gregkeysquest/ModelDeployment-microservice. The main points are shown below.

Staging (GitHub)

See the repo https://github.com/gregkeysquest/Staging-ModelDeploy-v1.0 for details. Main points:
the MLeap bundle (serialized model) is in the path /executable
the file modelMetadata.txt holds metadata about the model that will be pushed to the Atlas model entity; its contents are shown below
model.owner = Greg Keys
model.owner.lob = pricing
model.name = rental pricing prediction
model.type = gradient boosting regression
model.version = 1.1
model.description = model predicts monthly price of rental if property is purchased
model.microservice.endpoint=target
Orchestrator (Groovy calling shell scripts)

The core code for the Groovy orchestrator is shown below.
//STEP 1: retrieve artifacts
println "[STEP 1: retrieve artifacts] ..... downloading repo to tmp: repo=${repo} \n"
processBuilder = new ProcessBuilder("shellScripts/fetchRepo.sh",
repo,
repoCreds,
repoRoot).inheritIO().start().waitFor()
//metadata aggregation
println "[metadata aggregation] ..... gathering model metadata from repo \n "
ModelMetadata.loadModelMetadata(repo,localRepo)
//STEP 2: deploy serialized model
def modelExecutable=new File("tmp/${repo}/executable").listFiles()[0].getName()
println "[STEP 2: deploy serialized model] ..... deploying model to microservice: modelToDeploy=${modelExecutable} \n "
processBuilder = new ProcessBuilder("shellScripts/deployModel.sh",
repo,
deployHostPort,
modelExecutable).inheritIO().start().waitFor()
//STEP 3: put artifacts to registry
def modelRegistryPath="hdfs://${hdfsHostName}:8020${hdfsRegistryRoot}/${repo}"
println "[STEP 3: put artifacts to registry] ..... copying tmp to model registry: modelRegistryPath=${modelRegistryPath} \n "
processBuilder = new ProcessBuilder("shellScripts/pushToRegistry.sh",
repo,
modelRegistryPath,
devMode.toString()).inheritIO().start().waitFor()
//metadata aggregation
println "[metadata aggregation] ..... gathering model deploy metadata \n "
ModelMetadata.loadDeployMetadata(modelRegistryPath,
modelExecutable,
deployHostPort,
deployHostType)
//STEP 4: create Atlas model entity
println "[STEP 4: create Atlas model entity] ..... deploying Atlas entity to ${atlasHost} \n "
processBuilder = new ProcessBuilder("shellScripts/createAtlasModelEntity.sh",
atlasCreds,
atlasHost,
ModelMetadata.deployQualifiedName,
ModelMetadata.deployName,
ModelMetadata.deployDateTime,
ModelMetadata.deployEndPoint,
ModelMetadata.deployHostType,
ModelMetadata.modelExecutable,
ModelMetadata.name,
ModelMetadata.type,
ModelMetadata.version,
ModelMetadata.description,
ModelMetadata.owner,
ModelMetadata.ownerLob,
ModelMetadata.registryURL
)
Notice how the steps map directly to the Deployment-Governance Framework diagram above, and how metadata is processed and aggregated in two steps: one for model metadata and the other for deployment metadata. The code for processing and aggregating metadata is shown here:
class ModelMetadata {
static metadataFileLocation = "staging/modelMetadata.txt"
static Properties props = null
static repo = ""
static owner = ""
static ownerLob = ""
static name = ""
static type = ""
static version = ""
static description = ""
static endpoint = ""
static registryURL = ""
static modelExecutable = ""
static deployEndPoint = ""
static deployHostType = ""
static deployDateTime = ""
static deployName = ""
static deployQualifiedName = ""
static void loadModelMetadata(repo, localRepo){
this.repo = repo
props = new Properties()
def input = new FileInputStream(localRepo +"/modelMetadata.txt")
props.load(input)
this.owner = props.getProperty("model.owner")
this.ownerLob = props.getProperty("model.owner.lob")
this.name = props.getProperty("model.name")
this.type = props.getProperty("model.type")
this.version = props.getProperty("model.version")
this.description = props.getProperty("model.description")
this.endpoint = props.getProperty("model.microservice.endpoint")
}
static loadDeployMetadata(modelRegistryPath, modelExecutable, deployHostPort, deployHostType) {
this.deployDateTime = new Date().format('yyyy-MM-dd_HH:mm:ss', TimeZone.getTimeZone('EST'))+"EST"
this.deployName = "${this.name} v${this.version}"
this.deployQualifiedName = "${this.deployName}@${deployHostPort}".replace(' ', '-')
this.registryURL=modelRegistryPath
this.modelExecutable=modelExecutable
this.deployEndPoint = "http://${deployHostPort}/${this.endpoint}"
this.deployHostType = deployHostType
}
}
Shell Scripts

Each shell script called by the orchestrator is shown in the code blocks below.

Step 1: fetch staging (maps to 2a in diagram)
#!/bin/bash
# script name: fetchRepo.sh
echo "calling fetchRepo.sh"
REPO=$1
REPO_CRED=$2
REPO_ROOT=$3
# create tmp directory to store staging
mkdir -p tmp
cd tmp
# fetch staging and unzip
curl -u $REPO_CRED -L -o $REPO.zip http://github.com/$REPO_ROOT/$REPO/zipball/master/
unzip $REPO.zip
# rename to simplify downstream processing
mv ${REPO_ROOT}* $REPO
# remove zip
rm $REPO.zip
echo "finished fetchRepo.sh"
Step 2: deploy model (maps to 2b in diagram)
#!/bin/bash
# script name: deployModel.sh
echo "starting deployModel.sh"
REPO=$1
HOSTPORT=$2
EXECUTABLE=$3
# copy executable to staging area used to deploy to the target
echo "copying executable to load path with command: cp tmp/${REPO}/executable/* loadModel/"
mkdir loadModel
cp tmp/$REPO/executable/* loadModel/
# simplify special string characters
Q="\""
SP="{"
EP="}"
# create json for curl string
JSON_PATH="${SP}${Q}path${Q}:${Q}/models/${EXECUTABLE}${Q}${EP}"
# create host for curl string
URL="http://$HOSTPORT/model"
# run curl string
echo "running command: curl -XPUT -H \"content-type: application/json\" -d ${JSON_PATH} ${URL}"
curl -XPUT -H "content-type: application/json" -d $JSON_PATH $URL
echo "finished deployModel.sh"
Step 3: copy staging to model repository (maps to 2c in diagram)
#!/bin/bash
# script name: pushToRegistry.sh
## Note: for ease of development there is a local mode to write to the local file system instead of HDFS
echo "calling pushToRegistry.sh"
REPO_LOCAL=$1
HDFS_TARGET=$2
DEV_MODE=$3
cd tmp
echo "copying localRepository=${REPO_LOCAL} to hdfs modelRegistryPath=${HDFS_TARGET}"
if [ "$DEV_MODE" = "true" ]; then
MOCK_REGISTRY="../mockedHDFSModelRegistry"
echo "NOTE: in dev mode .. copying from ${REPO_LOCAL} to ${MOCK_REGISTRY}"
mkdir $MOCK_REGISTRY
cp -R $REPO_LOCAL $MOCK_REGISTRY/
else
sudo hdfs dfs -put $REPO_LOCAL $HDFS_TARGET
fi
echo "finished pushToRegistry.sh"
Step 4: create Atlas model entity (maps to 2d in diagram)
#!/bin/bash
# script name: createAtlasModelEntity.sh
echo "starting createAtlasModelEntity.sh"
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
echo "running command: curl -u ${ATLAS_UU_PWD} -ik -H \"Content-Type: application/json\" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d (ommitting json)"
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "model",
"attributes": {
"qualifiedName": "'"${3}'"",
"name": "'"${4}"'",
"deploy.datetime": "'"${4}"'",
"deploy.host.type": "'"${5}"'",
"deploy.host.detail": "'"${6}"'",
"deploy.obj.source": "'"${7}"'",
"model.name": "'"${8}"'",
"model.type": "'"${9}"'",
"model.version": "1.1",
"model.description": "'"${10}"'",
"model.owner": "'"${11}"'",
"model.owner.lob": "'"${12}"'",
"model.registry.url": "'"${13}"'"
}
}
]
}'
echo "finished createAtlasModelEntity.sh"
Summary: What have we accomplished?

We have:
designed a generalized deployment framework for models that integrates and leverages Atlas as a centralized governance tool for these deployments; one key component is the orchestrator, which aggregates metadata across the process steps and then passes it to Atlas
built upon the implementation and ideas developed in the previous article
presented a simple implementation using the technologies shown above
Remember the key point that the deployment framework presented here is generalizable: except for Atlas you can plug in your choice of technologies for the orchestration, staging, model hosting and model repository, including elaborating the framework into a formal software development framework of your choice.

References
Customizing Atlas (Part 1): Model governance, traceability and registry
Atlas brief
Atlas deep
Groovy
GitHub
MLeap

Acknowledgements
Appreciation to the Hortonworks Data Science SME groups for their feedback on this idea. Particular appreciation to @Ian B and @Willie Engelbrecht for their deeper attention and interest.
12-12-2018
05:08 PM
15 Kudos
Problem Statement: Model Governance

Data science and model building are prevalent activities that bring new and innovative value to enterprises. The more prevalent this activity becomes, the more problematic model governance becomes. Model governance typically centers on these questions:
What models were deployed? When? Where?
What was the serialized object deployed?
Was the deployment to a microservice, a Spark context, or something else?
What version was deployed?
How do I trace the deployed model to its concrete details: the code, its training data, owner, Read.me overview, etc.?

Apache Atlas is the central tool for organizing, searching and accessing metadata of data assets and processes on your Hadoop platform. Its REST API can accept metadata pushed from anywhere, so Atlas can also represent metadata from off your Hadoop cluster. Atlas lets you define your own types of objects and inherit from existing out-of-the-box types. This lets you store whatever metadata you want to store, and tie it into Atlas's powerful search, classification and taxonomy framework.

In this article I show how to create a custom Model object (or more specifically 'type') to manage model deployments the same way you govern the rest of your data processes and assets using Atlas. This custom Model type lets you answer all of the above questions for any model you deploy. And it does so at scale while your data science or complex Spark transformation models explode in number and you transform your business to enter the new data era. In a subsequent article I implement the Atlas work developed here into a larger model deployment framework: https://community.hortonworks.com/articles/229515/generalized-model-deployment-framework-with-apache.html

A very brief primer on Atlas: types, entities, attributes, lineage and search

Core Idea: the diagram below represents the core concepts of Atlas: types, entities and attributes. (Let's save classification and taxonomy for another day.) A type is an abstract representation of an asset; it has a name and attributes that hold metadata on that asset. Entities are concrete instances of a type. For example, hive_table is a type that represents any Hive table in general. When you create an actual Hive table, a new hive_table entity is created in Atlas, with attributes like table name, owner, create time, columns, external vs managed, etc. Atlas comes out of the box with many types, and services like Hive have hooks to Atlas that auto-create and modify entities. You can also create your own types (via the Atlas UI or REST API). After that, you are in charge of instantiating entities, which is easy to do via the REST API called from your job scheduler, deploy script or both.

System-Specific Types and Inheritance: Atlas types are organized around the inheritance model shown below. Out-of-the-box types like hive_table inherit from it, and when you create customized types you should too. The most commonly used parent types in Atlas are DataSet (which represents any type and level of stored data) and Process (which represents a transformation of data).

Lineage: notice that Process has an attribute for an array of one or more input DataSets and another for output DataSets. This is how Process creates lineages from data processed into new data, as shown below.

Search: now that Atlas is filled with types, entities and lineages, how do you make sense of it all? Atlas has extremely powerful search constructs that let you find entities by attribute values (you can assemble AND/OR constructs among the attributes of a type, using equals, contains, etc.). And of course, anything performed in the UI can be done through the REST API.

Customizing Atlas for model governance

My approach: I first review a customized Model type and then show how to implement it.
Implementation comes in two steps: (1) create the custom Model type, and then (2) instantiate it with Model entities as they are deployed in your environment. I make a distinction between models that are (a) deployed on Hadoop in a data pipeline processing architecture (e.g. complex Spark transformation or data engineering models) and (b) deployed in a microservices or machine learning environment. In the first, data lineage makes sense (there is a clear input, transformation, output pipeline), whereas in the second it does not (it is more of a request-response model with a high throughput of requests). I also show the implementation first as hard-coded examples and then as an operational example where values are dynamic at deploy time. In a subsequent article I implement the customized Model type in a fully automated model deployment and governance framework.

Customized Atlas Type: Model

The customized model type is shown in the diagram below. You can of course exclude the attributes shown or include new ones as appropriate for your needs. Key features are:
deploy.*: attributes starting with deploy describe metadata around the model deployment runtime
deploy.datetime: the date and time the model was deployed
deploy.host.type: type of hosting environment for the deployed model (e.g. microservice, hadoop)
deploy.host.detail: specifically where the model was deployed (e.g. microservice endpoint, hadoop cluster)
deploy.obj.source: location of the serialized model that was deployed
model.*: attributes describing the model that was deployed (most are self-explanatory)
model.registry.url: provides traceability to model details; points to the model registry holding model artifacts including code, training data, Read.me by owner, etc.

Step 1: Create customized model type (one-time operation)

Use the REST API by running the below curl command with its JSON payload.
#!/bin/bash
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/types/typedefs -d '{
"enumDefs": [],
"structDefs": [],
"classificationDefs": [],
"entityDefs": [
{
"superTypes": ["Process"],
"name": "model",
"typeVersion": "1.0",
"attributeDefs": [
{
"name": "qualifiedName",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "inputs",
"typeName": "array<DataSet>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "outputs",
"typeName": "array<DataSet>",
"isOptional": true,
"cardinality": "SET",
"valuesMinCount": 0,
"valuesMaxCount": 2147483647,
"isUnique": false,
"isIndexable": false,
"includeInNotification": false
},
{
"name": "deploy.datetime",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deploy.host.type",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deploy.host.detail",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "deploy.obj.source",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.name",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.version",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.type",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.description",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.owner",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.owner.lob",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
},
{
"name": "model.registry.url",
"typeName": "string",
"cardinality": "SINGLE",
"isUnique": false,
"isOptional": false,
"isIndexable": true
}
]
}
]
}'
}'
Notice we are (a) using superType 'Process', (b) giving the type the name 'model', and (c) creating new attributes in the same attributeDefs construct as those inherited from Process.

Step 1 result: when we go to the Atlas UI we see the 'model' type listed with the other types, and we see the customized attribute fields in the Columns dropdown.

Step 2: Create model entity (each time you deploy a model or a new model version)

Example 1: With lineage (for clear input/process/output processing of data). Notice two DataSets (type 'hdfs_path') are input to the model and one is output, as identified by their Atlas guids.
#!/bin/bash
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "model",
"attributes": {
"qualifiedName": "model:disease-risk-HAIL-v2.8@ProdCluster",
"name": "disease-risk-HAIL-v2.8",
"deploy.datetime": "2018-12-05_15:26:41EST",
"deploy.host.type": "hadoop",
"deploy.host.detail": "ProdCluster",
"deploy.obj.source": "hdfs://prod0.genomicscompany.com/model-registry/genomics/disease-risk-HAIL-v2.8/Docker",
"model.name": "disease-risk-HAIL",
"model.type": "Spark HAIL",
"model.version": "2.8",
"model.description": "disease risk prediction for sequenced blood sample",
"model.owner": "Srinivas Kumar",
"model.owner.lob": "genomic analytics group",
"model.registry.url": "hdfs://prod0.genomicscompany.com/model-registry/genomics/disease-risk-HAIL-v2.8",
"inputs": [
{"guid": "cf90bb6a-c946-48c8-aaff-a3b132a36620", "typeName": "hdfs_path"},
{"guid": "70d35ffc-5c64-4ec1-8c86-110b5bade70d", "typeName": "hdfs_path"}
],
"outputs": [{"guid": "caab7a23-6b30-4c66-98f1-b2319841150e", "typeName": "hdfs_path"}]
}
}
]
}'
}'
Example 2: No lineage (for request-response models, e.g. microservices or ML scoring). Similar to the above, but with no inputs or outputs specified.
#!/bin/bash
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "model",
"attributes": {
"qualifiedName": "model:fraud-persloan-model-v1.1@https://service.bankcompany.com:6532/fraud",
"name": "fraud-persloan-model-v1.1",
"deploy.datetime": "2018-10-22_22:01:41EST",
"deploy.host.type": "microservice",
"deploy.host.detail": "https://service.bankcompany.com:6532/fraud",
"deploy.obj.source": "hdfs://prod-nn.bankcompany.com/model-registry/personal-loans/fraud-persloan-model-v1.1/fraud.persloan.lr.zip",
"model.name": "fraud-persloan-model",
"model.type": "Spark ML Bayesian learning nn",
"model.version": "1.1",
"model.description": "fraud detection for personal loan application",
"model.owner": "Beth Johnson",
"model.owner.lob": "personal loans",
"model.registry.url": "hdfs://prod-nn.bankcompany.com/model-registry/personal-loans/fraud-persloan-model-v1.1"
}
}
]
}'
Step 2 result: now we can search the 'model' type in the UI and see the results (below). When we click on 'disease-risk-HAIL-v2.8' we see the attribute values, and when we click on Relationships and then on a DataSet we see the lineage (below). After clicking Relationships we see the image on the left; after then clicking a DataSet in the relationship, we see the lineage on the right (below). For models deployed with no input and output values, the result is similar but no Relationships or Lineage is created.

A Note on Operationalizing

The above entity creation was done using hardcoded values. In an operational environment, however, these values will be created dynamically for each model deployment (entity creation). In this case the values are gathered by the orchestrator or deploy script, or both, and passed to the curl command. It will look something like this.
#!/bin/bash
ATLAS_UU_PWD=$1
ATLAS_HOST=$2
curl -u ${ATLAS_UU_PWD} -ik -H "Content-Type: application/json" -X POST http://${ATLAS_HOST}:21000/api/atlas/v2/entity/bulk -d '{
"entities": [
{
"typeName": "model",
"attributes": {
"qualifiedName": "model:'"${3}"'@'"${6}"'",
"name": "'"${3}"'",
"deploy.datetime": "'"${4}"'",
"deploy.host.type": "'"${5}"'",
"deploy.host.detail": "'"${6}"'",
"deploy.obj.source": "'"${7}"'",
"model.name": "'"${8}"'",
"model.type": "'"${9}"'",
"model.version": "'"${10}"'",
"model.description": "'"${11}"'",
"model.owner": "'"${12}"'",
"model.owner.lob": "'"${13}"'",
"model.registry.url": "'"${14}"'"
}
}
]
}'
Do notice the careful use of single and double quotes around each shell script variable above: the enclosing single quotes break and then re-establish the JSON string, and the enclosing double quotes allow for spaces inside the variable values. A small illustration of the pattern is shown below.
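Here is a small, self-contained illustration of that quoting pattern (my example, not from the deployment script):
# the single quotes end the JSON literal, the double-quoted variable is expanded,
# then the single quotes resume the JSON literal
NAME="disease-risk-HAIL v2.8"
echo '{"name": "'"${NAME}"'"}'
# prints: {"name": "disease-risk-HAIL v2.8"}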
Atlas Overview: Using Apache Atlas on HDP 3.0, Apache Atlas, Atlas Type System, Atlas REST API
Misc HCC articles:
https://community.hortonworks.com/articles/136800/atlas-entitytag-attribute-based-searches.html https://community.hortonworks.com/articles/58932/understanding-taxonomy-in-apache-atlas.html https://community.hortonworks.com/articles/136784/intro-to-apache-atlas-tags-and-lineage.html https://community.hortonworks.com/articles/36121/using-apache-atlas-to-view-data-lineage.html https://community.hortonworks.com/articles/63468/atlas-rest-api-search-techniques.html (but v1 of API) https://community.hortonworks.com/articles/229220/adding-atlas-classification-tags-during-data-inges.html Implementing the ideas here into a model deployment framework: https://community.hortonworks.com/articles/229515/generalized-model-deployment-framework-with-apache.html Acknowledgements Appreciation to the Hortonworks Data Governance and Data Science SME groups for their feedback on this idea. Particular appreciation to @Ian B and @Willie Engelbrecht for their deep attention and interest.
... View more
11-21-2018
08:17 PM
10 Kudos
Introduction Naïve Bayes is a machine learning model that is simple and
computationally light yet also accurate in classifying text as compared to more
complex ML models. In this article I
will use the Python scikit-learn libraries to develop the model. It is developed on a Zeppelin notebook on top
of the Hortonworks Data Platform (HDP) and uses its %spark2.pyspark interpreter
to run python on top of Spark. We will use a news feed to train the model to classify
text. We will first build the basic
model, then explore its data and attempt to improve the model. Finally, we will compare performance accuracy
of all the models we develop. A note on code structuring: Python import statements are
introduced when required by the code and not all at once upfront. This is to
relate the packages closely to the code itself. The Zeppelin template for the full notebook can be obtained from here. A. Basic Model Get the data We will use news feed data that has been classified as
either 1 (World), 2 (Sports), 3 (Business) or 4 (Sci/Tech). The data is structured as CSV with fields:
class, title, summary. (Note that later processing converts the labels to 0,1,2,3 respectively). Engineer the
data The class, title and summary fields are appended to their
own arrays. Data cleansing is done in
the form of punctuation removal and conversion to lowercase. Convert the
data The scikit-learn packages need these arrays
to be represented as dataframes, which assign an integer index to each row of the array inside the data structure. Note: this first model will use summaries to classify text, as shown in the code. (A rough stand-in for the data preparation is sketched below.)
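The notebook code appears as screenshots in the original post; as a rough stand-in, the preparation might look like the sketch below (the file name and the exact cleansing rule are assumptions).
import re
import pandas as pd

# Hypothetical file name; layout is class,title,summary with no header row
raw = pd.read_csv("newsfeed_train.csv", names=["class", "title", "summary"])

def clean(text):
    """Remove punctuation and convert to lowercase, as described above."""
    return re.sub(r"[^\w\s]", " ", str(text)).lower()

summaries = raw["summary"].apply(clean)   # this first model classifies summaries
titles    = raw["title"].apply(clean)
labels    = raw["class"] - 1              # convert labels 1-4 to 0-3, as noted above

# scikit-learn accepts these pandas structures directly
data = pd.DataFrame({"text": summaries, "label": labels})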
Vectorize the data
Now we start using the machine learning packages. Here we convert the dataframes into a
vector. The vector is a wide n x m matrix
with n records and for each record m fields that hold a position for each word
detected among all records, and the word frequency for that record and position. This is a sparse matrix since most m
positions are not filled. We see from the output that there are 7600 records and 20027 words. The vector shown in the output is partial, showing part of the 0th record with word index positions 11624, 6794, 6996, etc. (A minimal sketch of this step follows.)
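Continuing the preparation sketch above, the vectorization step in minimal form:
from sklearn.feature_extraction.text import CountVectorizer

# Build the sparse n x m term-count matrix described above
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(data["text"])

print(X_counts.shape)               # (number of records, number of distinct words)
print(len(vectorizer.vocabulary_))  # size of the learned vocabulary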
Fit (train) the model and determine its accuracy
Let's fit the model; a sketch follows below. Note that we split the data into a training set with 80% of the records and a validation set with the remaining 20%. Wow! 87% of our tests accurately predicted the text classification. That's good.
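Continuing the sketches above (80/20 split as described):
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# 80% train / 20% validation split of the count vectors and labels
X_train, X_test, y_train, y_test = train_test_split(
    X_counts, data["label"], test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)

print("validation accuracy:", model.score(X_test, y_test))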
Deeper dive on results
We can look more deeply than the single accuracy score reported above. One way is to generate a confusion matrix, as shown below (a sketch of its computation follows). The confusion matrix shows, for any single true class, the proportion of predictions made against each predicted class. We see that Sports text almost always had its classification predicted correctly (0.96). Business and Sci/Tech were a bit more blurred: when Business text was incorrectly predicted, it was usually predicted as Sci/Tech, and the converse for Sci/Tech. This all makes sense, since Sports vocabulary is quite distinctive and Sci/Tech is often in the Business news. There are other views of model outcomes ... check the sklearn.metrics API.
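Continuing the sketches above; the row normalization is done by hand so the snippet does not depend on a newer scikit-learn version:
import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

# Normalize each row so it shows, for each true class, the proportion
# predicted as each class (rows sum to 1.0)
cm_normalized = cm / cm.sum(axis=1, keepdims=True)
print(np.round(cm_normalized, 2))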
Test the model with a single news feed summary
Now take a single news feed summary and see what the model predicts (a sketch follows below). I have run many through the model and it performs quite well. The new text shown above gets a clear Business classification. When I run news summaries on cultural items (no category in the model), the predicted probabilities are low and spread across all categories, as expected.
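Continuing the sketches above; the sample text and the class-name ordering are assumptions:
# Hypothetical new summary to classify
new_text = "Shares of the retailer jumped after quarterly earnings beat expectations"

new_counts = vectorizer.transform([clean(new_text)])

# Assumed ordering of the four classes behind labels 0-3
class_names = ["World", "Sports", "Business", "Sci/Tech"]

probs = model.predict_proba(new_counts)[0]
for name, p in zip(class_names, probs):
    print(f"{name}: {p:.3f}")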
B. Explore the data
Word and n-gram counts
Let's get the top 25 words and n-grams (here, two-word phrases) among all training set text that were used to build the
model. These are shown below. Hmm ... there are a lot of common meaningless words
involved. Most of these are known as
stopwords in natural language processing.
Let’s remove the stop words and see if the model improves. Remove
stopwords: retrieve list from file and fill array The above are stopwords from a file. The below allows you to iteratively add
stopwords to the list as you explore the data. Word and
n-gram counts with stopwords removed Now we can see the top 25 words and n-grams after the
stopwords are removed. Note how easy
this is to do: we instantiate the CountVectorizer exactly as before, but by
passing a reference to the stopword list (see the sketch below). This is a good example of how powerful the scikit-learn libraries are: you interact with the high-level APIs and the dirty work is done under the surface.
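Continuing the sketches above; the stopword file name is an assumption, and ngram_range=(1, 2) is used so two-word phrases are counted as well:
from sklearn.feature_extraction.text import CountVectorizer

# Load stopwords from a file, one word per line (hypothetical file name)
with open("stopwords.txt") as f:
    stop_words = [line.strip() for line in f if line.strip()]

# Same vectorizer as before, but with the stopword list passed in
vectorizer_ns = CountVectorizer(stop_words=stop_words, ngram_range=(1, 2))
X_ns = vectorizer_ns.fit_transform(data["text"])

# Top 25 most frequent terms across the training text
totals = X_ns.sum(axis=0).A1                    # total count per term
terms = vectorizer_ns.get_feature_names_out()   # older scikit-learn: get_feature_names()
top25 = sorted(zip(terms, totals), key=lambda t: -t[1])[:25]
print(top25)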
C. Try to improve the model
Now we train the same model on news feed summaries with no stop words (left) and on titles with no stop words. Interesting ... the model using summaries with no stop words is as accurate as the one with them included in the text. Secondly, the titles model is less accurate
than the summary model, but not by much (not bad for classifying text from
samples of only 10-20 words).
D. Comparison of model accuracies
I trained each of the below models 5 times: news feed text from summaries (no stops, with stops) and from titles (no stops, with stops). I averaged the accuracies and plotted them as shown below (a plotting sketch follows).
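A sketch of the comparison chart; the accuracy values are placeholders, not the measured averages:
import matplotlib.pyplot as plt

# Placeholder averages of 5 runs each; substitute the measured values
models = ["summary\n(with stops)", "summary\n(no stops)",
          "title\n(with stops)", "title\n(no stops)"]
avg_accuracy = [0.87, 0.87, 0.84, 0.84]

plt.bar(models, avg_accuracy)
plt.ylabel("average accuracy (5 runs)")
plt.title("Naive Bayes text classifier: model comparison")
plt.ylim(0.0, 1.0)
plt.show()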
E. Conclusions
Main conclusions are:
- Naive Bayes effectively classified text, particularly given the small text sizes (news titles, news summaries of 1-4 sentences)
- Summaries classified with more accuracy than titles, but not by much considering how few attributes (words) are in a single title
- Removing stop words had no significant effect on model accuracy. This should not be surprising, because stop words are expected to be represented equally among classifications
- Recommended model: summary with stop words retained, because it has the highest accuracy and lower code complexity and processing needs than the equally accurate summary-with-stop-words-removed model
- This model is susceptible to overfitting, because the stories represent a window of time and their text (news words) is accordingly biased toward the "current" events of that time
F. References
Zeppelin: https://hortonworks.com/apache/zeppelin/
The Zeppelin notebook for this article: https://github.com/gregkeysquest/DataScience/tree/master/projects/naiveBayesTextClassifier/newsFeeds
http://rss.cnn.com/rss/cnn_topstories.rss
https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/
https://scikit-learn.org/stable/
https://matplotlib.org/
... View more
03-02-2018
09:07 PM
26 Kudos
Introduction Transparent
Data Encryption (TDE) encrypts HDFS data on disk ("at rest"). It works through an interaction of multiple Hadoop components and security
It works thorough an interaction of multiple Hadoop components and security
keys. Although the general idea may be easy to understand, major misconceptions
can occur and the detailed mechanics and many acronyms can get confusing.
Having an accurate and detailed understanding can be valuable, especially when
working with security stakeholders who often serve as strict gatekeepers in
allowing technology in the enterprise. Transparent Data
Encryption: What it is and is not TDE
encrypts HDFS data on disk in a way that is transparent to the user, meaning the user accesses HDFS data encrypted on disk identically to accessing
non-encrypted data. No knowledge or implementation changes are needed on the
client side and the user sees the data in its unencrypted form. TDE works at
the file path or encryption zone level: all files written to designated paths
(zones) are encrypted on disk. The
goal of TDE is to prevent anyone who is inappropriately trying to access data
from doing so. It guards against the threat, for example, of someone finding a
disk in a dumpster or stealing one, or someone poking around HDFS who is not a
user of the application or use case accessing the data. The goal of TDE is not to hide sensitive data elements
from authorized users (e.g. masking social security numbers). Ranger policies for authorization, column
masking and row-filtering are for that. Do
note that Ranger policies like masking can be applied on top of TDE data
because Ranger policies are implemented post-decryption. That again is the transparent part of TDE. The below chart provides a brief comparison of TDE with
related approaches to data security.
Data Security Approach | Description | Threat category |
---|---|---|
Encrypted drives & full disk encryption | Contents of the entire disk are encrypted. Access requires the proper authentication key or password. Accessed data is unencrypted. | Unintended access to hard drive contents. |
Transparent Data Encryption on Hadoop | Individual HDFS files are encrypted. All files in designated paths (zones) are encrypted, whereas those outside these paths are not. Access requires user, group or service inclusion in a policy against the zone. Accessed data is unencrypted. Authorization includes read and/or write for the zone. | Unintended access to designated HDFS data on the hard drive. |
Ranger access policies (e.g. authorization to HDFS paths, Hive column masking, row filtering, etc) | No relevance to encryption. Access requires user or group inclusion in an authorization policy. Accessed data is either the full file, table, etc, masked columns, or filtered rows. Authorization includes read, write, create table, etc. | Viewing HDFS data or data elements that are inappropriate for the user (e.g. social security number). |
Accessing TDE Data: High
Level View In Hadoop, users access HDFS through a service (HDFS, Hive, etc). The service accesses HDFS data through its own HDFS client, which is abstracted from the user. For files written to or read from encryption zones, the service's HDFS client contacts a Key Management Service (KMS) to validate the user's and/or service's read/write permissions to the zone and, if permitted, the KMS provides the master key to encrypt (write) or decrypt (read) the file. To decrypt a Hive table, for example, the encrypted zone would represent the path to the table data and the Hive service would need read/write
permissions to that zone. The KMS master key is called the Encryption Zone Key (EZK), and encrypting/decrypting involves two additional keys and a detailed sequence of events. This sequence is described below. Before that though, let's get to know
the acronyms and technology components that are involved and then put all of
the pieces together. One note on encryption zones: EZs can be nested within
parent EZs, i.e. an EZ can hold encrypted files and an EZ that itself holds
encrypted files and so on. (More on this
later). Acronym Soup
Let's simply name the acronyms before we understand them.
TDE: Transparent Data Encryption
EZ: Encryption Zone
KMS: Key Management Server (or Service). Ranger KMS is part of the Hortonworks stack.
HSM: Hardware Security Module
EZK: Encryption Zone Key
DEK: Data Encryption Key
EDEK: Encrypted Data Encryption Key
The Puzzle Pieces
In this section let's just look at the pieces of the puzzle
before putting them all together. Let’s
first look at the three kinds of keys and how they map to the technology
components on Hadoop that are involved in TDE. Key Types Three key types are involved
in encrypting and decrypting files. The main idea is that each file in an EZ is encrypted and
decrypted by a DEK. Each file has its
own unique DEK. The DEK will be encrypted into an EDEK and decrypted back to
the DEK by the EZK for the particular EZ the file belongs to. Each zone has its own single EZK. (We will see that EZKs can be rotated, in which case the new EZK retains its name but increments to a new version number.)
Hadoop Components
Five Hadoop components are involved in TDE.
Service HDFS client: Uses a file's DEK to encrypt/decrypt the file.
Name Node: Stores each file's EDEK (encrypted DEK) in the file's metadata.
Ranger KMS: Uses the zone's EZK to encrypt the DEK to an EDEK, or decrypt the EDEK back to the DEK.
Keystore: Stores each zone's EZK.
Ranger UI: UI tool for a special admin (keyadmin) to manage EZKs (create, name, rotate) and policies against the EZK (the users, groups and services that can use it to read and/or write files to the EZ).
Notes
Ranger KMS
and UI are part of the Hortonworks Hadoop stack, but 3rd-party KMSs may be used. The keystore is a Java keystore native to Ranger and thus software based. Compliance requirements like PCI, required by financial institutions, demand that keys are managed and stored in hardware (HSM), which is deemed more secure. Ranger KMS can integrate with SafeNet Luna
HSM in this case. Implementing TDE: Create
encryption zone Below are the steps to create an encryption zone. All files written to an encryption zone will
be encrypted. First, the keyadmin user leverages the Ranger UI to create
and name an EZK for a particular zone. Then
the keyadmin applies a policy against that key and thus zone (who can read
and/or write to the zone). The who here
are actual services (e.g. Hive) and users and groups, which can be synced from
an enterprise AD/LDAP instance. Next, an hdfs superuser runs the hdfs commands shown above to instantiate the zone in HDFS (a scripted sketch of this step follows below).
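The exact commands from the screenshot are not reproduced here. As a rough sketch (assuming the standard hdfs crypto CLI, hypothetical key and zone names, and an EZK already created by keyadmin in Ranger), the step can be scripted, for example from Python:
import subprocess

def run(cmd):
    """Run a CLI command, echo it, and fail if it returns non-zero."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Hypothetical names; the EZK itself is created and named by keyadmin in the Ranger UI
key_name = "finance_ezk"
zone_path = "/data/finance"

run(["hdfs", "dfs", "-mkdir", "-p", zone_path])              # create the directory
run(["hdfs", "crypto", "-createZone",                        # make it an encryption zone
     "-keyName", key_name, "-path", zone_path])
run(["hdfs", "crypto", "-listZones"])                        # confirm the zone exists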
At this point, files can be transparently written/encrypted to the zone and read/decrypted from it. Recall the importance of the transparent aspect:
applications (e.g. SQL client accessing Hive) need no special implementation to
encrypt/decrypt to/from disk because that implementation is abstracted away
from the application client. The details
of how that works are shown next. Write Encrypted File (Part
1): generate file-level EDEK and store on Name Node When a file is written to an encryption zone an EDEK (which
encrypts a DEK unique to the file) is created and stored in the Name Node
metadata for that file. This EDEK will
be involved with both writing the encrypted file and decrypting it when reading. When the service HDFS client tells the Name Node it wants to
write a file to the EZ, the Name Node requests the KMS to return a unique DEK
encrypted as an EDEK. The KMS does this by generating a unique DEK, pulling from the keystore the EZK for the file's
intended zone, and using this to encrypt the DEK to EDEK. This EDEK is returned to the NameNode and
stored along with the rest of the file’s metadata. Write Encrypted File (Part
2): decrypt EDEK to use DEK for encrypting file The immediate next step to the above is for the service HDFS
client to use the EDEK on the NameNode to expose its DEK to then encrypt the
file to HDFS. The process is shown in the diagram above. The KMS again uses the EZK, but this time to decrypt the EDEK to the DEK. When the
service HDFS client retrieves the DEK it uses it to encrypt the file blocks to
HDFS. (These encrypted blocks are
replicated as such). Read Encrypted File:
decrypt EDEK and use DEK to decrypt file Reading the encrypted file runs the same process as writing
(sending EDEK to KMS to return DEK). Now
the DEK is used to decrypt the file. The
decrypted file is returned to the user, who (transparently) need not know or do
anything special in accessing the file in HDFS.
The process is shown explicitly below. Miscellaneous Encrypt Existing Files
in HDFS Encrypting existing files in HDFS cannot be done where they
reside. The files must be copied to an
encryption zone to do so (distcp is an effective way to do this, but other
tools like NiFi or shell scripts with HDFS commands could be used). Deleting the original directory and renaming the encryption zone to its name leaves the end state identical to the
beginning but with the files encrypted. Rolling Keys Enterprises often require keys to roll (replace) at intervals,
typically 2 years or so. Rolling EZKs does
NOT
require existing encrypted files to be re-encrypted. It works as follows:
Keyadmin updates EZK via Ranger UI. Name is retained and KMS increments key to
new version. EDEKs on NameNodes are reencrypted (KMS decrypts
EDEK with old key, resulting DEK encrypted with new key, resulting EDEK placed
back on NameNode file metadata). All new writes and reads of existing files use
new EZK. TDE performance
benchmarking TDE does effect performance but it is not extreme. In general, writes are affected more than
reads, which is good because performance SLAs are usually focused more on users
reading data than users or batch processes writing it. Below is a representative benchmarking. Encrypting
Intermediate Map-Reduce or Tez Data Map-reduce and Tez write data to disk as intermediate steps
in a multi-step job (Tez less so than M-R and thus its faster performance). Encrypting this intermediate data can be done
by setting the configurations shown in this link. Encrypting intermediate data will add greater
performance overhead and is an especially cautious measure, since the data remains on disk for short durations of time and is much less likely to be compromised compared to
HDFS data. A Side Note on HBase Apache HBase has its own encryption at rest framework that is similar to HDFS TDE (see link). This is not officially supported on HDP. Instead, encryption at rest for HBase should use HDFS TDE as described in this article and specified here. References TDE general https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_security/content/ch_hdp-security-guide-hdfs-encryption.html https://hortonworks.com/blog/new-in-hdp-2-3-enterprise-grade-hdfs-data-at-rest-encryption/
slide https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html Ranger KMS https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_security/content/ranger-kms-admin-guide.html https://hadoop.apache.org/docs/stable/hadoop-kms/index.html TDE Implementation Quick
Steps https://community.hortonworks.com/content/supportkb/49505/how-to-correctly-setup-the-hdfs-encryption-using-r.html https://community.hortonworks.com/content/kbentry/42227/using-transparent-data-encryption-in-hdfs.html TDE Best Practices https://community.hortonworks.com/questions/74758/what-are-the-best-practices-around-hdfs-transparen.html
... View more
02-13-2018
01:51 AM
Thanks @Sreekanth Munigati ... that worked! s3a://demo/ does not work; s3a://demo/folder does!
... View more
02-12-2018
01:39 AM
Issue: The issue I am having is the one described here, but setting the two configs is not working for me (it seems to work for some and not others): https://forums.aws.amazon.com/message.jspa?messageID=768332 Below is the full description. Goal: I am writing this test query with a small data size to output results to S3. INSERT OVERWRITE DIRECTORY 's3a://demo/'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
select * from demo_table;
Notes:
This query runs when output to an HDFS directory. I can create an external table locally against S3 remotely ... so my configurations are working (CREATE EXTERNAL TABLE ... LOCATION 's3a://demo/'; works). Only when outputting a query to S3 do I get a failure (below). Error: The error when attempting the query to output to S3 is: 2018-02-12 01:12:58,790 INFO [HiveServer2-Background-Pool: Thread-363]: log.PerfLogger (PerfLogger.java:PerfLogEnd(177)) - </PERFLOG method=releaseLocks start=1518397978790 end=1518397978790 duration=0 from=org.apache.hadoop.hive.ql.Driver>
2018-02-12 01:12:58,791 ERROR [HiveServer2-Background-Pool: Thread-363]: operation.Operation (SQLOperation.java:run(258)) - Error running hive query:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:324)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:199)
at org.apache.hive.service.cli.operation.SQLOperation.access$300(SQLOperation.java:76)
at org.apache.hive.service.cli.operation.SQLOperation$2$1.run(SQLOperation.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hive.service.cli.operation.SQLOperation$2.run(SQLOperation.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2018-02-12 01:12:58,794 INFO [HiveServer2-Handler-Pool: Thread-106]: session.HiveSessionImpl (HiveSessionImpl.java:acquireAfterOpLock(342)) - We are setting the hadoop caller context to 5e6f48a9-7014-4d15-b02c-579557b5fb98 for thread HiveServer2-Handler-Pool: Thread-106
Additional note: The query writes the tmp files to 's3a://demo/' but then fails with the above error. Tmp files look like [hdfs@gkeys0 centos]$ hdfs dfs -ls -R s3a://demo/ drwxrwxrwx - hdfs hdfs 0 2018-02-12 02:12 s3a://demo/.hive-staging_hive_2018-02-12_02-08-27_090_2945283769634970656-1
drwxrwxrwx - hdfs hdfs 0 2018-02-12 02:12 s3a://demo/.hive-staging_hive_2018-02-12_02-08-27_090_2945283769634970656-1/-ext-10000
-rw-rw-rw- 1 hdfs hdfs 38106 2018-02-12 02:09 s3a://demo/.hive-staging_hive_2018-02-12_02-08-27_090_2945283769634970656-1/-ext-10000/000000_0
-rw-rw-rw- 1 hdfs hdfs 6570 2018-02-12 02:09 s3a://demo/.hive-staging_hive_2018-02-12_02-08-27_090_2945283769634970656-1/-ext-10000/000001_0 Am I missing a config to set, or something like that?
... View more
Labels:
Apache Hive