Created on 10-19-2016 01:22 PM - edited 08-17-2019 08:45 AM
This article reviews the steps necessary to update the Description and Comment fields of Hive entities within Atlas. The 0.70 Atlas release displays the ‘description’ field and allows text searches on it, but the Atlas UI does not currently support manually entering those properties for a given data asset.
Topics examined in this article include:
In release 0.70, Atlas can monitor additions as well as changes to Hive tables and Hive columns. When Atlas identifies a new entry or a change, the appropriate metadata property is updated for that entity. One very cool aspect of Atlas is the ability to conduct either DSL or free-text searches on any properties set for the entity. Anyone trying to identify datasets to support a specific analytic activity will definitely appreciate the ability to search through all of the entities and quickly discover valuable data assets in the data lake without having to rely on tribal knowledge.
For this article we will look up a specific table by its fully qualified name and then assign a new description to the table. The full source code for the examples covered in this article is available on GitHub. The code for this example is written in Python, and there is a full set of instructions in the repository README.md file.
Now let’s assume that in our ‘HDP’ cluster within the ‘default’ database there exists a table named ‘drivers’. For this table, our objective is to change the ‘description’ property from its current value to a value of ‘I get my answers from HCC’. Entity property updates are made one at a time, so our first step is to collect the GUID for our target table.
As this article is about updating a property within a hive_table entity, we will limit the search to identifying a unique hive_table. The query values for this example are:
Property | Value used in this article | Comments on how to change the provided values for your cluster. |
Atlas server FQDN | server1.hdp | Use your Atlas Metadata Server's FQDN |
entityType | hive_table | Can be any valid Atlas Type |
database name | default | Specify your table's database name. |
table name | drivers | This can be any Hive Table whose metadata is already in Atlas. The table name you provide must already exist on your specified cluster. |
Cluster name | HDP | The name of your cluster |
An Atlas entity can be any of a variety of types. The beauty of this architecture is that the same search steps are available whether you are seeking a table, a Hive column, or some other Atlas-managed type. The format we will use for this search example is:
http://{Atlas server FQDN}:21000/api/atlas/entities?type={entityType}&property=qualifiedName&value={database name}.{table name}@{Cluster name}
So for our example, the exact REST query would be:
http://server1.hdp:21000/api/atlas/entities?type=hive_table&property=qualifiedName&value=default.dri...
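As a minimal sketch, the search URL can be assembled from the template and the values in the table above. The function name `build_search_url` and its parameters are illustrative and not taken from the article's GitHub repository; the sketch only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode


def build_search_url(atlas_host, entity_type, database, table, cluster, port=21000):
    """Build the Atlas entity search URL from the template in this article.

    Hypothetical helper: the name and signature are assumptions made for
    illustration only.
    """
    # The qualified name follows the {database}.{table}@{cluster} pattern.
    qualified_name = f"{database}.{table}@{cluster}"
    # safe="@" keeps the '@' readable; percent-encoding it as %40 is equivalent.
    query = urlencode(
        {"type": entity_type, "property": "qualifiedName", "value": qualified_name},
        safe="@",
    )
    return f"http://{atlas_host}:{port}/api/atlas/entities?{query}"


# Values from the table above; substitute the ones for your own cluster.
search_url = build_search_url("server1.hdp", "hive_table", "default", "drivers", "HDP")
print(search_url)
```

The resulting URL can then be fetched with any HTTP client. Your cluster may also require credentials, which this sketch leaves out.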
The full result from this REST query, shown below, contains the GUID necessary for the update along with all of the hive_table’s metadata:
{ "definition": { "id": { "id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_table", "version": 0 }, "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [ "TLC" ], "traits": { "TLC": { "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Struct", "typeName": "TLC", "values": {} } }, "typeName": "hive_table", "values": { "aliases": null, "columns": [ { "id": { "id": "1690ccc2-d7be-45af-becb-c6b360a1a30f", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_column", "version": 0 }, "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [], "traits": {}, "typeName": "hive_column", "values": { "comment": null, "description": null, "name": "driverid", "owner": "hive", "qualifiedName": "default.drivers.driverid@HDP", "table": { "id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_table", "version": 0 }, "type": "varchar(15)" } }, { "id": { "id": "249a7ce3-6b19-418e-9094-7d8a30bc596f", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_column", "version": 0 }, "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [ "CARRIER" ], "traits": { "CARRIER": { "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Struct", "typeName": "CARRIER", "values": {} } }, "typeName": "hive_column", "values": { "comment": null, "description": null, "name": "companyid", "owner": "hive", "qualifiedName": "default.drivers.companyid@HDP", "table": { "id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": 
"hive_table", "version": 0 }, "type": "varchar(15)" } }, { "id": { "id": "d3b9557a-5ad0-4585-a9af-e1fed24569fc", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_column", "version": 0 }, "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [], "traits": {}, "typeName": "hive_column", "values": { "comment": null, "description": null, "name": "customer", "owner": "hive", "qualifiedName": "default.drivers.customer@HDP", "table": { "id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_table", "version": 0 }, "type": "varchar(40)" } }, { "id": { "id": "143479a3-be79-4f04-b649-4a09b5429ace", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_column", "version": 0 }, "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [], "traits": {}, "typeName": "hive_column", "values": { "comment": null, "description": null, "name": "drivername", "owner": "hive", "qualifiedName": "default.drivers.drivername@HDP", "table": { "id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_table", "version": 0 }, "type": "varchar(75)" } }, { "id": { "id": "6c3123a9-0d09-490b-840d-6cc012ab69e0", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_column", "version": 0 }, "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [], "traits": {}, "typeName": "hive_column", "values": { "comment": null, "description": null, "name": "yearsdriving", "owner": "hive", "qualifiedName": "default.drivers.yearsdriving@HDP", "table": { "id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "jsonClass": 
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_table", "version": 0 }, "type": "int" } }, { "id": { "id": "a419ed9f-df56-41cc-90bc-1c00a4d3c428", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_column", "version": 0 }, "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [], "traits": {}, "typeName": "hive_column", "values": { "comment": null, "description": null, "name": "riskscore", "owner": "hive", "qualifiedName": "default.drivers.riskscore@HDP", "table": { "id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_table", "version": 0 }, "type": "varchar(5)" } } ], "comment": null, "createTime": "2016-10-11T17:11:11.000Z", "db": { "id": "332189cc-d994-44c2-8f87-29a28a471434", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_db", "version": 0 }, "description": "\"changeMe\"", "lastAccessTime": "2016-10-11T17:11:11.000Z", "name": "drivers", "owner": "hive", "parameters": { "COLUMN_STATS_ACCURATE": "{\"BASIC_STATS\":\"true\"}", "EXTERNAL": "TRUE", "numFiles": "1", "numRows": "4278", "rawDataSize": "1967880", "totalSize": "68597", "transient_lastDdlTime": "1476205880" }, "partitionKeys": null, "qualifiedName": "default.drivers@HDP", "retention": 0, "sd": { "id": { "id": "36166469-1014-4645-98a6-9df34b37a145", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_storagedesc", "version": 0 }, "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [], "traits": {}, "typeName": "hive_storagedesc", "values": { "bucketCols": null, "compressed": false, "inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat", "location": 
"hdfs://server1.hdp:8020/apps/hive/warehouse/drivers", "numBuckets": -1, "outputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat", "parameters": null, "qualifiedName": "default.drivers@HDP_storage", "serdeInfo": { "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Struct", "typeName": "hive_serde", "values": { "name": null, "parameters": { "serialization.format": "1" }, "serializationLib": "org.apache.hadoop.hive.ql.io.orc.OrcSerde" } }, "sortCols": null, "storedAsSubDirectories": false, "table": { "id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_table", "version": 0 } } }, "tableType": "EXTERNAL_TABLE", "temporary": false, "viewExpandedText": null, "viewOriginalText": null } }, "requestId": "qtp511473681-34831 - b088be5b-44e6-4a2c-bd4a-7beeb059cf4f"}
In the result set above, locate the "id" property value, which is the GUID, and the "description" property, which currently has the value "changeMe".
In this case we will use the definition.id.id value from the REST query results, ‘b78b5541-a205-4f9e-8b81-e20632a88ad5’, in our next REST query to update the property value.
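The lookup of definition.id.id and the current description can be sketched in Python as follows. The sample JSON is a heavily trimmed stand-in for the response shown earlier, and `extract_guid_and_description` is a hypothetical helper name, not something from the article's repository.

```python
import json

# Trimmed stand-in for the search response; only the fields read below are kept.
sample_response = r"""
{
  "definition": {
    "id": {"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "typeName": "hive_table"},
    "values": {"description": "\"changeMe\"", "name": "drivers"}
  }
}
"""


def extract_guid_and_description(response_text):
    """Return (guid, description) from an entity search response body."""
    entity = json.loads(response_text)["definition"]
    guid = entity["id"]["id"]  # the definition.id.id value
    description = entity["values"].get("description")
    return guid, description


guid, description = extract_guid_and_description(sample_response)
print(guid)
print(description)
```

Note that the stored description in this response carries its own embedded double quotes, so the parsed value is `"changeMe"` including the quote characters.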
Now that we have the GUID, it is time to update the ‘description’ property from ‘changeMe’ to ‘I get my answers from HCC’.
The update entity property REST command requires the GUID from the prior search step. To update the property, we will use the Atlas POST entity REST command, following the URL query format below, and include the string "I get my answers from HCC" in the POST message payload:
http://{Atlas server FQDN}:21000/api/atlas/entities/{GUID from prior search operation}?property={atlas property field name}
So to finish our example, with our payload containing the string "I get my answers from HCC", the actual query would be:
http://server1.hdp:21000/api/atlas/entities/b78b5541-a205-4f9e-8b81-e20632a88ad5?property=description
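Because the new value travels in the POST body rather than in the URL, a short sketch may help show where it goes. The helper name `build_update_request` is hypothetical, the `Content-Type` header is an assumption about what the server accepts, and the host is the Atlas server FQDN from the table earlier in this article. The sketch builds the request without sending it.

```python
import urllib.request


def build_update_request(atlas_host, guid, property_name, new_value, port=21000):
    """Build (but do not send) the POST request that updates one property.

    Hypothetical helper. The new value goes in the request body, not in
    the URL. The Content-Type header is an assumption.
    """
    url = f"http://{atlas_host}:{port}/api/atlas/entities/{guid}?property={property_name}"
    return urllib.request.Request(
        url,
        data=new_value.encode("utf-8"),  # payload: the new property value
        headers={"Content-Type": "application/json"},  # assumption
        method="POST",
    )


update_request = build_update_request(
    "server1.hdp",
    "b78b5541-a205-4f9e-8b81-e20632a88ad5",
    "description",
    "I get my answers from HCC",
)
print(update_request.get_method(), update_request.full_url)
print(update_request.data)

# To send it against a live Atlas server (credentials may be required):
# with urllib.request.urlopen(update_request) as response:
#     print(response.read().decode("utf-8"))
```

Whether the payload should include surrounding double quotes is not settled by this sketch. The stored description values shown in this article contain escaped quotes, which suggests the original payload included them.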
The result from the above command will be the current Metadata definition for our drivers table in JSON format as shown below:
{… "description": "\"I get my answers from HCC\"", "lastAccessTime": "2016-10-11T17:11:11.000Z", "name": "drivers", "owner": "hive", "parameters": { "COLUMN_STATS_ACCURATE": "{\"BASIC_STATS\":\"true\"}", "EXTERNAL": "TRUE", "numFiles": "1", "numRows": "4278", "rawDataSize": "1967880", "totalSize": "68597", "transient_lastDdlTime": "1476205880"}
Now let's go take a look at the Atlas UI, and check on the description for the drivers table. As we see in the screen print below, the new description property value has been successfully changed:
This article uses a simple property change example to illustrate the techniques necessary to modify the Atlas metadata for a given entity. After you have completely run through this example, some follow-on activities to experiment with include:
Created on 06-23-2017 11:54 AM
@mjohnson Thanks for the detailed explanation on updating entities. I have a question about the command you used to update the description of the entity.
The command you used to update the description doesn't contain the actual replacement string.
Do we need to add it to the command when executing it? Something like the below:
http://server1:21000/api/atlas/entities/b78b5541-a205-4f9e-8b81-e20632a88ad5?property=description:"I get my answers from HCC"
Thanks