Created on 10-26-2016 07:44 AM
The whole purpose of that list is to be able to search through all the entities contained within your data lake and identify specifically those entities which will lead your analysis. A Common complaint with the idea of the data lake is that you load too many files into the data lake and it becomes unwieldy. Atlas addresses this problem by providing powerful search tools to identify all of the Data entities located within the data lake. The purpose of this article is to explore the four Entity search techniques available within Atlas.
The following chart summarizes the capabilities for each of the four atlas entity search techniques explored in this article. The sub sections below in this article we’ll go into more detail on how to set up the search techniques.
Attribute | Entity Search | Qualified Name Search | DSL Search | Full Text Search |
Can identify multiple entities | Yes | No | Yes | Yes |
Supports filtering | No | Yes | Yes | Yes |
Free text | No | No | No | Yes |
Ability to search on sub-attributes | No | No | Yes | No |
Primary Value | Listing all Entities of a given type | Retrieve a specific entity record | Complex searches of Entities | Locating records primarily based on name, Comment and description fields |
To simplify this article we are going to focus on just the ‘hive_table’ type, though keep in mind that for all the search examples covered here, you can use any Atlas type.
To support the search examples first were going to need to gather the following pieces of information:
Configuration Property | Description | Where to find it |
Ranger user id | The user id used to log into the Ranger UI screen. This login will be for a Ranger Administrative user with a default id of ‘admin’ | |
Password | The matching password to the Ranger User id. For this article I will be using the Ranger default password ‘admin’ | |
Ranger Admin Server | The Administrative server for Ranger. For this article, I will be using the FQDN ‘server1.hdp’ and also assuming the default port 21000. | Go into Ambari, select the Ranger than QuickLinks. The host name for the Ranger UI is the proper value for this parameter |
GUID | A unique Entity identifier. | Found in many of this article’s search result responses. |
Fully Qualified Table Name | A unique identifier for Hive tables. | Is created by concatenating the following: {database name}.{table name}@{cluster Name}. So, if you had a database called ‘transports’ and a table named ‘drivers’ on a cluster named HDP, then the fully qualified table name would be ‘transports.drivers@HDP’. |
For all examples contained within this article, to avoid creating overly long result listings, you will often see a “…” marker in the middle of a list. When you see this know that my intent was to spare you excessive scrolling when reviewing the result set.
All of the examples are presented in this article using the curl command as that is commonly available in Unix (Linux & Mac) based systems. For Python examples, you can find more detailed examples through this articles matching GitHub repository.
The Atlas Entity Search technique is the simplest of all of those explored in this article. It’s entire purpose is to retrieve all Entities of the specified type with no additional filtering enabled.
To retrieve a JSON list containing all the entities you will use the REST API command:
curl -iv -u {Ranger userId:Password} -X GET http://{Ranger Admin Server}:21000/api/atlas/entities?type=hive_table
Now let’s take a look at an actual example:
curl -iv -u admin:admin -X GET http://server1:21000/api/atlas/entities?type=hive_table
In this example, we see as shown below the response contains two parts; (1) A header set containing an element count and RequestID, and (2) A list of GUIDs associated with all the Hive tables known to Atlas.
{ "count": 26, "requestId": "qtp1783047508-5870 - 867273a6-10e1-4c65-8da2-08a06b89d005", "results": [ "d3d637c5-df7e-4311-adc3-bcc6c4b81fb1", "cdbbd999-f789-4d2e-9127-b2443209b3b7", "848c05fa-f2d9-4482-8892-d4b4fc137ee6", … "03e38f24-577a-45ae-b67f-54a3e34f34ce", "4945f76b-6403-483a-b9aa-3161ce3e4bd6", "90d76c28-911e-425b-845f-5e1096eed3bb", "9571fb0e-52f5-4c16-a8f1-d2cf5138824c" ], "typeName": "hive_table" }
Notice in the example above we find that there are 26 hive tables. Now unfortunately this sort of output by itself is not useful, so this API should be considered as one component in the entity retrieval. To actually see the entity full details you would want to run the REST call below using one of the GUID from the above response:
curl -iv -u admin:admin -X GET http://{Ranger Admin Server}:21000/api/atlas/entities/{GUID}
Often we will only want to look at one end of the record. In these cases, we can use the atlas qualified names search to retrieve a single entity Record. In Atlas for Hive tables the qualified name represents the concatenation of the database name the table name and finally the cluster team.
curl -iv -u {Ranger userId:Password} -X GET http://{Ranger Admin Server}::21000/api/atlas/entities?type=hive_table&property=qualifiedName&value={Fully Qualified Table Name}
A result set from this query would cover all of the metadata captured by Atlas for the Fully Qualified Table Name as shown below (sorry showing it all so you will have to scroll):
{ "definition": { "id": { "id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_table", "version": 0 }, "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [ "TLC" ], "traits": { "TLC": { "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Struct", "typeName": "TLC", "values": {} } }, "typeName": "hive_table", "values": { "aliases": null, "columns": [ { "id": { "id": "1690ccc2-d7be-45af-becb-c6b360a1a30f", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_column", "version": 0 }, "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [], "traits": {}, "typeName": "hive_column", "values": { "comment": null, "description": null, "name": "driverid", "owner": "hive", "qualifiedName": "default.drivers.driverid@HDP", "table": { "id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_table", "version": 0 }, "type": "varchar(15)" } }, { "id": { "id": "249a7ce3-6b19-418e-9094-7d8a30bc596f", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_column", "version": 0 }, "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [ "CARRIER" ], "traits": { "CARRIER": { "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Struct", "typeName": "CARRIER", "values": {} } }, "typeName": "hive_column", "values": { "comment": null, "description": null, "name": "companyid", "owner": "hive", "qualifiedName": "default.drivers.companyid@HDP", "table": { "id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_table", "version": 0 }, "type": "varchar(15)" } }, { "id": { "id": "d3b9557a-5ad0-4585-a9af-e1fed24569fc", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_column", "version": 0 }, "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [], "traits": {}, "typeName": "hive_column", "values": { "comment": null, "description": null, "name": "customer", "owner": "hive", "qualifiedName": "default.drivers.customer@HDP", "table": { "id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_table", "version": 0 }, "type": "varchar(40)" } }, { "id": { "id": "143479a3-be79-4f04-b649-4a09b5429ace", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_column", "version": 0 }, "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [], "traits": {}, "typeName": "hive_column", "values": { "comment": null, "description": null, "name": "drivername", "owner": "hive", "qualifiedName": "default.drivers.drivername@HDP", "table": { "id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_table", "version": 0 }, "type": "varchar(75)" } }, { "id": { "id": "6c3123a9-0d09-490b-840d-6cc012ab69e0", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_column", "version": 0 }, "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [], "traits": {}, "typeName": "hive_column", "values": { "comment": null, "description": null, "name": "yearsdriving", "owner": "hive", "qualifiedName": "default.drivers.yearsdriving@HDP", "table": { "id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_table", "version": 0 }, "type": "int" } }, { "id": { "id": "a419ed9f-df56-41cc-90bc-1c00a4d3c428", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_column", "version": 0 }, "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [], "traits": {}, "typeName": "hive_column", "values": { "comment": null, "description": null, "name": "riskscore", "owner": "hive", "qualifiedName": "default.drivers.riskscore@HDP", "table": { "id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_table", "version": 0 }, "type": "varchar(5)" } } ], "comment": null, "createTime": "2016-10-11T17:11:11.000Z", "db": { "id": "332189cc-d994-44c2-8f87-29a28a471434", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_db", "version": 0 }, "description": "\"I get my answers from HCC\"", "lastAccessTime": "2016-10-11T17:11:11.000Z", "name": "drivers", "owner": "hive", "parameters": { "COLUMN_STATS_ACCURATE": "{\"BASIC_STATS\":\"true\"}", "EXTERNAL": "TRUE", "numFiles": "1", "numRows": "4278", "rawDataSize": "1967880", "totalSize": "68597", "transient_lastDdlTime": "1476205880" }, "partitionKeys": null, "qualifiedName": "default.drivers@HDP", "retention": 0, "sd": { "id": { "id": "36166469-1014-4645-98a6-9df34b37a145", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_storagedesc", "version": 0 }, "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [], "traits": {}, "typeName": "hive_storagedesc", "values": { "bucketCols": null, "compressed": false, "inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat", "location": "hdfs://server1.hdp:8020/apps/hive/warehouse/drivers", "numBuckets": -1, "outputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat", "parameters": null, "qualifiedName": "default.drivers@HDP_storage", "serdeInfo": { "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Struct", "typeName": "hive_serde", "values": { "name": null, "parameters": { "serialization.format": "1" }, "serializationLib": "org.apache.hadoop.hive.ql.io.orc.OrcSerde" } }, "sortCols": null, "storedAsSubDirectories": false, "table": { "id": "b78b5541-a205-4f9e-8b81-e20632a88ad5", "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", "state": "ACTIVE", "typeName": "hive_table", "version": 0 } } }, "tableType": "EXTERNAL_TABLE", "temporary": false, "viewExpandedText": null, "viewOriginalText": null } }, "requestId": "qtp1783047508-5870 - 4f0cc268-7fdc-49ac-a585-74f486ee3786"
You thing you will immediately notice is the output, like most JSON outputs is hierarchical. For qualified name searches as well as most of the others, it is only possible to search on the top levels. The one exception to this are Atlas Entity DSL searches which are covered in the next section.
The Atlas Entity DSL search is by far the most powerful of the options covered in this article. The DSL search enables selection of Entities based on a combination of properties, Atlas types, as well as sub properties within the entity definition. In addition, the DSL search allows to select only those properties desired for display so you don’t have to output monstrous documents.
This example is similar to the fully qualified table name example described earlier in this article. The primary difference, as shown below is the ability to search only on the table name and allow the data base and cluster name to stay as wildcards.
curl -iv -u {Ranger userId:Password} -X GET http://{Ranger Admin Server}:21000/api/atlas/discovery/search/dsl?query=hive_table+where+name='{TableName}’
To make the query above work it must be properly encoded. As shown below the ‘+’ character is used as the spacing between keywords.
{ "count": 7, "dataType": { "attributeDefinitions": [ { "dataTypeName": "hive_db", "isComposite": false, "isIndexable": false, "isUnique": false, "multiplicity": { "isUnique": false, "lower": 1, "upper": 1 }, "name": "db", "reverseAttributeName": null }, … }, "query": "hive_table where name='drivers'", "queryType": "dsl", "requestId": "qtp1783047508-19 - 79fcb168-d047-4b70-9129-adebba09b323", "results": [ { … }
In the sample output above, we see that the sample query identified seven entities where are name is equal to tablename and the atlas type is a hive table. In addition, in the result header we see a list of each of the attributes output as part of the result set. Starting with the JSON element “query”, we see the query and the query type used. And then at the very end of the results set is a section titled called results which will output the detail for each entity found much as we saw earlier in the qualified cable name search in one of the sections above.
There are several DSL search options which control things such as row limit, order by among others. This section Will show how they can be used through the use of the row limit option as shown below:
curl -iv -u {Ranger userId:Password} -X GET http://{Ranger Admin Server}:21000/api/atlas/discovery/search/dsl?query=hive_table+where+name='drivers' limit 3
In the example above, we see the account of entities founded and returned is equal to three which exactly matches the limit specified in the query shown above. The contents of the attribute definitions as well as the resulting Entity properties is the same as the other DSL searches covered in this article.
{ "count": 3, "dataType": { "attributeDefinitions": [ { "dataTypeName": "hive_db", "isComposite": false, "isIndexable": false, "isUnique": false, "multiplicity": { "isUnique": false, "lower": 1, "upper": 1 }, "name": "db", "reverseAttributeName": null },…
As the prior examples illustrate, The DSL search options can how put a lot of data, but much of it useless for your analyses. To constrain the types of properties returned the atlas DSL search has the ‘select’ keyword to limit the properties actually output. The following example illustrates how to invoke the ‘select’ option.
In the example below, the output is constrained to just the properties “tableType, temporary, retention, qualifiedName, description, name, owner, comment, createTime” and reduces the response document output size:
curl -iv -u admin:admin -X GET http://server1:21000/api/atlas/discovery/search/dsl?query=hive_table+where+name='drivers' select tableType,temporary,retention,qualifiedName,description,name,owner,comment,createTime limit 1
The result from the query above is shown in the listing below:
{ "count": 1, "dataType": { "attributeDefinitions": [ { "dataTypeName": "string", "isComposite": false, "isIndexable": false, "isUnique": false, "multiplicity": { "isUnique": false, "lower": 0, "upper": 1 }, "name": "tableType", "reverseAttributeName": null }, … }, "query": "hive_table where name='drivers' select tableType,temporary,retention,qualifiedName,description,name,owner,comment,createTime limit 1", "queryType": "dsl", "requestId": "qtp1783047508-21 - 347d7796-cce3-4eb6-8a40-59cedc07433a", "results": [ { "$typeName$": "__tempQueryResultStruct48", "comment": null, "createTime": "2016-10-11T16:50:03.000Z", "description": "\"Try this value again\"", "name": "drivers", "owner": "hive", "qualifiedName": "default.drivers@HDP", "retention": 0, "tableType": "EXTERNAL_TABLE", "temporary": false } ]}
In the example query above, only one entity was identified and as you can see in the results block only those property names specified after the select option art display rather than the long hierarchically deep output we’ve seen in prior examples.
As nice as limiting output to just the first-order fields may seem, often it is necessary to query on some arrays of sub-properties in the entity definition. One complex property often needed in data discovery queries in the ‘column’ property. If you want to search for all hive_tables where a specific field resides, possibly one used often as a foreign key, then you should check out this example:
curl -iv -u admin:admin -X GET http://server1:21000/api/atlas/discovery/search/dsl?query=hive_table%2C+columns+where+name%3D%27twee...
In the DSL query above, you will see the encoded string %2C (‘,’) following after the data type. That comma is indicating that we want to look at the values assigned to the property name immediately following…which in this case is the property named ‘columns’. Specifically, the example above is requesting all Hive tables in which there exists a column named ‘tweet_id’. The use of the ’+’ to separate keywords is just as it has been for the other dsl queries, however, for the query of the sub-property it must be entirely encoded as you can see in the example above.
The result from the above example is:
{ "count": 1, "dataType": { "attributeDefinitions": [ {… ], "hierarchicalMetaTypeName": "org.apache.atlas.typesystem.types.ClassType", "superTypes": [ "DataSet" ], "typeDescription": null, "typeName": "hive_column" }, "query": "hive_table, columns where name='tweet_id'", "queryType": "dsl", "requestId": "qtp1783047508-5868 - 3db60f55-8e7b-409b-9a6f-2abea20b8371", "results": [ { "$id$": { "$typeName$": "hive_column", "id": "b32fc0ab-2f66-4ad1-9728-ccd1d48dbf32", "state": "ACTIVE", "version": 0 },…
As we see in the output above, only 1 entity was found with a columname equal to ‘tweet_id’. You will also note in the example above that the output is the exact same for all of the hive_table types covered so far in this article. If you wanted to you could make this query more powerful to return a limit of values and only include a select set of properties in the results output.
Our last type of query will free search the top level entity properties for the query string passed. The example below for example, is looking for any top level property which contains the string ‘sku’.
curl –iv -u admin:admin -X GET http://server1:21000/api/atlas/discovery/search/fulltext?query=sku
The search results for this query option are similar to the entity list option explored at the start of this article in that it only lists the entitys’ GUID where the query is equal to the text supplied (‘sku’ in this example). One other interesting output property is the score. The ‘score’ property is an attempt to evaluate the search output quality as you find in many search engines. The output below shows a sample response from the query above:
{ "count": 4, "query": "sku", "queryType": "full-text", "requestId": "qtp1783047508-5868 - 5270285e-5bf4-43e5-82ed-46dc8901c461", "results": [ { "guid": "d37cfeab-afbc-41d0-8dda-71da29043bc3", "score": 0.7805047, "typeName": "hive_process" }, { "guid": "b9b7dae2-775b-4c95-82b6-9d0ab2097a2c", "score": 0.65042055, "typeName": "hive_process" }, { "guid": "c22f07fd-7522-4f37-97eb-96e3a9bc16bc", "score": 0.6008328, "typeName": "hive_column" }, { "guid": "f23097c3-d836-4543-99d6-b08f6cdcc97a", "score": 0.6008328, "typeName": "hive_column" } ]}
Created on 09-16-2018 12:41 AM
This article is really very useful but has a silly but confusing (specially for HDP newbies) error where all occurrences of "Ranger user id" and "Ranger Admin Server" must be replaced by "Atlas User ID" and "Atlas Admin Server" respectively.