Community Articles

Find and share helpful community-sourced technical articles.
Labels (1)
avatar
Rising Star

Overview

The whole purpose of that list is to be able to search through all the entities contained within your data lake and identify specifically those entities which will lead your analysis. A Common complaint with the idea of the data lake is that you load too many files into the data lake and it becomes unwieldy. Atlas addresses this problem by providing powerful search tools to identify all of the Data entities located within the data lake. The purpose of this article is to explore the four Entity search techniques available within Atlas.

Atlas search options compared

The following chart summarizes the capabilities for each of the four atlas entity search techniques explored in this article. The sub sections below in this article we’ll go into more detail on how to set up the search techniques.

Attribute Entity Search Qualified Name Search DSL Search Full Text Search
Can identify multiple entities Yes No Yes Yes
Supports filtering No Yes Yes Yes
Free text No No No Yes
Ability to search on sub-attributes No No Yes No
Primary Value Listing all Entities of a given type Retrieve a specific entity record Complex searches of Entities Locating records primarily based on name, Comment and description fields

To simplify this article we are going to focus on just the ‘hive_table’ type, though keep in mind that for all the search examples covered here, you can use any Atlas type.

Preparations for the Search Examples

To support the search examples first were going to need to gather the following pieces of information:

Configuration Property Description Where to find it
Ranger user id The user id used to log into the Ranger UI screen. This login will be for a Ranger Administrative user with a default id of ‘admin’
Password The matching password to the Ranger User id. For this article I will be using the Ranger default password ‘admin’
Ranger Admin Server The Administrative server for Ranger. For this article, I will be using the FQDN ‘server1.hdp’ and also assuming the default port 21000. Go into Ambari, select the Ranger than QuickLinks. The host name for the Ranger UI is the proper value for this parameter
GUID A unique Entity identifier. Found in many of this article’s search result responses.
Fully Qualified Table Name A unique identifier for Hive tables. Is created by concatenating the following: {database name}.{table name}@{cluster Name}. So, if you had a database called ‘transports’ and a table named ‘drivers’ on a cluster named HDP, then the fully qualified table name would be ‘transports.drivers@HDP’.

For all examples contained within this article, to avoid creating overly long result listings, you will often see a “…” marker in the middle of a list. When you see this know that my intent was to spare you excessive scrolling when reviewing the result set.

All of the examples are presented in this article using the curl command as that is commonly available in Unix (Linux & Mac) based systems. For Python examples, you can find more detailed examples through this articles matching GitHub repository.

Atlas Entity Search Example

The Atlas Entity Search technique is the simplest of all of those explored in this article. It’s entire purpose is to retrieve all Entities of the specified type with no additional filtering enabled.

To retrieve a JSON list containing all the entities you will use the REST API command:

curl -iv -u {Ranger userId:Password} -X GET http://{Ranger
Admin Server}:21000/api/atlas/entities?type=hive_table

Now let’s take a look at an actual example:

curl -iv -u admin:admin -X GET http://server1:21000/api/atlas/entities?type=hive_table

In this example, we see as shown below the response contains two parts; (1) A header set containing an element count and RequestID, and (2) A list of GUIDs associated with all the Hive tables known to Atlas.

{ "count": 26,  
"requestId": "qtp1783047508-5870 - 867273a6-10e1-4c65-8da2-08a06b89d005",  
"results": [ 
"d3d637c5-df7e-4311-adc3-bcc6c4b81fb1",  
"cdbbd999-f789-4d2e-9127-b2443209b3b7",  
"848c05fa-f2d9-4482-8892-d4b4fc137ee6",   
…  
"03e38f24-577a-45ae-b67f-54a3e34f34ce", 
"4945f76b-6403-483a-b9aa-3161ce3e4bd6",  
"90d76c28-911e-425b-845f-5e1096eed3bb",  
"9571fb0e-52f5-4c16-a8f1-d2cf5138824c"  ],  
"typeName": "hive_table"
}  

Notice in the example above we find that there are 26 hive tables. Now unfortunately this sort of output by itself is not useful, so this API should be considered as one component in the entity retrieval. To actually see the entity full details you would want to run the REST call below using one of the GUID from the above response:

curl -iv -u admin:admin -X GET http://{Ranger
Admin Server}:21000/api/atlas/entities/{GUID}

Atlas Qualified Name Search

Often we will only want to look at one end of the record. In these cases, we can use the atlas qualified names search to retrieve a single entity Record. In Atlas for Hive tables the qualified name represents the concatenation of the database name the table name and finally the cluster team.

curl -iv -u {Ranger userId:Password} -X GET http://{Ranger
Admin Server}::21000/api/atlas/entities?type=hive_table&property=qualifiedName&value={Fully
Qualified Table Name}

A result set from this query would cover all of the metadata captured by Atlas for the Fully Qualified Table Name as shown below (sorry showing it all so you will have to scroll):

{ 
"definition": { 
"id": { 
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",  
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",  
"state": "ACTIVE",  
"typeName": "hive_table",  
"version": 0  },  
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",  
"traitNames": [ 
"TLC"  ],  
"traits": { 
"TLC": { 
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Struct", 
"typeName": "TLC",  
"values": {}  }  },  
"typeName": "hive_table",  
"values": { 
"aliases": null,   "columns":
[  { 
"id": { 
"id": "1690ccc2-d7be-45af-becb-c6b360a1a30f",  
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",  
"state": "ACTIVE",  
"typeName": "hive_column",  
"version": 0  }, 
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",  
"traitNames": [],  
"traits": {},  
"typeName": "hive_column",  
"values": { 
"comment": null,  
"description": null,    "name":
"driverid",  
"owner": "hive",  
"qualifiedName": "default.drivers.driverid@HDP",  
"table": { 
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",  
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",  
"state": "ACTIVE",  
"typeName": "hive_table",  
"version": 0   },  
"type": "varchar(15)"  }  },   { 
"id": { 
"id": "249a7ce3-6b19-418e-9094-7d8a30bc596f",  
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",  
"state": "ACTIVE",  
"typeName": "hive_column",  
"version": 0  }, 
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",  
"traitNames": [ 
"CARRIER"  ], 
"traits": { 
"CARRIER": { 
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Struct",  
"typeName": "CARRIER",  
"values": {} 
}   },  
"typeName": "hive_column",  
"values": { 
"comment": null,  
"description": null,  
"name": "companyid",  
"owner": "hive",  
"qualifiedName": "default.drivers.companyid@HDP",  
"table": { 
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",  
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", 
"state": "ACTIVE",  
"typeName": "hive_table",  
"version": 0 
},  
"type": "varchar(15)"  }  },   { 
"id": { 
"id": "d3b9557a-5ad0-4585-a9af-e1fed24569fc",  
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",  
"state": "ACTIVE",  
"typeName": "hive_column",  
"version": 0  }, 
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",  
"traitNames": [],  
"traits": {},  
"typeName": "hive_column",  
"values": { 
"comment": null,  
"description": null,    "name":
"customer",  
"owner": "hive",  
"qualifiedName": "default.drivers.customer@HDP",  
"table": { 
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",  
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",  
"state": "ACTIVE",  
"typeName": "hive_table",  
"version": 0 
},  
"type": "varchar(40)"  }  },   { 
"id": { 
"id": "143479a3-be79-4f04-b649-4a09b5429ace",  
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",  
"state": "ACTIVE",  
"typeName": "hive_column",  
"version": 0  }, 
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",  
"traitNames": [],  
"traits": {},  
"typeName": "hive_column",  
"values": { 
"comment": null,  
"description": null,  
"name": "drivername",  
"owner": "hive",  
"qualifiedName": "default.drivers.drivername@HDP",   "table": { 
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",  
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",  
"state": "ACTIVE",  
"typeName": "hive_table",  
"version": 0 
},  
"type": "varchar(75)"  }  },   { 
"id": { 
"id": "6c3123a9-0d09-490b-840d-6cc012ab69e0",  
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",  
"state": "ACTIVE",    "typeName":
"hive_column",  
"version": 0  }, 
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",  
"traitNames": [],    "traits": {},  
"typeName": "hive_column",  
"values": { 
"comment": null,  
"description": null,  
"name": "yearsdriving",    "owner": "hive",  
"qualifiedName": "default.drivers.yearsdriving@HDP", 
"table": { 
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",  
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id", 
"state": "ACTIVE",  
"typeName": "hive_table",  
"version": 0 
},  
"type": "int"  }  },   { 
"id": { 
"id": "a419ed9f-df56-41cc-90bc-1c00a4d3c428",  
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",  
"state": "ACTIVE",  
"typeName": "hive_column",  
"version": 0  }, 
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",  
"traitNames": [],  
"traits": {},  
"typeName": "hive_column",  
"values": { 
"comment": null,  
"description": null,  
"name": "riskscore",  
"owner": "hive",  
"qualifiedName": "default.drivers.riskscore@HDP",    "table": { 
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",  
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",  
"state": "ACTIVE",  
"typeName": "hive_table",  
"version": 0 
},  
"type": "varchar(5)"  }  }  ],  
"comment": null,  
"createTime": "2016-10-11T17:11:11.000Z",  
"db": { 
"id": "332189cc-d994-44c2-8f87-29a28a471434",  
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",  
"state": "ACTIVE",  
"typeName": "hive_db",  
"version": 0  },  
"description": "\"I get my answers from
HCC\"",  
"lastAccessTime": "2016-10-11T17:11:11.000Z",    "name": "drivers",  
"owner": "hive",  
"parameters": { 
"COLUMN_STATS_ACCURATE":
"{\"BASIC_STATS\":\"true\"}",  
"EXTERNAL": "TRUE",  
"numFiles": "1",  
"numRows": "4278",  
"rawDataSize": "1967880",  
"totalSize": "68597",  
"transient_lastDdlTime": "1476205880"  },  
"partitionKeys": null,  
"qualifiedName": "default.drivers@HDP",   "retention":
0,  
"sd": { 
"id": { 
"id": "36166469-1014-4645-98a6-9df34b37a145",  
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",  
"state": "ACTIVE",  
"typeName": "hive_storagedesc",  
"version": 0  },  
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",    "traitNames": [],  
"traits": {},  
"typeName": "hive_storagedesc",  
"values": { 
"bucketCols": null,  
"compressed": false,  
"inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat", 
"location":
"hdfs://server1.hdp:8020/apps/hive/warehouse/drivers",  
"numBuckets": -1,  
"outputFormat":
"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",  
"parameters": null,  
"qualifiedName": "default.drivers@HDP_storage",  
"serdeInfo": { 
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Struct",   "typeName":
"hive_serde",  
"values": { 
"name": null,  
"parameters": { 
"serialization.format": "1" 
 },  
"serializationLib":
"org.apache.hadoop.hive.ql.io.orc.OrcSerde" 
}  }, 
"sortCols": null,  
"storedAsSubDirectories": false,  
"table": { 
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",  
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",  
"state": "ACTIVE",    "typeName":
"hive_table",  
"version": 0  }  }  },  
"tableType": "EXTERNAL_TABLE",  
"temporary": false,  
"viewExpandedText": null,    "viewOriginalText": null  }  },  
"requestId": "qtp1783047508-5870 -
4f0cc268-7fdc-49ac-a585-74f486ee3786"

You thing you will immediately notice is the output, like most JSON outputs is hierarchical. For qualified name searches as well as most of the others, it is only possible to search on the top levels. The one exception to this are Atlas Entity DSL searches which are covered in the next section.

Atlas Entity DSL search Example

The Atlas Entity DSL search is by far the most powerful of the options covered in this article. The DSL search enables selection of Entities based on a combination of properties, Atlas types, as well as sub properties within the entity definition. In addition, the DSL search allows to select only those properties desired for display so you don’t have to output monstrous documents.

DSL Search Example #1: SIMPLE DSL SEARCH BASED ON TABLENAME ONLY

This example is similar to the fully qualified table name example described earlier in this article. The primary difference, as shown below is the ability to search only on the table name and allow the data base and cluster name to stay as wildcards.

curl -iv -u {Ranger userId:Password} -X GET http://{Ranger
Admin Server}:21000/api/atlas/discovery/search/dsl?query=hive_table+where+name='{TableName}’

To make the query above work it must be properly encoded. As shown below the ‘+’ character is used as the spacing between keywords.

{  "count":
7,  
"dataType": { 
"attributeDefinitions": [  { 
"dataTypeName": "hive_db",  
"isComposite": false,  
"isIndexable": false,   "isUnique":
false,  
"multiplicity": { 
"isUnique": false,  
"lower": 1,  
"upper": 1  },  
"name": "db",  
"reverseAttributeName": null  }, …  },   "query":
"hive_table where name='drivers'",  
"queryType": "dsl",  
"requestId": "qtp1783047508-19 -
79fcb168-d047-4b70-9129-adebba09b323",  
"results": [  { … }

In the sample output above, we see that the sample query identified seven entities where are name is equal to tablename and the atlas type is a hive table. In addition, in the result header we see a list of each of the attributes output as part of the result set. Starting with the JSON element “query”, we see the query and the query type used. And then at the very end of the results set is a section titled called results which will output the detail for each entity found much as we saw earlier in the qualified cable name search in one of the sections above.

DLS Search Example #2: Limiting the number of rows output

There are several DSL search options which control things such as row limit, order by among others. This section Will show how they can be used through the use of the row limit option as shown below:

curl -iv -u {Ranger userId:Password} -X GET http://{Ranger
Admin Server}:21000/api/atlas/discovery/search/dsl?query=hive_table+where+name='drivers'
limit 3

In the example above, we see the account of entities founded and returned is equal to three which exactly matches the limit specified in the query shown above. The contents of the attribute definitions as well as the resulting Entity properties is the same as the other DSL searches covered in this article.

{  "count":
3,  
"dataType": { 
"attributeDefinitions": [  { 
"dataTypeName": "hive_db",    "isComposite": false,  
"isIndexable": false,  
"isUnique": false,  
"multiplicity": { 
"isUnique": false,  
"lower": 1,  
"upper": 1   },  
"name": "db",  
"reverseAttributeName": null  },…

DSL Search Example #3: DSL SEARCH BASED ON TABLENAME ONLY DEMONSTRATING DISPLAYING ONLY SELECT PROPERTIES

As the prior examples illustrate, The DSL search options can how put a lot of data, but much of it useless for your analyses. To constrain the types of properties returned the atlas DSL search has the ‘select’ keyword to limit the properties actually output. The following example illustrates how to invoke the ‘select’ option.

In the example below, the output is constrained to just the properties “tableType, temporary, retention, qualifiedName, description, name, owner, comment, createTime” and reduces the response document output size:

curl -iv -u admin:admin -X GET
http://server1:21000/api/atlas/discovery/search/dsl?query=hive_table+where+name='drivers'
select
tableType,temporary,retention,qualifiedName,description,name,owner,comment,createTime
limit 1

The result from the query above is shown in the listing below:

{  "count":
1,  
"dataType": { 
"attributeDefinitions": [  { 
"dataTypeName": "string",  
"isComposite": false,  
"isIndexable": false,  
"isUnique": false,  
"multiplicity": {   "isUnique": false,  
"lower": 0,  
"upper": 1  },  
"name": "tableType",  
"reverseAttributeName": null  }, …  },   "query":
"hive_table where name='drivers' select tableType,temporary,retention,qualifiedName,description,name,owner,comment,createTime
limit 1",  
"queryType": "dsl",  
"requestId": "qtp1783047508-21 -
347d7796-cce3-4eb6-8a40-59cedc07433a",  
"results": [  { 
"$typeName$": "__tempQueryResultStruct48",  
"comment": null,  
"createTime": "2016-10-11T16:50:03.000Z",  
"description": "\"Try this value again\"", 
"name": "drivers",  
"owner": "hive",  
"qualifiedName": "default.drivers@HDP",  
"retention": 0,  
"tableType": "EXTERNAL_TABLE",  
"temporary": false  }  ]}

In the example query above, only one entity was identified and as you can see in the results block only those property names specified after the select option art display rather than the long hierarchically deep output we’ve seen in prior examples.

DSL Search Example #4: DSL SEARCH FOR ENTITIES CONTAINING COLUMN NAME

As nice as limiting output to just the first-order fields may seem, often it is necessary to query on some arrays of sub-properties in the entity definition. One complex property often needed in data discovery queries in the ‘column’ property. If you want to search for all hive_tables where a specific field resides, possibly one used often as a foreign key, then you should check out this example:

curl -iv -u admin:admin -X GET http://server1:21000/api/atlas/discovery/search/dsl?query=hive_table%2C+columns+where+name%3D%27twee...

In the DSL query above, you will see the encoded string %2C (‘,’) following after the data type. That comma is indicating that we want to look at the values assigned to the property name immediately following…which in this case is the property named ‘columns’. Specifically, the example above is requesting all Hive tables in which there exists a column named ‘tweet_id’. The use of the ’+’ to separate keywords is just as it has been for the other dsl queries, however, for the query of the sub-property it must be entirely encoded as you can see in the example above.

The result from the above example is:

{  "count":
1,  
"dataType": { 
"attributeDefinitions": [  {…  ],   "hierarchicalMetaTypeName":
"org.apache.atlas.typesystem.types.ClassType",  
"superTypes": [ 
"DataSet"  ],  
"typeDescription": null,  
"typeName": "hive_column"  },   "query":
"hive_table, columns where name='tweet_id'",  
"queryType": "dsl",  
"requestId": "qtp1783047508-5868 -
3db60f55-8e7b-409b-9a6f-2abea20b8371",  
"results": [  { 
"$id$": { 
"$typeName$": "hive_column",  
"id": "b32fc0ab-2f66-4ad1-9728-ccd1d48dbf32",  
"state": "ACTIVE",  
"version": 0  },… 

As we see in the output above, only 1 entity was found with a columname equal to ‘tweet_id’. You will also note in the example above that the output is the exact same for all of the hive_table types covered so far in this article. If you wanted to you could make this query more powerful to return a limit of values and only include a select set of properties in the results output.

Atlas Full Text Search Example

Our last type of query will free search the top level entity properties for the query string passed. The example below for example, is looking for any top level property which contains the string ‘sku’.

curl –iv -u admin:admin -X GET  http://server1:21000/api/atlas/discovery/search/fulltext?query=sku

The search results for this query option are similar to the entity list option explored at the start of this article in that it only lists the entitys’ GUID where the query is equal to the text supplied (‘sku’ in this example). One other interesting output property is the score. The ‘score’ property is an attempt to evaluate the search output quality as you find in many search engines. The output below shows a sample response from the query above:

{  "count":
4,   "query":
"sku",  
"queryType": "full-text",  
"requestId": "qtp1783047508-5868 -
5270285e-5bf4-43e5-82ed-46dc8901c461",  
"results": [  { 
"guid": "d37cfeab-afbc-41d0-8dda-71da29043bc3",  
"score": 0.7805047,   "typeName":
"hive_process"  },   { 
"guid": "b9b7dae2-775b-4c95-82b6-9d0ab2097a2c",  
"score": 0.65042055,  
"typeName": "hive_process"  },   { 
"guid": "c22f07fd-7522-4f37-97eb-96e3a9bc16bc",  
"score": 0.6008328,  
"typeName": "hive_column"  },   { 
"guid": "f23097c3-d836-4543-99d6-b08f6cdcc97a",  
"score": 0.6008328,  
"typeName": "hive_column"  }  ]}

Bibliography:

11,172 Views
Comments
avatar
Expert Contributor

This article is really very useful but has a silly but confusing (specially for HDP newbies) error where all occurrences of "Ranger user id" and "Ranger Admin Server" must be replaced by "Atlas User ID" and "Atlas Admin Server" respectively.