Created on 02-06-2017 06:59 PM
Atlas provides powerful Tagging capabilities which allow Data Analysts to identify all data sets containing specific types of data. The Atlas UI itself provides a powerful Tag based search capability which requires no REST API interaction. However, for those of you out there who need to integrate Tag based search with your data discovery and governance activities, this posting is for you. Within this posting are instructions for using the Atlas REST API to retrieve entity data based on a Tag name.
Before getting too deep into the Atlas Tag search examples it is important to recognize that Atlas Tags are basically a form of an Atlas type. If you invoke the REST API command “/api/atlas/types”, the summary output lists the current set of user defined Atlas Tags (CUSTOMER & SALES) interspersed between standard Atlas types such as ‘hive_table’, ‘jms_topic’, etc., as shown below:
"count": 35,
"requestId": "qtp1177377518-81 - c7d4a853-02a0-4a1e-9b50-f7375f6e5f08",
"results": [
    "falcon_feed_replication",
    "falcon_process",
    "DataSet",
    "falcon_feed_creation",
    "file_action",
    "hive_order",
    "Process",
    "hive_table",
    "hive_db",
    …
    "Infrastructure",
    "CUSTOMER",
    "Asset",
    "storm_spout",
    "SALES",
    "hive_column",
    …
]
In the rest of the article we will expand on the Atlas types API to explore how we can perform two different types of Tag based searches. Before going too far it is important to note that the source code for the following examples is available through this repo.
In our first Tag search example our objective is to return a list of Atlas Data Entities which have the query Tag name assigned. In this example, we are going to search our Atlas instance (‘server1’, port 21000) for all Atlas entities with a tag named CUSTOMER. You will want to replace CUSTOMER with an existing tag on your system.
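Because tags are listed alongside every other type in the types output, a quick sanity check before searching is to confirm that the tag name is actually registered. The sketch below assumes the types response has already been parsed into a Python dictionary shaped like the output shown earlier; `tag_is_registered` and the abbreviated `sample_types` data are illustrative names, not part of Atlas.

```python
def tag_is_registered(types_response, tag_name):
    """Return True if tag_name appears in the 'results' list of a types response."""
    return tag_name in types_response.get("results", [])


# Abbreviated copy of the types output (illustrative subset only).
sample_types = {
    "results": ["hive_table", "hive_db", "CUSTOMER", "Asset", "SALES", "hive_column"],
}

print(tag_is_registered(sample_types, "CUSTOMER"))
print(tag_is_registered(sample_types, "MARKETING"))
```

Note that this check cannot tell a tag apart from a standard type such as hive_table; it only confirms that the name exists in the list.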
Our Atlas DSL query to find the CUSTOMER tag using the ‘curl’ command is as shown below:
curl -iv -u admin:admin -X GET "http://server1:21000/api/atlas/discovery/search/dsl?query=CUSTOMER"
The example above returns a list of the guids of the entities on the Atlas host ‘server1’ (port 21000) that have the Atlas Tag ‘CUSTOMER’ assigned. To run this query on your own cluster or on a sandbox, just substitute the Atlas Server Host URL, Atlas Server Port number, login information and your Tag name, and then invoke as shown above with curl (or with SimpleAtlasTagSearch.py, the Python example in the repo referenced at the end of this article).
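If you build the search URL in a script rather than typing it by hand, it is safer to percent-encode the tag name in case it ever contains characters that are not URL-safe. The following is a minimal Python 3 sketch of that idea; `build_dsl_search_url` is a hypothetical helper name, and the host and port are simply the values used in this article.

```python
from urllib.parse import quote


def build_dsl_search_url(host, port, tag_name):
    """Build the DSL search URL for a tag, percent-encoding the query value."""
    return "http://{0}:{1}/api/atlas/discovery/search/dsl?query={2}".format(
        host, port, quote(tag_name, safe="")
    )


print(build_dsl_search_url("server1", 21000, "CUSTOMER"))
```

For a plain tag name like CUSTOMER the result is identical to the URL used in the curl command above.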
The output from the curl query above on my cluster is shown below:
{
    "count": 2,
    "dataType": {
        "attributeDefinitions": [ … ],
        "typeDescription": null,
        "typeName": "__tempQueryResultStruct120"
    },
    "query": "CUSTOMER",
    "queryType": "dsl",
    "requestId": "qtp1177377518-81 - 624fc6b9-e3cc-4ab7-80ba-c6a57d6ef3fd",
    "results": [
        {
            "$typeName$": "__tempQueryResultStruct120",
            "instanceInfo": {
                "$typeName$": "__IdType",
                "guid": "806362dc-0709-47ca-af16-fac81184c130",
                "state": "ACTIVE",
                "typeName": "hive_table"
            },
            "traitDetails": null
        },
        {
            "$typeName$": "__tempQueryResultStruct120",
            "instanceInfo": {
                "$typeName$": "__IdType",
                "guid": "4138c963-b20d-4d10-b338-2c334202af43",
                "state": "ACTIVE",
                "typeName": "hive_table"
            },
            "traitDetails": null
        }
    ]
}
The results from this query can be thought of as having three sections:
For our purposes we are really only interested in the list of entities, so all you need to do is focus on extracting the important information from the .results jsonpath object in the returned JSON object. Looking at the results section we observe that two entities have the CUSTOMER tag assigned. The second of these has the guid ‘4138c963-b20d-4d10-b338-2c334202af43’ and, as its state shows, is an active entity (not deleted). We can now use the entity search capabilities to retrieve the actual entities as described in the next example within this article.
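Pulling the guids out of the results section takes only a few lines once the response has been parsed. The sketch below assumes a response shaped like the output from my cluster; `extract_guids` is an illustrative helper name, and `sample_response` is an abbreviated copy of that output with the fields we do not need left out.

```python
def extract_guids(search_response):
    """Return the guid of every entity listed in a DSL search response."""
    return [
        result["instanceInfo"]["guid"]
        for result in search_response.get("results", [])
    ]


# Abbreviated copy of the search output (only the fields used here).
sample_response = {
    "count": 2,
    "query": "CUSTOMER",
    "queryType": "dsl",
    "results": [
        {
            "instanceInfo": {
                "guid": "806362dc-0709-47ca-af16-fac81184c130",
                "state": "ACTIVE",
                "typeName": "hive_table",
            },
            "traitDetails": None,
        },
        {
            "instanceInfo": {
                "guid": "4138c963-b20d-4d10-b338-2c334202af43",
                "state": "ACTIVE",
                "typeName": "hive_table",
            },
            "traitDetails": None,
        },
    ],
}

for guid in extract_guids(sample_response):
    print(guid)
```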
The beauty of Example #1 is that we can build an entity list using a single REST API call. However, in the real world we will want access to details about the tagged entities. To accomplish this, we will need a programming interface such as Python, Java, Scala, bash, or whatever your favorite tool is, to pull the GUIDs and then perform entity searches.
For the purposes of this posting, we will use Python to illustrate how to perform more powerful Atlas Tag searches. The example below performs two Atlas REST API queries to build a JSON object containing the details, not just the guids, for the entities with our Tag assigned.
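import json
import requests

ATLAS_DOMAIN = "server1"  # replace with your Atlas server host
ATLAS_PORT = "21000"      # replace with your Atlas server port
TAG_NAME = "CUSTOMER"     # replace with an existing tag on your system

def atlasGET(restAPI):
    # Issue a GET against the Atlas REST API and return the parsed JSON
    url = "http://" + ATLAS_DOMAIN + ":" + ATLAS_PORT + restAPI
    r = requests.get(url, auth=("admin", "admin"))
    return json.loads(r.text)

# Query 1: the DSL search returns the entities that have the tag assigned
results = atlasGET("/api/atlas/discovery/search/dsl?query={0}".format(TAG_NAME))
entityGuidList = results['results']

# Query 2: fetch the full entity details for each guid
entityList = []
for entity in entityGuidList:
    guid = entity['instanceInfo']['guid']
    entityDetail = atlasGET("/api/atlas/entities/{0}".format(guid))
    entityList.append(entityDetail)

print(json.dumps(entityList, indent=4, sort_keys=True))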
The output from this script is now available for more sophisticated data governance and data discovery projects.
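If you want to exercise the same two-step flow without a live Atlas server, one option is to pass the HTTP call in as a function so it can be swapped for canned data. This is only a sketch under several assumptions: `collect_active_entities` and `fake_fetch` are hypothetical names, the "DELETED" state value and the shape of the canned entity response are made up for illustration, and skipping entities that are not ACTIVE is a design choice rather than something the script above does.

```python
def collect_active_entities(fetch, tag_name):
    """Run the tag search, then fetch details for each ACTIVE entity."""
    search = fetch("/api/atlas/discovery/search/dsl?query={0}".format(tag_name))
    details = []
    for result in search.get("results", []):
        info = result["instanceInfo"]
        if info.get("state") != "ACTIVE":
            continue  # skip entities that are not active
        details.append(fetch("/api/atlas/entities/{0}".format(info["guid"])))
    return details


def fake_fetch(rest_api):
    """Stand-in for a real HTTP call; returns canned, made-up data."""
    if rest_api.startswith("/api/atlas/discovery/search/dsl"):
        return {"results": [
            {"instanceInfo": {"guid": "guid-1", "state": "ACTIVE"}},
            {"instanceInfo": {"guid": "guid-2", "state": "DELETED"}},
        ]}
    # Hypothetical entity response: just echo the path that was requested.
    return {"requested": rest_api}


print(collect_active_entities(fake_fetch, "CUSTOMER"))
```

Against a real cluster you would pass a function such as atlasGET from the script above in place of fake_fetch.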
As powerful as both the Atlas UI and Atlas REST API Tag based searches are, there are some limitations to be aware of: