Created 01-29-2019 04:08 PM
I want to be able to tag a list of Hive columns in Atlas (HDP 2.6.5) from a Python script. For this I need to find the guid of the Hive column I'm targetting. But I can't seem to tame the Atlas REST API to do that.
Getting all the Hive columns in the data lake is easy enough:
curl -u myaccount -i -H "Content-Type: application/json" -X GET 'https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column+where+__state=%27ACTIVE%27+select+qualifiedName,name,__guid'
That works (the number of Hive columns is limited to 100 by default):
{ "requestId": "pool-2-thread-6 - 7e8ec2ee-a94b-4e37-b851-57b1b3df6e29", "query": "hive_column where __state='ACTIVE' select qualifiedName,name,__guid", "queryType": "dsl", "count": 100, "results": [ { "$typeName$": "__tempQueryResultStruct5", "qualifiedName": "mytestdb.employee.name@CLUSTERNAME", "__guid": "809c84d5-e065-45f4-81c8-fcdb7cf81560", "name": "name" }, { "$typeName$": "__tempQueryResultStruct5", "qualifiedName": "mytestdb.othertest.id@CLUSTERNAME", "__guid": "04cd03fb-31d3-43e0-a75a-f35f2a60ad22", "name": "id" }, etcetera.. } }
Now let's pick just one hive_column:
curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column+where+name='id'"
Which results in a 500 server error.
Really? Okay, let's try selecting on the qualifiedName of the column:
curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column+where+qualifiedName='mytestdb.othertest.id@CLUSTERNAME'"
Don't think so: HTTP ERROR 500.
Okay, wise guy. How about I query the table, but then really ask for the qualifiedName of the column?
curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_table+where+qualifiedName='mytestdb.othertest.id@CLUSTERNAME'"
That works, but I don't get any data on the column.
Okay, let's have a look at the fields in the output. Let's ask for all the Hive columns again, but don't put in the select.
curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column&__state=%27ACTIVE%27"
I get results, but mixed in there are deleted states. For example:
{ "$typeName$": "hive_column", "$id$": { "id": "b3cc1db6-9c95-499c-9f5f-c457a4e4add6", "$typeName$": "hive_column", "version": 0, "state": "DELETED" }, "$systemAttributes$": { "createdBy": "myaccount", "modifiedBy": "myaccount", "createdTime": "Wed Jun 27 15:07:24 CEST 2018", "modifiedTime": "Mon Aug 06 16:30:30 CEST 2018" }, "comment": null, "qualifiedName": "mytestdb.othertest.id@CLUSTERNAME", "type": "string", "position": 30, "owner": "myaccount", "description": null, "name": "id", "table": { "id": "161b4ef8-d5de-4e38-a76d-c030a71b730c", "$typeName$": "hive_table", "version": 0, "state": "DELETED" } },
Okay, let's remove DELETED states in all the fields.
curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column&results.table.state=%27ACTIVE%27&__state=%27ACTIVE%27&__id.state=%27ACTIVE%27"
Nice try, but here are the DELETED states again.
I feel like I'm playing a chess match agains the Atlas REST API and I'm not winning. Is there any way to do a query against the Atlas REST API to just get the qualifiedName and guid of one Hive column?
Created 02-07-2019 02:28 PM
This is now the winning REST API query:
curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_table+where+qualifiedName%3D%27testdb.mytable.id@CLUSTERNAME%27"
It gives a list of all columns for a table, including deleted ones. In my Python code I pick the column with ACTIVE state.
Created 02-06-2019 03:55 PM
I've been another half day at it. I come to the conclusion that the best that I can get is a list of all columns in a table, including ones with a DELETED state.
Created 02-07-2019 02:28 PM
This is now the winning REST API query:
curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_table+where+qualifiedName%3D%27testdb.mytable.id@CLUSTERNAME%27"
It gives a list of all columns for a table, including deleted ones. In my Python code I pick the column with ACTIVE state.