Created 01-29-2019 04:08 PM
I want to be able to tag a list of Hive columns in Atlas (HDP 2.6.5) from a Python script. For this I need to find the guid of the Hive column I'm targetting. But I can't seem to tame the Atlas REST API to do that.
Getting all the Hive columns in the data lake is easy enough:
curl -u myaccount -i -H "Content-Type: application/json" -X GET 'https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column+where+__state=%27ACTIVE%27+select+qualifiedName,name,__guid'
That works (the number of Hive columns is limited to 100 by default):
{
"requestId": "pool-2-thread-6 - 7e8ec2ee-a94b-4e37-b851-57b1b3df6e29",
"query": "hive_column where __state='ACTIVE' select qualifiedName,name,__guid",
"queryType": "dsl",
"count": 100,
"results": [
{
"$typeName$": "__tempQueryResultStruct5",
"qualifiedName": "mytestdb.employee.name@CLUSTERNAME",
"__guid": "809c84d5-e065-45f4-81c8-fcdb7cf81560",
"name": "name"
},
{
"$typeName$": "__tempQueryResultStruct5",
"qualifiedName": "mytestdb.othertest.id@CLUSTERNAME",
"__guid": "04cd03fb-31d3-43e0-a75a-f35f2a60ad22",
"name": "id"
},
etcetera..
}
}Now let's pick just one hive_column:
curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column+where+name='id'"
Which results in a 500 server error.
Really? Okay, let's try selecting on the qualifiedName of the column:
curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column+where+qualifiedName='mytestdb.othertest.id@CLUSTERNAME'"
Don't think so: HTTP ERROR 500.
Okay, wise guy. How about I query the table, but then really ask for the qualifiedName of the column?
curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_table+where+qualifiedName='mytestdb.othertest.id@CLUSTERNAME'"
That works, but I don't get any data on the column.
Okay, let's have a look at the fields in the output. Let's ask for all the Hive columns again, but don't put in the select.
curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column&__state=%27ACTIVE%27"
I get results, but mixed in there are deleted states. For example:
{
"$typeName$": "hive_column",
"$id$": {
"id": "b3cc1db6-9c95-499c-9f5f-c457a4e4add6",
"$typeName$": "hive_column",
"version": 0,
"state": "DELETED"
},
"$systemAttributes$": {
"createdBy": "myaccount",
"modifiedBy": "myaccount",
"createdTime": "Wed Jun 27 15:07:24 CEST 2018",
"modifiedTime": "Mon Aug 06 16:30:30 CEST 2018"
},
"comment": null,
"qualifiedName": "mytestdb.othertest.id@CLUSTERNAME",
"type": "string",
"position": 30,
"owner": "myaccount",
"description": null,
"name": "id",
"table": {
"id": "161b4ef8-d5de-4e38-a76d-c030a71b730c",
"$typeName$": "hive_table",
"version": 0,
"state": "DELETED"
}
},Okay, let's remove DELETED states in all the fields.
curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_column&results.table.state=%27ACTIVE%27&__state=%27ACTIVE%27&__id.state=%27ACTIVE%27"
Nice try, but here are the DELETED states again.
I feel like I'm playing a chess match agains the Atlas REST API and I'm not winning. Is there any way to do a query against the Atlas REST API to just get the qualifiedName and guid of one Hive column?
Created 02-07-2019 02:28 PM
This is now the winning REST API query:
curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_table+where+qualifiedName%3D%27testdb.mytable.id@CLUSTERNAME%27"
It gives a list of all columns for a table, including deleted ones. In my Python code I pick the column with ACTIVE state.
Created 02-06-2019 03:55 PM
I've been another half day at it. I come to the conclusion that the best that I can get is a list of all columns in a table, including ones with a DELETED state.
Created 02-07-2019 02:28 PM
This is now the winning REST API query:
curl -u myaccount -i -H "Content-Type: application/json" -X GET "https://atlasnode.domain.com:21000/api/atlas/discovery/search/dsl?query=hive_table+where+qualifiedName%3D%27testdb.mytable.id@CLUSTERNAME%27"
It gives a list of all columns for a table, including deleted ones. In my Python code I pick the column with ACTIVE state.