How to delete lineage metadata in apache atlas?

Super Collaborator

Hi Guys,

I am using the Atlas-Ranger Sandbox machine, on which I have executed some Hive queries, and I am getting the lineage of the tables in the Atlas UI. However, there is a problem.

The first time I executed the Hive query, my input tables came from the "medical" database (the condition_info and patient_information tables), as shown in the diagram. Later we decided to take all input tables from the "EMR" database, so we executed the same query again and created "patient_cohort_table". Now the Atlas UI still shows lineage for the "medical" database too, even though I have deleted those tables from Hive.

In the lineage diagram, the user should only be able to see lineage for the EMR database and not for the medical database. To achieve this, we need to delete the lineage metadata from Apache Atlas.

How can I delete lineage metadata so that this lineage no longer appears in Apache Atlas?

Thanks in advance.

Please find the lineage diagram attached.

6989-lineage.png

1 ACCEPTED SOLUTION

Guru

@Manoj Dhake

There are several super types in Atlas that most of the existing types inherit from. Two key super types are Process and DataSet. The Process type has two fields that play a key role in Lineage tracking, Inputs and Outputs.

{"typeName":"Process","definition":{"enumTypes":[],"structTypes":[],"traitTypes":[],"classTypes":[{"superTypes":["Referenceable","Asset"],"hierarchicalMetaTypeName":"org.apache.atlas.typesystem.types.ClassType","typeName":"Process","typeDescription":null,"attributeDefinitions":[{"name":"inputs","dataTypeName":"array<DataSet>","multiplicity":"optional","isComposite":false,"isUnique":false,"isIndexable":true,"reverseAttributeName":null},{"name":"outputs","dataTypeName":"array<DataSet>","multiplicity":"optional","isComposite":false,"isUnique":false,"isIndexable":true,"reverseAttributeName":null}]}]},"requestId":"qtp1853177759-436 - 00d1cf83-1bc6-4b49-820f-d907e42c4c27"}

In your case, the "create table if" entities are Process types and possess the inputs and outputs attributes. The reason Atlas knows to connect the EMR.PATIENT... and EMR.CONDITION... entities to the first "create table if" entity is that EMR.PATIENT and EMR.CONDITION are both entities based on DataSet types that are referenced in the inputs field of the "create table if" entity. Similarly, the PATIENT360 entity is also of a DataSet type and is referenced in the outputs field of that same "create table if" entity. Here is a generic example with a Hive table:

{"requestId":"qtp1853177759-388 - a98ad750-6fd7-41e9-8fbd-4117c844f8d1","definition":{"jsonClass":"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference","id":{"jsonClass":"org.apache.atlas.typesystem.json.InstanceSerialization$_Id","id":"f118e893-ccca-4d37-9791-b33fb265d053","version":0,"typeName":"hive_process","state":"ACTIVE"},"typeName":"hive_process","values":{"queryId":"hive_20160816002633_af9920a4-3cca-461a-ab00-87c9454e5cba","name":"create table sample_11 as select * from sample_10 where salary > 60000","startTime":"2016-08-16T00:26:33.732Z","queryPlan":"{}","description":null,"operationType":"CREATETABLE_AS_SELECT","outputs":[{"jsonClass":"org.apache.atlas.typesystem.json.InstanceSerialization$_Id","id":"e7d3a765-662e-4303-ab32-251f22234382","version":0,"typeName":"DataSet","state":"ACTIVE"}],"endTime":"2016-08-16T00:26:39.424Z","recentQueries":["create table sample_11 as select * from sample_10 where salary > 60000"],"inputs":[{"jsonClass":"org.apache.atlas.typesystem.json.InstanceSerialization$_Id","id":"d24ee236-417e-4c12-ab0e-44bab7abb567","version":0,"typeName":"DataSet","state":"ACTIVE"}],"qualifiedName":"CREATETABLE_AS_SELECT:default.sample_10@sandbox->:default.sample_11@sandbox","queryText":"create table sample_11 as select * from sample_10 where salary > 60000","owner":null,"clusterName":"Sandbox","queryGraph":null,"userName":"admin"},"traitNames":[],"traits":{}}}

This is an entity based on the hive_process type (resulting from a create table statement). Notice that the inputs and outputs fields contain entity references. If those fields were cleared or modified to contain fewer references, the resulting lineage graph should change. Give it a try and respond with a comment if you have any follow-ups.
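
To see which entity references a process currently carries, something like the following Python sketch could help. It only parses a response shaped like the hive_process JSON above; the function name lineage_refs and the trimmed sample are made up for illustration, and other Atlas versions may return a different layout.

```python
import json

def lineage_refs(response_text):
    """Return (input_guids, output_guids) from an entity-definition response."""
    values = json.loads(response_text)["definition"]["values"]
    # Each reference is an Id block; its "id" field holds the referenced GUID.
    inputs = [ref["id"] for ref in values.get("inputs") or []]
    outputs = [ref["id"] for ref in values.get("outputs") or []]
    return inputs, outputs

# Trimmed copy of the hive_process response shown above (illustration only).
sample_response = json.dumps({
    "definition": {
        "typeName": "hive_process",
        "values": {
            "inputs": [
                {"id": "d24ee236-417e-4c12-ab0e-44bab7abb567", "typeName": "DataSet"}
            ],
            "outputs": [
                {"id": "e7d3a765-662e-4303-ab32-251f22234382", "typeName": "DataSet"}
            ],
        },
    }
})

input_guids, output_guids = lineage_refs(sample_response)
print("inputs: ", input_guids)
print("outputs:", output_guids)
```

The GUIDs it prints are the references that would have to disappear from the inputs and outputs fields for the corresponding edges to drop out of the lineage graph.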


5 REPLIES

Super Collaborator

Hi Vadim,

You are saying to delete the input and output entities, but how do I delete those using the REST API?

Is there a REST API available for that?

Guru

@Manoj Dhake

Try this:

curl -u admin:admin -d @{location of file}/data.json -X POST https://sandbox.hortonworks.com:21000/api/atlas/entities/{guid}

The payload (contents of the data.json file) should look something like this:

{
  "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
  "id": {
    "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
    "id": "f118e893-ccca-4d37-9791-b33fb265d053",
    "version": 0,
    "typeName": "hive_process",
    "state": "ACTIVE"
  },
  "typeName": "hive_process",
  "values": {
    "outputs": [],
    "inputs": []
  },
  "traitNames": [],
  "traits": {}
}

Basically, you just send the ID block of the target entity and then the values that you want to change. In this case, you send only the inputs and outputs values as empty arrays. That should clear those fields and remove the lineage graph. If you only want to remove some of the lineage, then remove only the entity references that you don't want to see in the lineage graph. Let me know how that works out.
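
If you would rather generate the payload than edit it by hand, here is a minimal Python sketch of the same idea. It assumes you already have the entity's definition block as a dict (shaped like the hive_process example earlier in this thread); build_lineage_update and drop_guids are hypothetical names, not part of any Atlas client. It only builds the JSON and makes no request, and whether Atlas replaces or merges the arrays may depend on the Atlas version.

```python
import json

def build_lineage_update(definition, drop_guids=None):
    """Build a partial-update payload for a process entity.

    drop_guids=None clears inputs and outputs entirely; otherwise only
    references whose GUID is in drop_guids are removed.
    """
    values = definition.get("values", {})

    def keep(refs):
        if drop_guids is None:
            return []
        return [ref for ref in (refs or []) if ref["id"] not in drop_guids]

    return {
        "jsonClass": definition["jsonClass"],
        "id": definition["id"],  # ID block of the target entity
        "typeName": definition["typeName"],
        "values": {
            "outputs": keep(values.get("outputs")),
            "inputs": keep(values.get("inputs")),
        },
        "traitNames": [],
        "traits": {},
    }

# Trimmed definition block copied from the hive_process example (illustration only).
ID_CLASS = "org.apache.atlas.typesystem.json.InstanceSerialization$_Id"
definition = {
    "jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
    "id": {"jsonClass": ID_CLASS, "id": "f118e893-ccca-4d37-9791-b33fb265d053",
           "version": 0, "typeName": "hive_process", "state": "ACTIVE"},
    "typeName": "hive_process",
    "values": {
        "inputs": [{"jsonClass": ID_CLASS, "id": "d24ee236-417e-4c12-ab0e-44bab7abb567",
                    "version": 0, "typeName": "DataSet", "state": "ACTIVE"}],
        "outputs": [{"jsonClass": ID_CLASS, "id": "e7d3a765-662e-4303-ab32-251f22234382",
                     "version": 0, "typeName": "DataSet", "state": "ACTIVE"}],
    },
}

clear_all = build_lineage_update(definition)
drop_input = build_lineage_update(definition, {"d24ee236-417e-4c12-ab0e-44bab7abb567"})
print(json.dumps(clear_all, indent=2))
```

The printed JSON can be saved as data.json and sent with the curl command above. Passing a set of GUIDs instead of None keeps every reference except the ones listed, which matches the "remove only some of the lineage" case.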


@Vadim Vaks

I'm trying to do the same thing with Atlas 0.8, but I can't delete entries within the inputs or outputs array with this method.

With the V2 API, the elements didn't change. With the V1 API, new elements are added even if I remove some from the inputs array. The inputs had two entries before the POST request; I posted a single input entry and it was added:

      "inputs": [
        {
          "guid": "688ed1ee-222c-4416-8bf4-ba107b7fbc2c",
          "typeName": "kafka_topic"
        },
        {
          "guid": "bf3784db-fa59-4803-ad41-c5653f242f6f",
          "typeName": "kafka_topic"
        },
        {
          "guid": "688ed1ee-222c-4416-8bf4-ba107b7fbc2c",
          "typeName": "kafka_topic"
        }
      ],

Please let me know how to remove elements from inputs/outputs with Atlas 0.8.

Thanks!

Super Collaborator

Thanks Vadim,

This works for me.

But suppose I want to clear all metadata, including tag metadata, Hive-related metadata, etc. Is that possible in Atlas?

I don't want to re-install Atlas; I only want to clear the metadata. I have configured a Berkeley database for storing the metadata. Do you know how to access this graph database?

Can we delete metadata by accessing this database directly?

How do I get access to it?

If you know, could you please send me the steps and any additional software required to access the graph database?

Thank you in advance.