Member since
09-24-2015
32
Posts
60
Kudos Received
4
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1408 | 02-10-2017 07:33 PM
 | 1716 | 07-18-2016 02:14 PM
 | 4382 | 07-14-2016 06:09 PM
 | 18748 | 07-12-2016 07:59 PM
02-10-2017
07:33 PM
1 Kudo
There are two ways to set up an existing taxonomy in Atlas:
1.) If you have Hive, run the import-hive.sh utility. This utility scans all of the Hive tables and creates the matching Atlas entries. For HDP 2.5.3, you will find the utility at ./2.5.3.0-37/atlas/hook-bin/import-hive.sh
2.) You can also add entries and update properties using the REST API. One article describing this approach is available at: https://community.hortonworks.com/content/kbentry/74064/add-custom-properties-to-existing-atlas-types-in-s.html
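If you want to confirm that the import populated Atlas, a quick REST check works well. Below is a minimal Python sketch (the server1 host, port 21000 and admin:admin login are assumptions; substitute your own values) that lists the hive_table entities Atlas now knows about:

import json
import requests

ATLAS_URL = "http://server1:21000/api/atlas"   # assumption: your Atlas server and port
AUTH = ("admin", "admin")                      # assumption: your Atlas admin login

# List the GUIDs of every hive_table entity created by import-hive.sh
r = requests.get(ATLAS_URL + "/entities", params={"type": "hive_table"}, auth=AUTH)
print json.dumps(json.loads(r.text), indent=4)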
02-06-2017
06:59 PM
4 Kudos
Overview
Atlas provides powerful tagging capabilities which allow Data Analysts to identify all data sets containing specific types of data. The Atlas UI itself provides a powerful tag-based search capability which requires no REST API interaction. However, for those of you who need to integrate tag-based search with your data discovery and governance activities, this posting is for you. Within this posting are some instructions on how you can use the Atlas REST API to retrieve entity data based on a tag name.
Before getting too deep into the Atlas tag search examples, it is important to recognize that Atlas tags are basically a form of Atlas type. If you invoke the REST API command “/api/atlas/types”, the summary output below shows the current set of user-defined Atlas tags (CUSTOMER and SALES) interspersed between standard Atlas types such as ‘hive_table’, ‘jms_topic’, etc.:
"count": 35,
"requestId": "qtp1177377518-81 - c7d4a853-02a0-4a1e-9b50-f7375f6e5f08",
"results": [
"falcon_feed_replication",
"falcon_process",
"DataSet",
"falcon_feed_creation",
"file_action",
"hive_order",
"Process",
"hive_table",
"hive_db",
…
"Infrastructure",
"CUSTOMER",
"Asset",
"storm_spout",
"SALES",
"hive_column",
…
]
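The same listing can be pulled programmatically. Here is a minimal Python sketch (assuming an Atlas server at server1:21000 with admin:admin credentials, as used throughout this article) that retrieves the type names so you can spot your tags among them:

import json
import requests

# Assumptions: Atlas at server1:21000 with admin:admin login; adjust for your cluster.
r = requests.get("http://server1:21000/api/atlas/types", auth=("admin", "admin"))
types = json.loads(r.text)

print "total types:", types["count"]
for name in types["results"]:
    print name   # user-defined tags (e.g. CUSTOMER, SALES) appear alongside built-in types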
In the rest of the article we will expand on the Atlas types API to explore how we can perform two different types of tag-based searches. Before going too far, it is important to note that the source code for the following examples is available through this repo.
Tag Search Example #1: Simple REST-based Tag Search
In our first tag search example, our objective is to return a list of Atlas data entities which have the query tag name assigned. In this example, we are going to search our Atlas instance (on ‘server1’, port 21000) for all Atlas entities with a tag named CUSTOMER. You will want to replace CUSTOMER with an existing tag on your system.
Our Atlas DSL query to find the CUSTOMER tag using the ‘curl’ command is shown below:
curl -iv -u admin:admin -X GET http://server1:21000/api/atlas/discovery/search/dsl?query=CUSTOMER
The example above returns a list of the entity guids which have the Atlas tag ‘CUSTOMER’ assigned, from the Atlas host ‘server1’ on port 21000. To run this query on your own cluster or on a sandbox, just substitute the Atlas server host URL, Atlas server port number, login information and your tag name, and then invoke it as shown above with curl (or with SimpleAtlasTagSearch.py in the Python example in the referenced repo at the end of this article).
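For those who prefer to stay in Python, here is a minimal sketch of the same DSL tag search using the requests library, which also takes care of the URL encoding (the server1:21000 host, admin:admin credentials and CUSTOMER tag are the same assumptions used above; adjust for your cluster):

import json
import requests

ATLAS_HOST = "server1"      # assumption: your Atlas Metadata Server host
ATLAS_PORT = 21000          # assumption: default Atlas port
TAG_NAME = "CUSTOMER"       # replace with an existing tag on your system

# Call the Atlas DSL search endpoint; 'params' handles the URL encoding for us.
url = "http://{0}:{1}/api/atlas/discovery/search/dsl".format(ATLAS_HOST, ATLAS_PORT)
response = requests.get(url, params={"query": TAG_NAME}, auth=("admin", "admin"))

results = json.loads(response.text)
print "Entities tagged {0}: {1}".format(TAG_NAME, results.get("count"))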
An output from this REST API query on my cluster is shown below:
{
    "count": 2,
    "dataType": {
        "attributeDefinitions": [
            …
        ],
        "typeDescription": null,
        "typeName": "__tempQueryResultStruct120"
    },
    "query": "CUSTOMER",
    "queryType": "dsl",
    "requestId": "qtp1177377518-81 - 624fc6b9-e3cc-4ab7-80ba-c6a57d6ef3fd",
    "results": [
        {
            "$typeName$": "__tempQueryResultStruct120",
            "instanceInfo": {
                "$typeName$": "__IdType",
                "guid": "806362dc-0709-47ca-af16-fac81184c130",
                "state": "ACTIVE",
                "typeName": "hive_table"
            },
            "traitDetails": null
        },
        {
            "$typeName$": "__tempQueryResultStruct120",
            "instanceInfo": {
                "$typeName$": "__IdType",
                "guid": "4138c963-b20d-4d10-b338-2c334202af43",
                "state": "ACTIVE",
                "typeName": "hive_table"
            },
            "traitDetails": null
        }
    ]
}
The results from this query can be thought of as having 3 sections:
1. A results header, where you can find the results count
2. The returned data types
3. The results (a list of entity guids)
For our purposes we are really only interested in the list of entities, so all you need to do is focus on extracting the important information from the .results jsonpath object in the returned JSON object. Looking at the results section, we observe that two entities have the CUSTOMER tag assigned. Each entity located by the search has an assigned guid (for example ‘4138c963-b20d-4d10-b338-2c334202af43’) and we can see it is an active entity (not deleted). We can now use the entity search capabilities to retrieve the actual entity, as described in the next example within this article.
Example #2: Returning details on all entities based on Tag assignment
The beauty of Example #1 is that we can build an entity list using a single REST API call. However, in the real world we will want access to details about the tagged entities. To accomplish this, we will need a programming interface such as Python, Java, Scala, bash, or whatever your favorite tool is, to pull the GUIDs and then perform entity searches.
For the purposes of this posting, we will use Python to illustrate how to perform more powerful Atlas tag searches. The example below performs two Atlas REST API queries to build a JSON object containing the details, and not just the guids, for the entities with our tag assigned.

import json
import requests

ATLAS_DOMAIN = "server1"   # your Atlas Metadata Server host
ATLAS_PORT = "21000"       # your Atlas port
TAG_NAME = "CUSTOMER"      # the tag to search for

def atlasGET(restAPI):
    url = "http://" + ATLAS_DOMAIN + ":" + ATLAS_PORT + restAPI
    r = requests.get(url, auth=("admin", "admin"))
    return json.loads(r.text)

# Find the guids of all entities carrying the tag, then fetch each entity's details
results = atlasGET("/api/atlas/discovery/search/dsl?query={0}".format(TAG_NAME))
entityGuidList = results['results']
entityList = []
for entity in entityGuidList:
    guid = entity['instanceInfo']['guid']
    entityDetail = atlasGET("/api/atlas/entities/{0}".format(guid))
    entityList.append(entityDetail)

print json.dumps(entityList, indent=4, sort_keys=True)
The output from this script is now available for more sophisticated data governance and data discovery projects.
Atlas Tag Based Search Limitations
As powerful as both the Atlas UI and Atlas REST API tag-based searches are, there are some limitations to be aware of:
Atlas supports searching on only one tag at a time.
It is not possible to include other entity properties in tag searches.
The Atlas REST API used for tag searches can only return a list of GUIDs.
It is not possible to search on tag attributes.
12-23-2016
05:01 PM
Overview
Data Governance is unique to each organization, and every organization needs to track a different set of properties for their data assets. Fortunately, Atlas provides the flexibility to add new data asset properties to support your organization’s data governance requirements. The objective of this article is to describe the steps, using the Atlas REST API, to add new properties to your Atlas types.
Add a new Property for an existing Atlas Type
To simplify this article, we will focus on the 3 steps required to add, and enable for display, a custom property on the standard Atlas type ‘hive_table’. Following these steps, you should be able to modify the ‘hive_table’ Atlas type and add custom properties whose values can be entered, viewed in the Atlas UI, and searched. To make the article easier to read, the JSON file is shown in small chunks. To view the full JSON file, as well as other files used to research this article, check out this repo.
Step 1: Define the custom property JSON
The most important step of this process is properly defining
the JSON used to update your Atlas type. There are three parts to the JSON object we will pass to Atlas:
1. The header – contains the type identifier and some other meta information required by Atlas
2. The actual new property definition
3. The required existing Atlas type properties
Defining the Header
Frankly, the header is just a set of standard JSON elements which get repeated every time you define a new property. The only change we need to make to the header block shown below for each example is to get the ‘typeName’ JSON element properly set. In our case, as shown below, we want to add a property defined for all Hive tables, so we have set the typeName to ‘hive_table’.
{"enumTypes": [],"structTypes": [],"traitTypes": [],"classTypes": [
{"superTypes": ["DataSet"],"hierarchicalMetaTypeName": "org.apache.atlas.typesystem.types.ClassType","typeName": "hive_table","typeDescription": null, Keep in mind that all the JSON elements shown above pertain
to the Atlas type which we plan to modify. Define the new Atlas Property For this example, we are adding a property called ‘DataOwner’
which we intend to contain the owner of the data from a governance
perspective. For our purposes, we have
the following requirements:
Requirement | Attribute Property | Assignment
---|---|---
The property is searchable | isIndexable | true
The property will contain a string | datatype | string
Not all Hive tables will have an owner | multiplicity | optional
A Data owner can be assigned to multiple Hive tables | isUnique | false
Based on the above requirements, we end up with a property definition as shown below:
{"name": "DataOwner","dataTypeName": "string","multiplicity": "optional","isComposite": false,"isUnique": false,"isIndexable": true,"reverseAttributeName": null},
When defining Atlas properties, as shown in the full file, it is possible to define multiple properties at one time, so take your time and try to define all of the properties at once.
Make certain you include the existing Properties
An annoying thing about the Atlas v1 REST API is the need to include some of the other key properties in your JSON file. For this example, which was run on HDP 2.5.3, I had to define a bunch of properties. And every time you add a new custom property, it is necessary to include the previously added custom properties in your JSON as well. If you check out the JSON file used for this example, you will find a long list of properties which are required as of HDP 2.5.0.
Step 2: PUT the Atlas property update
We now have the full JSON request constructed with our new property requirements, so it is time to PUT the JSON file using the Atlas REST API v1. For the text of this article I am using ‘curl’ to make the example clearer, though in the associated repo Python is used to make life a little easier. To execute the PUT REST request, we will first need to collect the following data elements:
Data Element | Where to find it
---|---
Atlas Admin User Id | This is a defined ‘administrative’ user for the Atlas system. It is the same user id which you use to log into Atlas.
Atlas Password | The password associated with the Atlas Admin User Id
Atlas Server | The Atlas Metadata Server. This can be found by selecting the Atlas service in Ambari and then looking in the Summary tab.
Atlas Port | It is normally 21000. Check the Ambari Atlas configs for the specific port in your cluster.
update_hive_table_type.json | This is the name of the JSON file containing our new Atlas property definition
curl -iv -d @update_hive_table_type.json --header "Content-Type: application/json" -u {Atlas Admin User Id}:{Atlas Password} -X PUT http://{Atlas Server}:{Atlas Port}/api/atlas/types
If all is successful, then we should see a result like the one shown below. The only thing you will need to verify in the result (other than the absence of any reported errors) is that the “name” element is the same as the Atlas type to which you are adding a new custom property.
{
  "requestId": "qtp1177377518-235-fcf1c6f4-5993-49ac-8f5b-cdaafd01f2c0",
  "types": [ {
      "name": "hive_table"
  } ]
}
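If you would rather script this step, here is a minimal Python sketch of the same PUT call (the host, port, credentials and file name are assumptions matching the table above; adjust them for your cluster):

import json
import requests

ATLAS_SERVER = "server1"          # assumption: your Atlas Metadata Server
ATLAS_PORT = 21000                # assumption: default Atlas port
JSON_FILE = "update_hive_table_type.json"

# Read the type-update JSON built in Step 1 and PUT it to the Atlas types API.
with open(JSON_FILE) as f:
    payload = f.read()

url = "http://{0}:{1}/api/atlas/types".format(ATLAS_SERVER, ATLAS_PORT)
r = requests.put(url, data=payload,
                 headers={"Content-Type": "application/json"},
                 auth=("admin", "admin"))   # assumption: admin:admin credentials
print r.status_code
print json.dumps(json.loads(r.text), indent=4)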
However, if you are like me, then you will probably make a couple of mistakes along the way. To help you identify the root cause of your errors, here is a short list of errors and how to resolve them:
Error #1: Missing a necessary Atlas property for the Type
An error like the one shown below is encountered because the JSON with your new custom property is missing an existing property.
{ "error": "hive_table can't be updated - Old Attribute stats:numRows is missing",
  "stackTrace": "org.apache.atlas.typesystem.types.TypeUpdateException: hive_table can't be updated - Old Attribute stats:numRows is missing\n\tat
The solution to fix this problem is to add that property along
with your custom property in your JSON file.
If you are uncertain as to the exact definition of the property, then execute the Atlas REST API GET call shown below to list out the Atlas type whose properties you are currently modifying:
curl -u {Atlas Admin User id}:{Atlas password} -X GET http://{Atlas Server}:{Atlas Port}/api/atlas/types
Error #2: Unknown datatype
An error occurred like the one below:
{ "error": "Unknown datatype: XRAY",
  "stackTrace": "org.apache.atlas.typesystem.exception.TypeNotFoundException: Unknown
In this case, you have entered an incorrect Atlas data type. The allowed data types include: byte, short, int, long, float, double, biginteger, bigdecimal, date, string, and {custom types}. The {custom types} option enables you to reference another Atlas type. So, for example, if you decide to create a ‘SecurityRules’ Atlas data type which itself contains a list of properties, you would just insert the SecurityRules type name as the property’s data type.
Error #n: You incorrectly added a new Atlas property for a type and you need to delete it
This is the reason why you ALWAYS want to modify Atlas types and properties in a sandbox or development environment. DO NOT EXPERIMENT WITH CUSTOMIZING ATLAS TYPES IN PRODUCTION!!!!! If you ignore this standard approach in most organizations’ SDLC, your solution is to delete the Atlas service from within Ambari, re-add the service and then re-add all your data. Not fun.
Step 3: Check out the results
As we see above, our new custom Atlas ‘hive_table’ property is now visible in the Atlas UI for all tables. As the property was just defined for all ‘hive_table’ data assets, the value is null. Your next step, which is covered in the article Modify Atlas Entity properties using REST API commands, is to assign a value to the new property.
Bibliography
Atlas Rest API
Atlas Technical User Guide
Atlas REST API Search Techniques
Modify Atlas Entity properties using REST API commands
10-26-2016
07:44 AM
9 Kudos
Overview
The whole purpose of Atlas is to be able to search through all the entities contained within your data lake and identify specifically those entities which will drive your analysis. A common complaint about the idea of the data lake is that you load too many files into it and it becomes unwieldy. Atlas addresses this problem by providing powerful search tools to identify all of the data entities located within the data lake. The purpose of this article is to explore the four entity search techniques available within Atlas.
Atlas search options compared
The following chart summarizes the capabilities of each of the four Atlas entity search techniques explored in this article. The sub-sections below go into more detail on how to use each of the search techniques.
Attribute | Entity Search | Qualified Name Search | DSL Search | Full Text Search
---|---|---|---|---
Can identify multiple entities | Yes | No | Yes | Yes
Supports filtering | No | Yes | Yes | Yes
Free text | No | No | No | Yes
Ability to search on sub-attributes | No | No | Yes | No
Primary Value | Listing all Entities of a given type | Retrieve a specific entity record | Complex searches of Entities | Locating records primarily based on name, comment and description fields
To simplify this article we are going to focus on just the ‘hive_table’ type, though keep in mind that for all the search examples covered here, you can use any Atlas type.
Preparations for the Search Examples
To support the search examples, we are first going to need to gather the following pieces of information:
Configuration Property | Description | Where to find it
---|---|---
Atlas user id | The user id used to log into the Atlas UI. This login will be for an Atlas administrative user, with a default id of ‘admin’. | 
Password | The matching password for the Atlas user id. For this article I will be using the default password ‘admin’. | 
Atlas Admin Server | The Atlas Metadata Server. For this article, I will be using the FQDN ‘server1.hdp’ and also assuming the default port 21000. | Go into Ambari, select Atlas and then QuickLinks. The host name for the Atlas UI is the proper value for this parameter.
GUID | A unique entity identifier. | Found in many of this article’s search result responses.
Fully Qualified Table Name | A unique identifier for Hive tables. It is created by concatenating the following: {database name}.{table name}@{cluster name}. So, if you had a database called ‘transports’ and a table named ‘drivers’ on a cluster named HDP, then the fully qualified table name would be ‘transports.drivers@HDP’. | 
For all examples contained within this article, to avoid creating overly long result listings, you will often see a “…” marker in the middle of a list. When you see this, know that my intent was to spare you excessive scrolling when reviewing the result set. All of the examples are presented in this article using the curl command, as that is commonly available on Unix (Linux & Mac) based systems. For Python examples, you can find more detailed examples through this article’s matching GitHub repository.
Atlas Entity Search Example
The Atlas Entity Search technique is the simplest of all of those explored in this article. Its entire purpose is to retrieve all entities of the specified type with no additional filtering enabled. To retrieve a JSON list containing all the entities, you will
use the REST API command:
curl -iv -u {Atlas userId:Password} -X GET http://{Atlas Admin Server}:21000/api/atlas/entities?type=hive_table
Now let’s take a look at an actual example:
curl -iv -u admin:admin -X GET http://server1:21000/api/atlas/entities?type=hive_table
In this example, we see (as shown below) that the response contains two parts: (1) a header set containing an element count and requestId, and (2) a list of GUIDs associated with all the Hive tables known to Atlas.
{ "count": 26,
"requestId": "qtp1783047508-5870 - 867273a6-10e1-4c65-8da2-08a06b89d005",
"results": [
"d3d637c5-df7e-4311-adc3-bcc6c4b81fb1",
"cdbbd999-f789-4d2e-9127-b2443209b3b7",
"848c05fa-f2d9-4482-8892-d4b4fc137ee6",
…
"03e38f24-577a-45ae-b67f-54a3e34f34ce",
"4945f76b-6403-483a-b9aa-3161ce3e4bd6",
"90d76c28-911e-425b-845f-5e1096eed3bb",
"9571fb0e-52f5-4c16-a8f1-d2cf5138824c" ],
"typeName": "hive_table"
Notice in the example above we find that there are 26 Hive tables. Unfortunately, this sort of output by itself is not useful, so this API should be considered as just one component in the entity retrieval. To actually see the full entity details, you would want to run the REST call below using one of the GUIDs from the above response:
curl -iv -u admin:admin -X GET http://{Atlas Admin Server}:21000/api/atlas/entities/{GUID}
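Combining the two calls above in a short script is straightforward. The following is a minimal Python sketch (it assumes the same server1:21000 host and admin:admin login used in the examples; adjust for your cluster) that lists the hive_table GUIDs and then pulls the full details for the first one:

import json
import requests

ATLAS_URL = "http://server1:21000/api/atlas"   # assumption: Atlas server and port
AUTH = ("admin", "admin")                      # assumption: Atlas admin login

# Step 1: list the GUIDs of every hive_table entity known to Atlas
listing = requests.get(ATLAS_URL + "/entities", params={"type": "hive_table"}, auth=AUTH)
guids = json.loads(listing.text)["results"]
print "Found {0} hive_table entities".format(len(guids))

# Step 2: retrieve the full entity definition for one GUID
if guids:
    detail = requests.get(ATLAS_URL + "/entities/" + guids[0], auth=AUTH)
    print json.dumps(json.loads(detail.text), indent=4)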
Admin Server}:21000/api/atlas/entities/{GUID} Atlas Qualified Name Search Often we will only
want to look at one end of the record. In these cases, we can use the atlas
qualified names search to retrieve a single entity Record. In Atlas for Hive
tables the qualified name represents the concatenation of the database name the
table name and finally the cluster team. curl -iv -u {Ranger userId:Password} -X GET http://{Ranger
Admin Server}::21000/api/atlas/entities?type=hive_table&property=qualifiedName&value={Fully
Qualified Table Name} A result set from this query would cover all of the metadata
captured by Atlas for the Fully Qualified Table Name as shown below (sorry
showing it all so you will have to scroll): {
"definition": {
"id": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [
"TLC" ],
"traits": {
"TLC": {
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Struct",
"typeName": "TLC",
"values": {} } },
"typeName": "hive_table",
"values": {
"aliases": null, "columns":
[ {
"id": {
"id": "1690ccc2-d7be-45af-becb-c6b360a1a30f",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_column",
"values": {
"comment": null,
"description": null, "name":
"driverid",
"owner": "hive",
"qualifiedName": "default.drivers.driverid@HDP",
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0 },
"type": "varchar(15)" } }, {
"id": {
"id": "249a7ce3-6b19-418e-9094-7d8a30bc596f",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [
"CARRIER" ],
"traits": {
"CARRIER": {
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Struct",
"typeName": "CARRIER",
"values": {}
} },
"typeName": "hive_column",
"values": {
"comment": null,
"description": null,
"name": "companyid",
"owner": "hive",
"qualifiedName": "default.drivers.companyid@HDP",
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(15)" } }, {
"id": {
"id": "d3b9557a-5ad0-4585-a9af-e1fed24569fc",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_column",
"values": {
"comment": null,
"description": null, "name":
"customer",
"owner": "hive",
"qualifiedName": "default.drivers.customer@HDP",
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(40)" } }, {
"id": {
"id": "143479a3-be79-4f04-b649-4a09b5429ace",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_column",
"values": {
"comment": null,
"description": null,
"name": "drivername",
"owner": "hive",
"qualifiedName": "default.drivers.drivername@HDP", "table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(75)" } }, {
"id": {
"id": "6c3123a9-0d09-490b-840d-6cc012ab69e0",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE", "typeName":
"hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [], "traits": {},
"typeName": "hive_column",
"values": {
"comment": null,
"description": null,
"name": "yearsdriving", "owner": "hive",
"qualifiedName": "default.drivers.yearsdriving@HDP",
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "int" } }, {
"id": {
"id": "a419ed9f-df56-41cc-90bc-1c00a4d3c428",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_column",
"values": {
"comment": null,
"description": null,
"name": "riskscore",
"owner": "hive",
"qualifiedName": "default.drivers.riskscore@HDP", "table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(5)" } } ],
"comment": null,
"createTime": "2016-10-11T17:11:11.000Z",
"db": {
"id": "332189cc-d994-44c2-8f87-29a28a471434",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_db",
"version": 0 },
"description": "\"I get my answers from
HCC\"",
"lastAccessTime": "2016-10-11T17:11:11.000Z", "name": "drivers",
"owner": "hive",
"parameters": {
"COLUMN_STATS_ACCURATE":
"{\"BASIC_STATS\":\"true\"}",
"EXTERNAL": "TRUE",
"numFiles": "1",
"numRows": "4278",
"rawDataSize": "1967880",
"totalSize": "68597",
"transient_lastDdlTime": "1476205880" },
"partitionKeys": null,
"qualifiedName": "default.drivers@HDP", "retention":
0,
"sd": {
"id": {
"id": "36166469-1014-4645-98a6-9df34b37a145",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_storagedesc",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [],
"traits": {},
"typeName": "hive_storagedesc",
"values": {
"bucketCols": null,
"compressed": false,
"inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
"location":
"hdfs://server1.hdp:8020/apps/hive/warehouse/drivers",
"numBuckets": -1,
"outputFormat":
"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
"parameters": null,
"qualifiedName": "default.drivers@HDP_storage",
"serdeInfo": {
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Struct", "typeName":
"hive_serde",
"values": {
"name": null,
"parameters": {
"serialization.format": "1"
},
"serializationLib":
"org.apache.hadoop.hive.ql.io.orc.OrcSerde"
} },
"sortCols": null,
"storedAsSubDirectories": false,
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE", "typeName":
"hive_table",
"version": 0 } } },
"tableType": "EXTERNAL_TABLE",
"temporary": false,
"viewExpandedText": null, "viewOriginalText": null } },
"requestId": "qtp1783047508-5870 -
4f0cc268-7fdc-49ac-a585-74f486ee3786" You thing you will immediately notice is the output, like
most JSON outputs is hierarchical. For
qualified name searches as well as most of the others, it is only possible to
search on the top levels. The one
exception to this are Atlas Entity DSL searches which are covered in the next
section. Atlas Entity DSL search Example The Atlas Entity DSL search is by far the most powerful of
the options covered in this article. The DSL search enables selection of Entities
based on a combination of properties, Atlas types, as well as sub properties within
the entity definition. In addition, the DSL search allows to select only those
properties desired for display so you don’t have to output monstrous documents. DSL Search Example #1: SIMPLE DSL SEARCH BASED ON TABLENAME ONLY This example is similar to the fully qualified table name
example described earlier in this article. The primary difference, as shown below, is the ability to search only on the table name and allow the database and cluster name to stay as wildcards.
curl -iv -u {Atlas userId:Password} -X GET http://{Atlas Admin Server}:21000/api/atlas/discovery/search/dsl?query=hive_table+where+name='{TableName}'
To make the query above work, it must be properly encoded. As shown below, the ‘+’ character is used as the spacing between keywords.
{ "count":
7,
"dataType": {
"attributeDefinitions": [ {
"dataTypeName": "hive_db",
"isComposite": false,
"isIndexable": false, "isUnique":
false,
"multiplicity": {
"isUnique": false,
"lower": 1,
"upper": 1 },
"name": "db",
"reverseAttributeName": null }, … }, "query":
"hive_table where name='drivers'",
"queryType": "dsl",
"requestId": "qtp1783047508-19 -
79fcb168-d047-4b70-9129-adebba09b323",
"results": [ { … } In the sample output above, we see that the sample query
identified seven entities where are name is equal to tablename and the atlas
type is a hive table. In addition, in the result header we see a list of each
of the attributes output as part of the result set. Starting with the JSON
element “query”, we see the query and the query type used. And then at the very
end of the results set is a section titled called results which will output the
detail for each entity found much as we saw earlier in the qualified cable name
search in one of the sections above. DLS Search Example #2: Limiting the number of rows output There are several DSL
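Because these DSL strings must be URL-encoded, it can be easier to let a client library do the encoding for you. Below is a minimal Python sketch of the same DSL search using requests (the host, port, credentials and the ‘drivers’ table name are assumptions carried over from the earlier examples):

import json
import requests

ATLAS_URL = "http://server1:21000/api/atlas"   # assumption: Atlas server and port
AUTH = ("admin", "admin")                      # assumption: Atlas admin login

# The DSL string is passed un-encoded; requests URL-encodes the query parameter for us.
dsl = "hive_table where name='drivers'"
r = requests.get(ATLAS_URL + "/discovery/search/dsl", params={"query": dsl}, auth=AUTH)

response = json.loads(r.text)
print "Query '{0}' matched {1} entities".format(response["query"], response["count"])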
DSL Search Example #2: Limiting the number of rows output
There are several DSL search options which control things such as the row limit and order by, among others. This section will show how they can be used, through the use of the row limit option as shown below:
curl -iv -u {Atlas userId:Password} -X GET http://{Atlas Admin Server}:21000/api/atlas/discovery/search/dsl?query=hive_table+where+name='drivers'+limit+3
In the example above, we see the count of entities found and returned is equal to three, which exactly matches the limit specified in the query shown above. The contents of the attribute definitions as well as the resulting entity properties are the same as in the other DSL searches covered in this article.
{ "count":
3,
"dataType": {
"attributeDefinitions": [ {
"dataTypeName": "hive_db", "isComposite": false,
"isIndexable": false,
"isUnique": false,
"multiplicity": {
"isUnique": false,
"lower": 1,
"upper": 1 },
"name": "db",
"reverseAttributeName": null },… DSL Search Example #3: DSL SEARCH BASED ON TABLENAME ONLY DEMONSTRATING
DISPLAYING ONLY SELECT PROPERTIES As the prior examples
illustrate, The DSL search options can how put a lot of data, but much of it useless for your analyses. To constrain the
types of properties returned the atlas DSL search has the ‘select’ keyword to
limit the properties actually output. The following example illustrates how to
invoke the ‘select’ option. In the example below, the output is constrained to just the
properties “tableType, temporary, retention, qualifiedName, description, name, owner,
comment, createTime” and reduces the
response document output size: curl -iv -u admin:admin -X GET
http://server1:21000/api/atlas/discovery/search/dsl?query=hive_table+where+name='drivers'
select
tableType,temporary,retention,qualifiedName,description,name,owner,comment,createTime
limit 1 The result from the query above is shown in the listing
below: { "count":
1,
"dataType": {
"attributeDefinitions": [ {
"dataTypeName": "string",
"isComposite": false,
"isIndexable": false,
"isUnique": false,
"multiplicity": { "isUnique": false,
"lower": 0,
"upper": 1 },
"name": "tableType",
"reverseAttributeName": null }, … }, "query":
"hive_table where name='drivers' select tableType,temporary,retention,qualifiedName,description,name,owner,comment,createTime
limit 1",
"queryType": "dsl",
"requestId": "qtp1783047508-21 -
347d7796-cce3-4eb6-8a40-59cedc07433a",
"results": [ {
"$typeName$": "__tempQueryResultStruct48",
"comment": null,
"createTime": "2016-10-11T16:50:03.000Z",
"description": "\"Try this value again\"",
"name": "drivers",
"owner": "hive",
"qualifiedName": "default.drivers@HDP",
"retention": 0,
"tableType": "EXTERNAL_TABLE",
"temporary": false } ]} In the example query above, only one entity was identified
and as you can see in the results block only those property names specified
after the select option art display rather than the long hierarchically deep
output we’ve seen in prior examples. DSL Search Example #4: DSL SEARCH FOR ENTITIES CONTAINING COLUMN NAME As nice as limiting
output to just the first-order fields may seem, often it is necessary to query on some arrays of sub-properties in the entity definition. One complex property often needed in data discovery queries is the ‘columns’ property. If you want to search for all hive_tables in which a specific field resides, possibly one used often as a foreign key, then you should check out this example:
curl -iv -u admin:admin -X GET http://server1:21000/api/atlas/discovery/search/dsl?query=hive_table%2C+columns+where+name%3D%27tweet_id%27
In the DSL query above, you will see the encoded string %2C (‘,’) following the data type. That comma indicates that we want to look at the values assigned to the property name immediately following, which in this case is the property named ‘columns’. Specifically, the example above is requesting all Hive tables in which there exists a column named ‘tweet_id’. The use of the ‘+’ to separate keywords is just as it has been for the other DSL queries; however, for the query of the sub-property it must be entirely encoded, as you can see in the example above. The result from the above example is:
{ "count":
1,
"dataType": {
"attributeDefinitions": [ {… ], "hierarchicalMetaTypeName":
"org.apache.atlas.typesystem.types.ClassType",
"superTypes": [
"DataSet" ],
"typeDescription": null,
"typeName": "hive_column" }, "query":
"hive_table, columns where name='tweet_id'",
"queryType": "dsl",
"requestId": "qtp1783047508-5868 -
3db60f55-8e7b-409b-9a6f-2abea20b8371",
"results": [ {
"$id$": {
"$typeName$": "hive_column",
"id": "b32fc0ab-2f66-4ad1-9728-ccd1d48dbf32",
"state": "ACTIVE",
"version": 0 },… As we see in the output above, only 1 entity was found with
a columname equal to ‘tweet_id’. You
will also note in the example above that the output is the exact same for all
of the hive_table types covered so far in this article. If you wanted to you could make this query
more powerful to return a limit of values and only include a select set of properties
in the results output. Atlas Full Text Search Example Our last type of query will free search the top level entity
properties for the query string passed. The
example below for example, is looking for any top level property which contains
the string ‘sku’. curl –iv -u admin:admin -X GET http://server1:21000/api/atlas/discovery/search/fulltext?query=sku The search results for this query option are similar to the
entity list option explored at the start of this article in that it only lists
the entitys’ GUID where the query is equal to the text supplied (‘sku’ in this example). One other interesting output property is the
score. The ‘score’ property is an
attempt to evaluate the search output quality as you find in many search
engines. The output below shows a sample
response from the query above: { "count":
4, "query":
"sku",
"queryType": "full-text",
"requestId": "qtp1783047508-5868 -
5270285e-5bf4-43e5-82ed-46dc8901c461",
"results": [ {
"guid": "d37cfeab-afbc-41d0-8dda-71da29043bc3",
"score": 0.7805047, "typeName":
"hive_process" }, {
"guid": "b9b7dae2-775b-4c95-82b6-9d0ab2097a2c",
"score": 0.65042055,
"typeName": "hive_process" }, {
"guid": "c22f07fd-7522-4f37-97eb-96e3a9bc16bc",
"score": 0.6008328,
"typeName": "hive_column" }, {
"guid": "f23097c3-d836-4543-99d6-b08f6cdcc97a",
"score": 0.6008328,
"typeName": "hive_column" } ]} Bibliography:
Atlas Search
Examples in Python Atlas REST Search API
10-26-2016
01:12 AM
The article: Modify Atlas Entity properties using REST API commands contains a full description for how to update both the comment and description entity properties for Atlas managed hive_table types.
10-19-2016
01:22 PM
4 Kudos
Overview
This article reviews the steps necessary to update the Description and Comment fields of Hive entities within Atlas. The 0.70 Atlas release will display, and allow text searches on, the ‘description’ field, but the Atlas UI does not at this time support the ability to manually enter those properties for a given data asset.
Examined in this article are:
Searching for a hive_table entity
Updating a single property of the hive_table entity definition (“description”)
The Problem:
In release 0.70, Atlas has the ability to monitor additions as well as changes to Hive tables and Hive columns. When Atlas identifies a new entry or a change, the appropriate metadata property is updated for that entity. One very cool aspect of Atlas is the ability to conduct either DSL or free text searches on any properties set for the entity. Anyone trying to identify datasets to support a specific analytic activity will definitely appreciate the ability to search through all of the entities and quickly discover valuable data assets in the data lake without having to rely on tribal knowledge.
For this article we will update a specific table based on its fully qualified name and then assign a new description field to the table. The full source code for the examples covered in this article is on GitHub. The code for this example is written in Python, and there is a full set of instructions in the repository README.md file.
Locating the Entity whose properties require updating
Now let’s assume that in our ‘HDP’ cluster, within the ‘default’ database, there exists a table named ‘drivers’. For this table, our objective is to change the ‘description’ property from its current value to a value of ‘I get my answers from HCC’. Entity property updates are made one at a time, so our first step is to collect the GUID for our target table.
As this article is about the update of a property within a hive_table entity, we will limit the search coverage to identifying a unique hive_table. The query values for this example are:
Property | Value used in this article | Comments on how to change the provided values for your cluster
---|---|---
Atlas server FQDN | server1.hdp | Use your server's Atlas Metadata Server FQDN
entityType | hive_table | Can be any valid Atlas type
database name | default | Specify your table's database name.
table name | drivers | This can be any Hive table whose metadata is already in Atlas. The table name you provide must already exist on your specified cluster.
Cluster name | HDP | The name of your cluster
An Atlas entity can be any of a variety of types. The beauty of this architecture is that the same search steps are available whether seeking a table, a Hive column, or some other Atlas managed type. The format we will use for this search example is:
HTTP://{Atlas server FQDN}:21000/api/atlas/entities?type={entitytype}&property=qualifiedName&value={databasename}.{table name}@{Cluster name}
So for our example, the exact REST query would be:
http://server1.hdp:21000/api/atlas/entities?type=hive_table&property=qualifiedName&value=default.drivers@HDP
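As a convenience, here is a minimal Python sketch of that same lookup which pulls out the GUID and current description (it assumes the server1.hdp host, admin:admin credentials and the default.drivers@HDP table used in this article):

import json
import requests

ATLAS_URL = "http://server1.hdp:21000/api/atlas"   # assumption: Atlas Metadata Server
AUTH = ("admin", "admin")                          # assumption: Atlas admin login

# Look up the entity by its fully qualified name
params = {"type": "hive_table", "property": "qualifiedName", "value": "default.drivers@HDP"}
r = requests.get(ATLAS_URL + "/entities", params=params, auth=AUTH)
entity = json.loads(r.text)

guid = entity["definition"]["id"]["id"]
description = entity["definition"]["values"]["description"]
print "guid = {0}, current description = {1}".format(guid, description)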
The full result from this REST query, as shown below, will contain the guid necessary for the update along with all of the hive_table’s metadata information:
{
"definition": {
"id": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [
"TLC" ],
"traits": { "TLC": {
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Struct",
"typeName": "TLC",
"values": {} } },
"typeName": "hive_table", "values":
{
"aliases": null,
"columns": [ {
"id": {
"id": "1690ccc2-d7be-45af-becb-c6b360a1a30f",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_column",
"values": {
"comment": null,
"description": null,
"name": "driverid",
"owner": "hive",
"qualifiedName": "default.drivers.driverid@HDP", "table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(15)" } }, {
"id": {
"id": "249a7ce3-6b19-418e-9094-7d8a30bc596f",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE", "typeName":
"hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [ "CARRIER" ],
"traits": {
"CARRIER": {
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Struct",
"typeName": "CARRIER",
"values": {}
} },
"typeName": "hive_column",
"values": {
"comment": null,
"description": null,
"name": "companyid",
"owner": "hive",
"qualifiedName": "default.drivers.companyid@HDP",
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(15)" } }, {
"id": {
"id": "d3b9557a-5ad0-4585-a9af-e1fed24569fc",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_column", "values": {
"comment": null,
"description": null,
"name": "customer",
"owner": "hive",
"qualifiedName": "default.drivers.customer@HDP",
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(40)" } }, {
"id": {
"id": "143479a3-be79-4f04-b649-4a09b5429ace",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_column",
"values": {
"comment": null,
"description": null,
"name": "drivername",
"owner": "hive",
"qualifiedName": "default.drivers.drivername@HDP",
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(75)" } }, {
"id": {
"id": "6c3123a9-0d09-490b-840d-6cc012ab69e0",
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_column",
"values": {
"comment": null, "description": null,
"name": "yearsdriving",
"owner": "hive",
"qualifiedName": "default.drivers.yearsdriving@HDP",
"table": { "id":
"b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "int" } }, {
"id": {
"id": "a419ed9f-df56-41cc-90bc-1c00a4d3c428",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {}, "typeName":
"hive_column",
"values": {
"comment": null,
"description": null,
"name": "riskscore",
"owner": "hive", "qualifiedName":
"default.drivers.riskscore@HDP",
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(5)" } } ],
"comment": null,
"createTime": "2016-10-11T17:11:11.000Z",
"db": {
"id": "332189cc-d994-44c2-8f87-29a28a471434",
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_db",
"version": 0 }, "description":
"\"changeMe\"",
"lastAccessTime": "2016-10-11T17:11:11.000Z",
"name": "drivers",
"owner": "hive",
"parameters": {
"COLUMN_STATS_ACCURATE":
"{\"BASIC_STATS\":\"true\"}",
"EXTERNAL": "TRUE",
"numFiles": "1",
"numRows": "4278",
"rawDataSize": "1967880",
"totalSize": "68597",
"transient_lastDdlTime": "1476205880" },
"partitionKeys": null,
"qualifiedName": "default.drivers@HDP",
"retention": 0,
"sd": {
"id": {
"id": "36166469-1014-4645-98a6-9df34b37a145",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_storagedesc",
"version": 0 }, "jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_storagedesc",
"values": {
"bucketCols": null,
"compressed": false,
"inputFormat":
"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
"location":
"hdfs://server1.hdp:8020/apps/hive/warehouse/drivers",
"numBuckets": -1,
"outputFormat":
"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
"parameters": null,
"qualifiedName": "default.drivers@HDP_storage",
"serdeInfo": { "jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Struct",
"typeName": "hive_serde",
"values": {
"name": null,
"parameters": {
"serialization.format": "1"
},
"serializationLib":
"org.apache.hadoop.hive.ql.io.orc.OrcSerde"
} },
"sortCols": null,
"storedAsSubDirectories": false,
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0 } } },
"tableType": "EXTERNAL_TABLE",
"temporary": false,
"viewExpandedText": null,
"viewOriginalText": null } },
"requestId": "qtp511473681-34831 -
b088be5b-44e6-4a2c-bd4a-7beeb059cf4f"}
In the result set above, locate the "id" property value, which is the GUID, and the "description" property with the current value of "changeMe". In this case we will use the REST query result’s definition.id.id value of ‘b78b5541-a205-4f9e-8b81-e20632a88ad5’ to support our next REST query to update the property value. We can also see that the ‘description’ field currently has the value of “changeMe”.
Updating an Entity's Property value
Now that we have the GUID, it is time to update the ‘description’ property from ‘changeMe’ to ‘I get my answers from HCC’. The update entity property REST command requires the GUID from the prior search step. To update the property, we will POST to the Atlas entity REST command using the URL query format below, and include the string "I get my answers from HCC" in the POST message payload:
http://{Atlas server FQDN}:21000/api/atlas/entities/{GUID from prior search operation}?property={atlas property field name}
So to finish our example, with our payload containing the string "I get my answers from HCC", the actual query would be:
http://server1:21000/api/atlas/entities/b78b5541-a205-4f9e-8b81-e20632a88ad5?property=description
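A minimal Python sketch of this update (assuming the same host, credentials and GUID shown above; the property name and new value are passed exactly as described in the article) might look like this:

import requests

ATLAS_URL = "http://server1:21000/api/atlas"                 # assumption: Atlas server
AUTH = ("admin", "admin")                                    # assumption: Atlas admin login
GUID = "b78b5541-a205-4f9e-8b81-e20632a88ad5"                # GUID from the prior search

# POST the new value as the request payload; 'property' names the field to update.
url = "{0}/entities/{1}".format(ATLAS_URL, GUID)
r = requests.post(url, params={"property": "description"},
                  data='"I get my answers from HCC"',
                  headers={"Content-Type": "application/json"}, auth=AUTH)
print r.status_code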
The result from the above command will be the current
Metadata definition for our drivers table in JSON format as shown below:
{…
"description": "\"I get my answers from
HCC\"",
"lastAccessTime": "2016-10-11T17:11:11.000Z",
"name": "drivers",
"owner": "hive",
"parameters": {
"COLUMN_STATS_ACCURATE":
"{\"BASIC_STATS\":\"true\"}",
"EXTERNAL": "TRUE",
"numFiles": "1",
"numRows": "4278",
"rawDataSize": "1967880",
"totalSize": "68597",
"transient_lastDdlTime": "1476205880"}
Now let's go take a look at the Atlas UI, and check on the description for the drivers table. As we see in the screen print below, the new description property value has been successfully changed:
Next Steps:
This article takes a simple property change example to illustrate the techniques necessary to modify the Atlas metadata for a given entity. After you have completely run through this example, some follow-on activities to experiment with include:
Changing properties for different entity types such as Hive_column or any of the HBase types.
Attempt to change some of the other top level property fields.
Go through a list of Hive tables, changing each 'description' property with a value from an external source.
Resource Bibliography
Atlas Entity REST API:
Atlas Search grammar:
08-08-2016
12:06 PM
@saswati sahu The link provided above (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing) by @Sindhu provides documentation for all aspects of standard open source (Hortonworks) Hive indexing. Can you let me know what your specific concerns are so we can try to address them more specifically?
08-04-2016
10:37 PM
2 Kudos
Hive Streaming Compaction
This is the second part of the Hive Streaming Article series. In this article we will review the issues around compacting Hive Streaming files.
One of the results of ingesting data through Hive Streaming is the creation of many small 'delta' files. Left uncompacted, you run the risk of running into NameNode capacity problems. Fortunately, compaction functionality is part of Hive Streaming. The remainder of this article reviews design considerations as well as the commands necessary to enable and control compaction for your Hive tables.
Hive Compaction Design considerations
The Compaction process has a set of cleaner processes running in the
background during the ingest process looking for opportunities to
compact the delta files based on the rules you specify.
The first thing to keep in mind is that there are two forms of
Compaction; ‘minor’ and ‘major’. A ‘minor’ compaction will just
consolidate the delta files. This approach does not have to worry about
consolidating all of the delta files along with a large set of base
bucket files and is thus the least disruptive to the system resources.
‘major’ compaction consolidates all of the delta files just like the
‘minor’ compaction and in addition it consolidates the delta files with
the base to produce a very clean physical layout for the hive table.
However, major compactions can take minutes to hours and can consume a
lot of disk, network, memory and CPU resources, so they should be
invoked carefully.
To provide greater control over the compaction process and avoid
impacting other processes in addition to the compactor configuration
options available, it is also possible to invoke compaction
automatically by the cleaner threads or manually initiated when system
load is low.
The primary compaction configuration triggers to review when
implementing or tuning your compaction processes are:
hive.compactor.initiator.on
hive.compactor.cleaner.run.interval
hive.compactor.delta.num.threshold - Number of delta directories in
a table or partition that will trigger a minor compaction.
hive.compactor.delta.pct.threshold - Percentage (fractional) size of
the delta files relative to the base that will trigger a
major compaction. 1 = 100%, so the default 0.1 = 10%.
hive.compactor.abortedtxn.threshold - Number of aborted transactions
involving a given table or partition that will trigger a major
compaction
A Hive Compaction Manual example
In our example we have turned off automatic major compaction, as it should only run during low-load periods. We take a look at the delta files for our table
in HDFS and see that there are over 300 delta files and 5 base files.
[hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 13:17 /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500
-rw-r--r-- 3 mjohnson hdfs 482990 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500/bucket_00002
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500/bucket_00002_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2123501_2133500
-rw-r--r-- 3 mjohnson hdfs 482784 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2123501_2133500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2123501_2133500/bucket_00001_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2133501_2143500
-rw-r--r-- 3 mjohnson hdfs 482110 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2133501_2143500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2133501_2143500/bucket_00001_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2143501_2153500
-rw-r--r-- 3 mjohnson hdfs 476285 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2143501_2153500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2143501_2153500/bucket_00000_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2153501_2163500
A decision has been made to run the major compaction manually
during the evening lull, so we execute the “ALTER TABLE {tablename} COMPACT
‘major’” command to place the compaction job into the queue for
processing. A compaction resource management queue was defined with a
limited resource quota, so the compaction will not impact other jobs.
hive> alter table acidtest compact 'major';
Compaction enqueued.
OK
Time taken: 0.037 seconds
hive> show compactions;
OK
Database Table Partition Type State Worker Start Time
default acidtest NULL MAJOR working server2.hdp-26 1459100244000
Time taken: 0.019 seconds, Fetched: 2 row(s)
hive> show compactions;
OK
Database Table Partition Type State Worker Start Time
Time taken: 0.016 seconds, Fetched: 1 row(s)
hive>;
The outstanding table compaction jobs are visible by executing the
“SHOW COMPACTIONS” command, as illustrated in the example above; the
‘major’ compaction is also visible through the application history
log. After the ‘major’ compaction has completed, all of the delta files
available at the time the compaction was initiated will have been rolled up
into the ‘base’ files.
[hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 13:17 /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500
-rw-r--r-- 3 mjohnson hdfs 72704 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 436159 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 219572 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00002
[hive@server1 ~]$
The end result of this example is that 305 files consolidated down to just 5 files.
While 300 files will not impact NameNode performance, the consolidation will most
likely improve query performance, as the Hive engine will have fewer
files to scan to execute the query.
Bibliography
Hopefully, the example and source code supplied with this blog posting
are sufficient to get you started with Hive Streaming and avoid
potential problems. In addition to this blog posting some other
resources which are useful references include:
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
http://hortonworks.com/blog/adding-acid-to-apache-hive/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/hive-013-feature-transactions.html
http://www.slideshare.net/Hadoop_Summit/adding-acid-transactions-inserts-updates-a
08-04-2016
10:37 PM
7 Kudos
The Hive Streaming API enables near real-time data ingestion into Hive. This two-part posting reviews some of the design decisions necessary to produce a healthy Hive Streaming ingest process on which you can execute queries against the ingested data in near real-time.
Implementing a real-time Hive Streaming example
Implementing a Hive Streaming data feed requires that we make tradeoffs
between the load on the NameNode and the business SLAs for low-latency
data queries. Hive Streaming is able to work on a near real-time
basis through the creation of a new ‘delta’ file on a bucketed Hive
table with every table commit. For example, if a Hive Streaming client is
committing 1 row every second (it can actually commit at much faster
rates), then after 1 minute there would exist 60 new delta files added to
HDFS. After a day, there would be 86,400 new delta files, and 15.5
million after 6 months, just for your streamed data. With this many new
HDFS Hive delta files, the Hadoop cluster would eventually encounter NameNode problems.
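The quick arithmetic behind those numbers can be sketched as follows (one new delta file per commit is assumed, which matches the one-row-per-second scenario above):

# One commit per second, one new delta file per commit (assumption from the scenario above)
commits_per_second = 1
seconds_per_day = 60 * 60 * 24

delta_files_per_day = commits_per_second * seconds_per_day      # 86,400
delta_files_six_months = delta_files_per_day * 30 * 6           # ~15.5 million

print "per day: {0:,}  six months: {1:,}".format(delta_files_per_day, delta_files_six_months)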
When implementing your Hive Streaming solution, it is important to keep
in mind the following table, which lists 4 of the most significant
implementation concerns.
Implementation Concern: Need to reduce the latency between ingestion and availability for queries
Design Decisions to address concern: Decrease the number of rows per commit. Consequence – there will most likely be more delta files created, which will impact the NameNode. Reduce the column count on the streamed table to reduce the amount of data moved around during compaction.

Implementation Concern: NameNode reaching capacity
Design Decisions to address concern: Increase the number of rows per commit. Consequence – reduction in the number of delta files within HDFS at any point in time, but there could be a longer lag before the data is accessible to Hive queries. Based on resource availability, schedule regular major compactions. Consequence – if there is a lot of data to compact and limited cluster resources, it could interfere with the performance of other jobs. Consider reducing the bucket and/or partition counts defined on the table in order to reduce the number of temporary delta files stored in HDFS in between compactions.

Implementation Concern: Major compactions will consume too much of the cluster’s resources
Design Decisions to address concern: Assign the compaction to its own resource manager queue. Consider reducing the bucket and/or partition counts defined on the table in order to reduce the number of file moves required during compaction. Run compactions on a scheduled or manual basis during off-hours rather than allowing automatic execution.

Implementation Concern: Need to maximize per-thread throughput
Design Decisions to address concern: Increase the number of rows per commit operation. This recommendation may seem contradictory, but it is true that data throughput is maximized by committing more rows at one time, while data availability is maximized by decreasing the number of rows committed at one time.
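To put some numbers behind the first concern above, the short sketch below (my own illustration; the class and method names are hypothetical and not part of the posting's WriteHDFS.java) estimates how many delta directories a single streaming writer creates per day for a given commit size, assuming roughly one new delta directory per commit:
public class DeltaEstimator {
    // Rough approximation: assume each commit produces one new delta directory.
    public static long deltaDirsPerDay(long rowsPerSecond, long rowsPerCommit) {
        long rowsPerDay = rowsPerSecond * 86400L;                  // rows ingested in 24 hours
        return (rowsPerDay + rowsPerCommit - 1) / rowsPerCommit;   // commits (~ delta dirs) per day
    }
    public static void main(String[] args) {
        System.out.println(deltaDirsPerDay(1, 1));     // commit every row: 86,400 delta dirs/day
        System.out.println(deltaDirsPerDay(1, 1000));  // commit 1,000 rows at a time: 87/day
    }
}
With 1 row per second and a commit per row, the estimate matches the 86,400 delta files per day mentioned above; raising the commit size to 1,000 rows drops that to fewer than 100 directories per day, at the cost of data becoming visible to queries less often.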
A Hive streaming example with the Hive API
Let’s apply the above Hive Streaming design considerations in a simple
example that writes 10 columns of data at a rate of over 1 row per 100 ms,
and see how the above concepts apply in practice.
Hive Streaming Required Configuration Settings
The following table lists the settings available in Hive 1.2 (HDP 2.4)
which are required to enable Hive Streaming for your tables and which
support our example’s requirements.
Configuration | Value | Description
---|---|---
hive.support.concurrency | true | Enables the concurrency/lock support required for Hive transactions.
hive.enforce.bucketing | true | Ensures that Hive correctly assigns each streamed row to the correct Hive bucket.
hive.exec.dynamic.partition.mode | nonstrict | Allows for the dynamic creation of partitions. Be careful that the streamed partition values have a low cardinality.
hive.txn.manager | org.apache.hadoop.hive.ql.lockmgr.DbTxnManager | The class that processes the Hive transactions necessary to support Hive Streaming.
hive.compactor.initiator.on | true | Runs the compaction initiator and cleaner threads on this metastore instance.
hive.compactor.worker.threads | 1 | Controls the number of worker threads performing the compaction process.
hive.txn.timeout | Default 300 seconds | Sometimes it is necessary to increase this value to avoid a large number of aborted transactions when the cluster is heavily taxed.
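These properties are normally set cluster-wide through Ambari, so the sketch below is illustrative only; it simply mirrors the table above as client-side HiveConf overrides (useful for a quick local test), and the class name is hypothetical. Settings such as hive.compactor.initiator.on ultimately take effect on the metastore, not on the client.
import org.apache.hadoop.hive.conf.HiveConf;

public class StreamingConfSketch {
    public static HiveConf buildConf() {
        HiveConf conf = new HiveConf();
        conf.set("hive.support.concurrency", "true");
        conf.set("hive.enforce.bucketing", "true");
        conf.set("hive.exec.dynamic.partition.mode", "nonstrict");
        conf.set("hive.txn.manager", "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager");
        conf.set("hive.compactor.initiator.on", "true");   // effective on the metastore, shown here for completeness
        conf.set("hive.compactor.worker.threads", "1");
        return conf;
    }
}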
Creating the destination table for the Hive stream
Now that the system is configured to support Hive Streaming, it is time
to create a transactional Hive table. The Hive table requirements to
support Hive Streaming include:
- The Hive table must be stored in ORC format.
- The Hive table must be bucketed. If desired, the table may also support partitioning along with the bucket definition.
- There must be sufficient temporary disk space to support compaction operations.
- The table must be specified as ‘transactional’.
CREATE TABLE acidtest (a INT, b1 STRING, b2 STRING,
b3 STRING,b4 STRING,b5 STRING,
b6 STRING,b7 STRING,b8 STRING,
b9 STRING)
CLUSTERED BY(a) INTO 4 BUCKETS
STORED AS ORC
tblproperties("transactional"="true");
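If you would rather submit this DDL from a Java client than from the Hive CLI or beeline, a minimal JDBC sketch is shown below; the HiveServer2 URL and credentials are placeholders, and the class is not part of the posting's WriteHDFS.java example.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateAcidTable {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 URL and credentials; the Hive JDBC driver must be on the classpath.
        String url = "jdbc:hive2://server1:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement()) {
            stmt.execute("CREATE TABLE acidtest (a INT, b1 STRING, b2 STRING, b3 STRING, b4 STRING,"
                    + " b5 STRING, b6 STRING, b7 STRING, b8 STRING, b9 STRING)"
                    + " CLUSTERED BY(a) INTO 4 BUCKETS STORED AS ORC"
                    + " TBLPROPERTIES(\"transactional\"=\"true\")");
        }
    }
}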
In our example above, we see a CREATE TABLE nearly identical to most of
the other Hive tables we have already created, with one notable
exception: we are specifying a table property of "transactional" equal
to true. This property is necessary in order to tell Hive to utilize the
transactional features that support streaming in Hive.
The BUCKET count of 4 was selected to keep the number of delta bucket
files to a minimum yet still support table JOINs to other tables, such as
a CUSTOMER_INFORMATION table with 8 buckets. As 8 is a multiple of the 4
buckets defined here, this schema will also be able to support
real-time table BUCKET JOINs.
In different experiments performed while researching this blog, it was
found that dropping the "transactional" property introduces inconsistent
results. The problem is that in some instances it will appear to work, so
make absolutely certain that the "transactional" property is defined on
your Hive Streaming table.
Establishing the Hive Connections
Now that we have our destination Hive Streaming table created and
available to Hive, it is now time to start creating our Hive Streaming
client. To illustrate the concepts behind Hive Streaming, this blog
posting uses the base Hive Streaming API. Be aware, though, that both
Flume and Storm have components that support the Hive Streaming process
at a higher level.
The next sections will walk through the example; the full code and
support files are available in the github WriteHDFS.java for your download.
Our first step in creating the streaming client is to define the
HiveEndPoint connection to the thrift service.
HiveEndPoint("thrift://"+hostname+":9083", "HIVE_DATABASE","HIVE_TABLE_NAME", null);
The hostname and port to configure for this connection can be retrieved
from within Ambari by searching for the property value
‘hive.metastore.uris’, and should not be confused with the hiveserver2
hostname.
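Putting the endpoint definition together with an actual connection, a minimal sketch (database and table names taken from this example, host is a placeholder, error handling omitted) looks roughly like this:
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;

public class EndpointSketch {
    public static StreamingConnection connect(String metastoreHost) throws Exception {
        // The endpoint points at the Hive metastore (hive.metastore.uris), not at HiveServer2.
        HiveEndPoint endPt = new HiveEndPoint("thrift://" + metastoreHost + ":9083",
                "default", "acidtest", null);
        // 'true' asks the endpoint to create the partition if it does not already exist.
        return endPt.newConnection(true);
    }
}
The boolean passed to newConnection() simply asks the endpoint to create the partition if it does not already exist, which only matters for partitioned tables.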
Hive Streaming Writers
Once the endpoint connection is established, we need to assign it to our
streaming writer and specify the columns in our destination table. At the
time of this writing, there are 3 writer classes available:
- DelimitedInputWriter (the writer used for this blog’s example) – outputs the column values as a delimited string for which you can specify the desired delimiter.
- StrictJsonWriter – converts the JSON string passed to the writer into an object using JsonSerde.
- AbstractRecordWriter – a base class available for creating your own custom writers.
String[] colFlds = new String[]{"a", "b1", "b2", "b3", "b4","b5", "b6", "b7", "b8", "b9"};
writer = new DelimitedInputWriter(colFlds, ",", endPt);
The DelimitedInputWriter which we are using for this example takes 3 parameters:
- A String[] array containing a list of the column names defined to the table. You need to make certain that these column names match up with the CREATE TABLE columns or you could encounter errors.
- The string delimiter between each field. In this example, we are just using a comma-separated list.
- The HiveEndPoint connection reference
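For comparison, switching to the JSON-based writer is largely a one-line change. The sketch below is an assumption-laden illustration (the class wrapper is mine; ‘endPt’ is the endpoint created earlier):
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.RecordWriter;
import org.apache.hive.hcatalog.streaming.StrictJsonWriter;

public class JsonWriterSketch {
    // Each record handed to this writer is a JSON document whose keys match the table's
    // column names, e.g. {"a":1,"b1":"x", ... ,"b9":"x"}.
    public static RecordWriter jsonWriter(HiveEndPoint endPt) throws Exception {
        return new StrictJsonWriter(endPt);
    }
}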
Defining the Hive Transactions and writing to the stream
Our next step is to open a collection of batch IDs to use for the Hive
stream. Hive defaults to 1,000 open transactions at one time, though
this value may be changed via the hive property
‘hive.txn.max.open.batch’. Once there exists a pool of open
transactions, the ‘beginNextTransaction()’ operation is able to start a
transaction for Hive Streaming to write to. Keep in mind that attempting
to call ‘beginNextTransaction()’ when there are no remaining
transactions will produce an error, so it is useful to check that
‘remainingTransactions()’ is greater than 0 before trying to start a new
transaction.
TransactionBatch txnBatch =
connection.fetchTransactionBatch(maxBatchGroups, writer); // open a pool of 'maxBatchGroups' transactions
txnBatch.beginNextTransaction();
for (int i = 0; i < writeRows; ++i) {
writeStream(i, connection, writer, txnBatch, false);
if (currentBatchSize < 1) {
System.out.println(">"+threadId+
" Beginning Transaction Commit:"+i+" Transaction State:"
+txnBatch.getCurrentTransactionState());
writer.flush();    // flush buffered rows to HDFS
txnBatch.commit(); // committed rows become visible to Hive queries
if (txnBatch.remainingTransactions() > 0) {
System.out.println(">"+threadId+" ->"
+i+" txnBatch transactions remaining:"
+ txnBatch.remainingTransactions());
txnBatch.beginNextTransaction();
currentBatchSize = maxBatchSize;
} else {
System.out.println(">"+threadId
+" Refreshing the transaction group count");
txnBatch = connection.fetchTransactionBatch(maxBatchGroups,writer);
txnBatch.beginNextTransaction();
currentBatchSize = maxBatchSize;
}
}
--currentBatchSize;
}
writer.flush();
txnBatch.commit();
txnBatch.close();
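The writeStream() helper called inside the loop above lives in the downloadable WriteHDFS.java and is not reproduced in this posting, so the sketch below is only a guess at its core: build one comma-delimited row matching the ten columns of acidtest and hand it to the currently open transaction.
import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

public class WriteStreamSketch {
    // Hypothetical reconstruction of writeStream(): build one comma-delimited row for the
    // 10 columns of acidtest and write it into the currently open transaction. The
    // 'connection' parameter is unused here and only mirrors the call signature above.
    static void writeStream(int i, StreamingConnection connection, DelimitedInputWriter writer,
                            TransactionBatch txnBatch, boolean verbose) throws Exception {
        String row = i + ",b1-" + i + ",b2-" + i + ",b3-" + i + ",b4-" + i
                + ",b5-" + i + ",b6-" + i + ",b7-" + i + ",b8-" + i + ",b9-" + i;
        txnBatch.write(row.getBytes("UTF-8"));   // buffered until flush()/commit()
        if (verbose) {
            System.out.println("wrote row " + i + " state=" + txnBatch.getCurrentTransactionState());
        }
    }
}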
Some important implementation details to keep in mind include:
- The data is not available for queries until after the flush() and commit() operations have completed.
- When the fetchTransactionBatch operation executes, the system opens ‘maxBatchGroups’ transactions. In order to close them, make certain to execute txnBatch.close() at the end of the Hive Stream method, as shown at the bottom of the example.
Physical Data Changes as the Hive Streaming example executes
To really understand how Hive Streaming works, it is useful to monitor
the effect of each step of the flow on the HDFS physical storage layer.
There are two physical measures to look at as the Hive Streaming process
executes:
- Count and size of the delta files
- Number of rows accessible to Hive queries
Looking at the physical data flow, there are 4 high level steps
described below which should clarify how Hive Streaming impacts the
processing flow.
Step 1: After the table create and before the commit() operation
Right before the commit() operation and right after the flush(), you will
start to see the delta files appear in the warehouse subdirectory for
your Hive stream table.
[hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 17:12 /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 17:12 /apps/hive/warehouse/acidtest/delta_3238501_3239500
-rw-r--r-- 3 mjohnson hdfs 2792 2016-03-27 17:12 /apps/hive/warehouse/acidtest/delta_3238501_3239500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 8 2016-03-27 17:12 /apps/hive/warehouse/acidtest/delta_3238501_3239500/bucket_00000_flush_length
We can see that 2792 bytes of temporary data have been written to HDFS
for bucket #0. However, because the commit() has not yet run at this
point in time, executing a ‘select count(*) from acidtest’ will still
return a record count of 0.
Step 2: Immediately after the ‘commit()’ instruction
As soon as the commit has completed, we find that the
‘select count(*)…’ now returns 10,001 rows as being part of the
acidtest table. There are also 999 transactions remaining, so we do not
yet need to call the method connection.fetchTransactionBatch(maxBatchGroups, writer);.
Step 3: ‘maxBatchGroups’ commit() operations have completed, and no more open transactions remain
Now that the maxBatchGroups (generally 1,000) commit() operations have
completed and there are no more remainingTransactions, it is now
necessary to pull an additional group of Transaction batches by calling
‘connection.fetchTransactionBatch(maxBatchGroups, writer);’ again.
If we execute the command ‘SHOW TRANSACTIONS;’ before and after calling
‘fetchTransactionBatch’ we would see that the total number of OPEN
transactions would increase by roughly the amount of the
‘maxBatchGroups’ variable value.
As the stream continues to ingest data, you will notice that the simple
delta file list displayed below now has duplicated references to the
same bucket (bucket #0).
[hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 17:35 /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3296501_3297500
-rw-r--r-- 3 mjohnson hdfs 1521 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3296501_3297500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 8 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3296501_3297500/bucket_00000_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3297501_3298500
-rw-r--r-- 3 mjohnson hdfs 1545 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3297501_3298500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 8 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3297501_3298500/bucket_00001_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3298501_3299500
-rw-r--r-- 3 mjohnson hdfs 1518 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3298501_3299500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 8 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3298501_3299500/bucket_00000_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3299501_3300500
-rw-r--r-- 3 mjohnson hdfs 1519 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3299501_3300500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 8 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3299501_3300500/bucket_00000_flush_length
As we see in the file listing above, there are 3 delta files for
bucket #0 within the open transaction range 3296501-3300500. This is
because there were multiple commits assigning values to bucket #0 for
the open transaction range. If we increased the number of rows included
in each commit, as mentioned earlier in the Implementation section of
this blog, or experimented with the maxBatchGroups value, we might
actually see fewer of these small temporary delta files.
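If you would rather watch this behavior from code than with repeated ‘hadoop fs -ls’ commands, a small sketch using the Hadoop FileSystem API can count the delta and base directories as the stream runs; the table path is the one used throughout this example and the class name is mine.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeltaDirCounter {
    public static void main(String[] args) throws Exception {
        Path table = new Path("/apps/hive/warehouse/acidtest");   // table path from this example
        FileSystem fs = FileSystem.get(new Configuration());
        int deltas = 0, bases = 0;
        for (FileStatus status : fs.listStatus(table)) {
            String name = status.getPath().getName();
            if (name.startsWith("delta_")) {
                deltas++;              // un-compacted streaming commits
            } else if (name.startsWith("base_")) {
                bases++;               // directories produced by major compaction
            }
        }
        System.out.println("delta dirs: " + deltas + ", base dirs: " + bases);
    }
}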
The good news is that while the Hive Streaming commit operations are
adding new delta files, the compaction functionality is also running in
the background, trying to consolidate these files for more efficient
NameNode operations and queries.
Step 4: Completion of the Hive Stream process
In many instances, the Hive Stream ingestion process never actually
ends. New delta files are continually getting created in HDFS by the
commit() step and the compaction processes are continually consolidating
the delta files. As each transaction gets committed, it gets closed from
the TRANSACTIONS list.
In a healthy system, you will find that over time the number of
transactions reported by the command ‘SHOW TRANSACTIONS’ should be less
than the value you have assigned to ‘maxBatchGroups’. If the number is
greater than ‘maxBatchGroups’, then it is possible that, for one of
the transactional tables, the ingest process was stopped without
executing a ‘txnBatch.close();’ operation, which returns the unused
transactions.
Hive Streaming Compaction
Hive Compaction Design considerations
The compaction process has a set of cleaner processes running in the
background during the ingest process looking for opportunities to
compact the delta files based on the rules you specify.
The first thing to keep in mind is that there are two forms of
compaction: ‘minor’ and ‘major’. A ‘minor’ compaction will just
consolidate the delta files. This approach does not have to worry about
consolidating all of the delta files along with a large set of base
bucket files and is thus the least disruptive to system resources. A
‘major’ compaction consolidates all of the delta files just like the
‘minor’ compaction, and in addition it consolidates the delta files with
the base to produce a very clean physical layout for the Hive table.
However, major compactions can take minutes to hours and can consume a
lot of disk, network, memory and CPU resources, so they should be
invoked carefully.
To provide greater control over the compaction process and avoid
impacting other workloads, compaction can either be triggered
automatically by the cleaner threads, based on the compactor
configuration options, or initiated manually when system load is low.
The primary compaction configuration triggers to review when
implementing or tuning your compaction processes are:
- hive.compactor.initiator.on - Whether the compaction initiator and cleaner threads run on this metastore instance.
- hive.compactor.cleaner.run.interval - How often the cleaner thread runs, in milliseconds.
- hive.compactor.delta.num.threshold - Number of delta directories in a table or partition that will trigger a minor compaction.
- hive.compactor.delta.pct.threshold - Percentage (fractional) size of the delta files relative to the base that will trigger a major compaction. 1 = 100%, so the default 0.1 = 10%.
- hive.compactor.abortedtxn.threshold - Number of aborted transactions involving a given table or partition that will trigger a major compaction.
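When automatic major compaction is turned off, the off-hours compaction demonstrated manually in the next section can also be driven from a scheduled Java job. The sketch below is one hedged way to do it over JDBC; the URL, credentials and class name are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class NightlyCompaction {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 URL; schedule this class from cron/Oozie during low-load hours.
        String url = "jdbc:hive2://server1:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement()) {
            // The same statement the manual example below issues from the Hive CLI.
            stmt.execute("ALTER TABLE acidtest COMPACT 'major'");
        }
    }
}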
A Hive Compaction Manual example
In our example we have turned off major compaction, as it should only
run during off-peak load periods. We take a look at the delta files for
our table in HDFS and see that there are over 300 delta files and 5 base files.
[hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 13:17 /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500
-rw-r--r-- 3 mjohnson hdfs 482990 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500/bucket_00002
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500/bucket_00002_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2123501_2133500
-rw-r--r-- 3 mjohnson hdfs 482784 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2123501_2133500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2123501_2133500/bucket_00001_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2133501_2143500
-rw-r--r-- 3 mjohnson hdfs 482110 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2133501_2143500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2133501_2143500/bucket_00001_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2143501_2153500
-rw-r--r-- 3 mjohnson hdfs 476285 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2143501_2153500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2143501_2153500/bucket_00000_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2153501_2163500
A decision has been made to run the major compaction manually during
the evening lull, so we execute the “ALTER TABLE {tablename} COMPACT
‘major’” command to place the compaction job into the queue for
processing. A compaction resource management queue was defined with a
limited resource quota, so the compaction will not impact other jobs.
hive> alter table acidtest compact 'major';
Compaction enqueued.
OK
Time taken: 0.037 seconds
hive> show compactions;
OK
Database Table Partition Type State Worker Start Time
default acidtest NULL MAJOR working server2.hdp-26 1459100244000
Time taken: 0.019 seconds, Fetched: 2 row(s)
hive> show compactions;
OK
Database Table Partition Type State Worker Start Time
Time taken: 0.016 seconds, Fetched: 1 row(s)
hive>
The outstanding table compaction jobs are visible by executing the
“SHOW COMPACTIONS” command as illustrated in the example above; the
‘major’ compaction is also visible through the Applications history
log. After the ‘major’ compaction has completed, all of the delta files
available at the time the compaction was initiated will have been rolled
up into the ‘base’ files.
[hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 13:17 /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500
-rw-r--r-- 3 mjohnson hdfs 72704 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 436159 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 219572 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00002
[hive@server1 ~]$
The end result of this example is that 305 files were consolidated down
to just 5. While removing 300 files will not by itself impact NameNode
performance, it will most likely improve query performance, as the Hive
engine will have fewer files to scan when executing the query.
Next Steps
After completing these examples, you should have a good idea of how to
set up Hive Streaming in your environment. For some additional
information on Hive Streaming, check out the Bibliography section at the
bottom of this article.
Bibliography
Hopefully, the example and source code supplied with this blog posting
are sufficient to get you started with Hive Streaming and avoid
potential problems. In addition to this blog posting some other
resources which are useful references include:
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
http://hortonworks.com/blog/adding-acid-to-apache-hive/
... View more
Labels: