Member since: 09-24-2015
Posts: 32
Kudos Received: 60
Solutions: 4
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1411 | 02-10-2017 07:33 PM |
 | 1718 | 07-18-2016 02:14 PM |
 | 4394 | 07-14-2016 06:09 PM |
 | 18751 | 07-12-2016 07:59 PM |
04-20-2021
07:33 PM
Thanks for such a nice and detailed blog. I am looking for a solution to avoid duplicate records during Hive streaming. Can anybody please help me?
02-06-2017
06:59 PM
4 Kudos
Overview
Atlas provides powerful tagging capabilities which enable Data Analysts to identify all data sets containing specific types of data. The Atlas UI itself provides a powerful Tag-based search capability which requires no REST API interaction. However, for those of you who need to integrate Tag-based search with your data discovery and governance activities, this posting is for you. Within this posting are some instructions on how you can use the Atlas REST API to retrieve entity data based on a Tag name.
Before getting too deep into the Atlas Tag search examples, it is important to recognize that Atlas Tags are basically a form of Atlas type. If you invoke the REST API command “/api/atlas/types”, the summary output below shows the current set of user-defined Atlas Tags (CUSTOMER and SALES) interspersed between standard Atlas types such as ‘hive_table’, ‘jms_topic’, etc.:
"count": 35,
"requestId": "qtp1177377518-81 - c7d4a853-02a0-4a1e-9b50-f7375f6e5f08",
"results": [
"falcon_feed_replication",
"falcon_process",
"DataSet",
"falcon_feed_creation",
"file_action",
"hive_order",
"Process",
"hive_table",
"hive_db",
…
"Infrastructure",
"CUSTOMER",
"Asset",
"storm_spout",
"SALES",
"hive_column",
…
]
In the rest of the article we will expand on the Atlas types API to explore how we can perform two different types of Tag-based searches. Before going too far, it is important to note that the source code for the following examples is available through this repo.
Tag Search Example #1: Simple REST-based Tag search example
In our first Tag search example, our objective is to return a list of Atlas data entities which have the queried Tag name assigned. In this example, we are going to search our Atlas instance (on ‘server1’, port 21000) for all Atlas entities with a tag named CUSTOMER. You will want to replace CUSTOMER with an existing tag on your system.
Our Atlas DSL query to find the CUSTOMER tag using the ‘curl’ command is shown below:
curl -iv -u admin:admin -X GET http://server1:21000/api/atlas/discovery/search/dsl?query=CUSTOMER
The example above returns a list of the entity guids which have the Atlas Tag ‘CUSTOMER’ assigned, from the Atlas host ‘server1’ on port 21000. To run this query on your own cluster or on a sandbox, just substitute the Atlas Server host URL, Atlas Server port number, login information and your Tag name, and then invoke it as shown above with curl (or with SimpleAtlasTagSearch.py from the Python examples in the repo referenced at the end of this article).
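For reference, here is a minimal Python sketch of the same DSL Tag search using the requests library (this is only an illustration of the curl call above, not necessarily the repo's SimpleAtlasTagSearch.py; the host, port, credentials and tag are the example values used in this article):
import json
import requests

# Example values from this article; substitute your own Atlas host, port,
# credentials and tag name.
ATLAS_URL = "http://server1:21000"
TAG_NAME = "CUSTOMER"

response = requests.get(ATLAS_URL + "/api/atlas/discovery/search/dsl",
                        params={"query": TAG_NAME},
                        auth=("admin", "admin"))
results = json.loads(response.text)
print "Found {0} entities tagged {1}".format(results["count"], TAG_NAME)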
An output from this REST API query on my cluster is shown below:
{
  "count": 2,
  "dataType": {
    "attributeDefinitions": [
      …
    ],
    "typeDescription": null,
    "typeName": "__tempQueryResultStruct120"
  },
  "query": "CUSTOMER",
  "queryType": "dsl",
  "requestId": "qtp1177377518-81 - 624fc6b9-e3cc-4ab7-80ba-c6a57d6ef3fd",
  "results": [
    {
      "$typeName$": "__tempQueryResultStruct120",
      "instanceInfo": {
        "$typeName$": "__IdType",
        "guid": "806362dc-0709-47ca-af16-fac81184c130",
        "state": "ACTIVE",
        "typeName": "hive_table"
      },
      "traitDetails": null
    },
    {
      "$typeName$": "__tempQueryResultStruct120",
      "instanceInfo": {
        "$typeName$": "__IdType",
        "guid": "4138c963-b20d-4d10-b338-2c334202af43",
        "state": "ACTIVE",
        "typeName": "hive_table"
      },
      "traitDetails": null
    }
  ]
}
The results from this query can be thought of as having 3 sections:
1. The results header, where you can find the results count
2. The returned DataTypes
3. The results (list of entity guids)
For our purposes we are really only interested in the list of entities, so all you need to do is focus on extracting the important information from the .results jsonpath object in the returned JSON object. Looking at the results section we observe two entities with the CUSTOMER tag assigned. Each entity located by the search, for example the one with the guid ‘4138c963-b20d-4d10-b338-2c334202af43’, is an active entity (not deleted). We can now use the entity search capabilities to retrieve the actual entities as described in the next example within this article.
Example #2: Returning details on all entities based on Tag assignment
The beauty of Example #1 is that we can build an entity list using a single REST API call. However, in the real world we will want access to details about the tagged entities. To accomplish this, we will need a programming interface (Python, Java, Scala, bash, or whatever your favorite tool is) to pull the GUIDs and then perform entity searches.
For the purposes of this posting, we will use Python to illustrate how to perform more powerful Atlas Tag searches. The example below performs two kinds of Atlas REST API queries to build a JSON object containing the details, and not just the guids, for the entities with our Tag assigned.
import json
import requests

# Connection settings used in this article; replace with your own.
ATLAS_DOMAIN = "server1"
ATLAS_PORT = "21000"
TAG_NAME = "CUSTOMER"

def atlasGET(restAPI):
    url = "http://" + ATLAS_DOMAIN + ":" + ATLAS_PORT + restAPI
    r = requests.get(url, auth=("admin", "admin"))
    return json.loads(r.text)

results = atlasGET("/api/atlas/discovery/search/dsl?query={0}".format(TAG_NAME))
entityGuidList = results['results']
entityList = []
for entity in entityGuidList:
    guid = entity['instanceInfo']['guid']
    entityDetail = atlasGET("/api/atlas/entities/{0}".format(guid))
    entityList.append(entityDetail)
print json.dumps(entityList, indent=4, sort_keys=True)
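If you only want details for entities that have not been soft-deleted, a small optional addition is to filter the guid list on the 'state' field visible in the search results above before fetching the details (a minimal sketch building on the script above):
# Keep only entities whose search-result state is ACTIVE (i.e. not deleted).
activeGuidList = [e for e in entityGuidList
                  if e['instanceInfo']['state'] == 'ACTIVE']
entityList = [atlasGET("/api/atlas/entities/{0}".format(e['instanceInfo']['guid']))
              for e in activeGuidList]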
The output from this script is now available for more sophisticated data governance and data discovery projects.
Atlas Tag Based Search Limitations
As powerful as both the Atlas UI and Atlas REST API Tag-based searches are, there are some limitations to be aware of:
Atlas supports searching on only one Tag at a time.
It is not possible to include other entity properties in the Tag searches.
The Atlas REST API used for Tag searches can only return a list of GUIDs.
It is not possible to search for Tag attributes.
12-23-2016
05:01 PM
Overview
Data Governance is unique for each organization, and every organization needs to track a different set of properties for their data assets. Fortunately, Atlas provides the flexibility to add new data asset properties to support your organization’s data governance requirements. The objective of this article is to describe the steps for using the Atlas REST API to add new Atlas properties to your Atlas Types.
Add a new Property for an existing Atlas Type
To simplify this article, we will focus on the 3 steps required to add, and enable for display, a custom property on the standard Atlas type ‘hive_table’. Following these steps, you should be able to modify the ‘hive_table’ Atlas Type and add custom properties whose values can be entered, viewed in the Atlas UI and searched.
To make the article easier to read, the JSON file is shown in small chunks. To view the full JSON file as well as the other files used to research this article, check out this repo.
Step 1: Define the custom property JSON
The most important step of this process is properly defining the JSON used to update your Atlas Type. There are three parts to the JSON object we will pass to Atlas:
The header – contains the type identifier and some other meta information required by Atlas
The actual new property definition
The required existing Atlas type properties
Defining the Header
Frankly, the header is just standard JSON elements which get
repeated every time you define a new property.
The only change we need to make to the header block shown below for each
example is to get the ‘typeName’ JSON element properly set. In our case as shown below we want to add a
property defined for all Hive tables so we have correctly defined the typeName
to be ‘hive_table’.
{"enumTypes": [],
 "structTypes": [],
 "traitTypes": [],
 "classTypes": [
   {"superTypes": ["DataSet"],
    "hierarchicalMetaTypeName": "org.apache.atlas.typesystem.types.ClassType",
    "typeName": "hive_table",
    "typeDescription": null,
Keep in mind that all the JSON elements shown above pertain to the Atlas type which we plan to modify.
Define the new Atlas Property
For this example, we are adding a property called ‘DataOwner’
which we intend to contain the owner of the data from a governance
perspective. For our purposes, we have
the following requirements:
Requirement | Attribute Property | Assignment |
---|---|---|
The property is searchable | isIndexable | True |
The property will contain a string | datatype | String |
Not all Hive tables will have an owner | Multiplicity | Optional |
A Data owner can be assigned to multiple Hive tables | isUnique | false |
Based on the above requirements, we end up with a property definition as shown below:
{"name": "DataOwner",
 "dataTypeName": "string",
 "multiplicity": "optional",
 "isComposite": false,
 "isUnique": false,
 "isIndexable": true,
 "reverseAttributeName": null},
As the full JSON file shows, it is possible to define multiple properties at one time, so take your time and try to define all of your properties at once.
Make certain you include the existing Properties
An annoying thing about the Atlas v1 REST API is the need to include some of the other key properties of the type in your JSON file. For this example, which was run on HDP 2.5.3, I had to define a bunch of existing properties, and every time you add a new custom property it is necessary to include the previously added custom properties in your JSON as well. If you check out the JSON file used for this example you will find a long list of properties which are required as of HDP 2.5.0.
Step 2: PUT the Atlas property update
We now have the full JSON request constructed with
our new property requirements. So it is time to PUT the JSON file using the Atlas REST API v1. For the text of this article I am using ‘curl’ to make the example clearer, though in the associated repo Python is used to make life a little easier. To execute the PUT REST request, we will first need to
collect the following data elements:
Data Element | Where to find it |
---|---|
Atlas Admin User Id | This is a defined ‘administrative’ user for the Atlas system. It is the same user id which you use to log into Atlas. |
Atlas Password | The password associated with the Atlas Admin User Id |
Atlas Server | The Atlas Metadata Server. This can be found by selecting the Atlas service in Ambari and then looking in the Summary tab. |
Atlas Port | It is normally 21000. Check the Ambari Atlas configs for the specific port in your cluster. |
update_hive_table_type.json | The name of the JSON file containing our new Atlas property definition |
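For those scripting this step, here is a minimal Python sketch of the same PUT request built from the data elements above (an illustrative sketch using the requests library, not the actual helper script from the repo); the curl form used in the rest of this article follows below.
import json
import requests

# Data elements from the table above; replace with your own values.
ATLAS_USER = "admin"
ATLAS_PASSWORD = "admin"
ATLAS_SERVER = "server1"
ATLAS_PORT = "21000"

# Read the JSON file containing the updated type definition.
with open("update_hive_table_type.json") as f:
    type_update = f.read()

r = requests.put("http://{0}:{1}/api/atlas/types".format(ATLAS_SERVER, ATLAS_PORT),
                 data=type_update,
                 headers={"Content-Type": "application/json"},
                 auth=(ATLAS_USER, ATLAS_PASSWORD))
print r.status_code
print json.dumps(json.loads(r.text), indent=4)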
curl -iv -d @update_hive_table_type.json --header "Content-Type: application/json" -u {Atlas Admin User Id}:{Atlas Password} -X PUT http://{Atlas Server}:{Atlas Port}/api/atlas/types
If all is successful, then we should see a result like the one shown below. The only thing you will need to verify in the result (other than the lack of any reported errors) is that the "name" element is the same as the Atlas type to which you are adding a new custom property.
{
"requestId": "qtp1177377518-235-fcf1c6f4-5993-49ac-8f5b-cdaafd01f2c0",
"types":
[ {
"name": "hive_table"
} ]} However, if you are like me, then you probably will make a
couple of mistakes along the way. To
help you identify the root cause of your errors, here is a short list of errors and how to resolve them:
Error #1: Missing a necessary Atlas property for the Type
An error like the one shown below occurs because the JSON with your new custom property is missing an existing property.
{ "error":
"hive_table can't be updated - Old Attribute stats:numRows is
missing",
"stackTrace":
"org.apache.atlas.typesystem.types.TypeUpdateException: hive_table can't
be updated - Old Attribute stats:numRows is missing\n\tat
The solution to this problem is to add that property, along with your custom property, to your JSON file. If you are uncertain as to the exact definition of the property, then execute the Atlas REST API GET call shown below to list out the properties of the Atlas Type you are currently modifying:
curl -u {Atlas Admin User id}:{Atlas password} -X GET http://{Atlas Server}:{Atlas Port}/api/atlas/types
Error #2: Unknown datatype:
An error occurred like the one below:
{ "error":
"Unknown datatype: XRAY",
"stackTrace":
"org.apache.atlas.typesystem.exception.TypeNotFoundException: Unknown In this case, you have entered an incorrect Atlas Data Type. The allowed for data types include: byte short int long float double biginteger bigdecimal date string {custom types} The {custom types} enables you to reference another Atlas
type. So for example you decide to
create a ‘SecurityRules’ Atlas data type which itself contains a list of
properties, you would just insert the SecurityRules type name as the property. Error #n: Added incorrectly a new Atlas property for a type and you need to
delete it This is the reason why you ALWAYS want to modify Atlas Types
and Properties in a Sandbox developer region.
DO NOT EXPERIMENT WITH CUSTOMING ATLAS TYPES IN PRODUCTION!!!!! If you
ignore this standard approach in most organizations SLDC, your solution is to
delete the Atlas Service from within Ambari, re-add the service and then re-add
all your data. Not fun. Step 3: Check out the results As we see above, our new custom Atlas ‘hive_table’ property
is now visible in the Atlas UI for all tables. As the property was just defined for all ‘hive_table’ data assets, the value is null. Your next step, which is covered in the article Modify Atlas Entity properties using REST API commands, is to assign a value to the new property.
Bibliography
Atlas Rest API
Atlas Technical User Guide
Atlas REST API Search Techniques
Modify Atlas Entity properties using REST API commands
09-16-2018
12:41 AM
This article is really very useful, but it has a small yet confusing error (especially for HDP newbies): all occurrences of "Ranger user id" and "Ranger Admin Server" must be replaced by "Atlas User ID" and "Atlas Admin Server" respectively.
10-26-2016
01:12 AM
The article: Modify Atlas Entity properties using REST API commands contains a full description for how to update both the comment and description entity properties for Atlas managed hive_table types.
06-23-2017
11:54 AM
@mjohnson Thanks for the detailed explanation on updating entities. I have a question about the command you used to update the description of an entity. The command doesn't contain the actual string that the description should be set to. Do we need to add it to the command when executing? Something like the below: http://server1:21000/api/atlas/entities/b78b5541-a205-4f9e-8b81-e20632a88ad5?property=description:"I get my answers from HCC" Thanks
08-04-2016
10:37 PM
2 Kudos
Hive Streaming Compaction
This is the second part of the Hive Streaming Article series. In this article we will review the issues around compacting Hive Streaming files.
One of the results of ingesting data through Hive Streaming is the creation of many small 'delta' files. Left uncompacted, these files put you at risk of running into NameNode capacity problems. Fortunately, compaction functionality is part of Hive Streaming. The remainder of this article reviews design considerations as well as the commands necessary to enable and control compaction for your Hive tables.
Hive Compaction Design considerations
The Compaction process has a set of cleaner processes running in the
background during the ingest process looking for opportunities to
compact the delta files based on the rules you specify.
The first thing to keep in mind is that there are two forms of compaction: 'minor' and 'major'. A 'minor' compaction just consolidates the delta files. This approach does not have to consolidate the delta files with a potentially large set of base bucket files and is thus the least disruptive to system resources. A 'major' compaction consolidates all of the delta files just like a 'minor' compaction, and in addition it consolidates the delta files with the base files to produce a very clean physical layout for the Hive table. However, major compactions can take minutes to hours and can consume a lot of disk, network, memory and CPU resources, so they should be invoked carefully.
To provide greater control over the compaction process and avoid impacting other workloads, compaction can, in addition to the compactor configuration options described below, either be triggered automatically by the cleaner threads or be initiated manually when system load is low.
The primary compaction configuration triggers to review when
implementing or tuning your compaction processes are:
hive.compactor.initiator.on - Whether the compaction initiator and cleaner threads run on this metastore instance.
hive.compactor.cleaner.run.interval - Time in milliseconds between runs of the cleaner thread.
hive.compactor.delta.num.threshold - Number of delta directories in
a table or partition that will trigger a minor compaction.
hive.compactor.delta.pct.threshold - Percentage (fractional) size of
the delta files relative to the base that will trigger a
major compaction. 1 = 100%, so the default 0.1 = 10%.
hive.compactor.abortedtxn.threshold - Number of aborted transactions
involving a given table or partition that will trigger a major
compaction
A Hive Compaction Manual example
In our example we have turned off automatic major compaction, as it should only run during off-peak periods. We take a look at the delta files for our table in HDFS and see that there are over 300 delta files and 5 base files.
[hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 13:17 /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500
-rw-r--r-- 3 mjohnson hdfs 482990 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500/bucket_00002
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500/bucket_00002_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2123501_2133500
-rw-r--r-- 3 mjohnson hdfs 482784 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2123501_2133500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2123501_2133500/bucket_00001_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2133501_2143500
-rw-r--r-- 3 mjohnson hdfs 482110 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2133501_2143500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2133501_2143500/bucket_00001_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2143501_2153500
-rw-r--r-- 3 mjohnson hdfs 476285 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2143501_2153500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2143501_2153500/bucket_00000_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2153501_2163500
A decision has been made to run the major compaction manually during the evening lull, so we execute the "ALTER TABLE {tablename} COMPACT 'major'" command to place the compaction job into the queue for processing. A compaction resource management queue was defined with a limited resource quota, so the compaction will not impact other jobs.
hive> alter table acidtest compact 'major';
Compaction enqueued.
OK
Time taken: 0.037 seconds
hive> show compactions;
OK
Database Table Partition Type State Worker Start Time
default acidtest NULL MAJOR working server2.hdp-26 1459100244000
Time taken: 0.019 seconds, Fetched: 2 row(s)
hive> show compactions;
OK
Database Table Partition Type State Worker Start Time
Time taken: 0.016 seconds, Fetched: 1 row(s)
hive>;
The outstanding table compaction jobs are visible by executing the command "SHOW COMPACTIONS", as illustrated in the example above. The 'major' compaction is also visible through the Applications history log. After the 'major' compaction has completed, all of the delta files available at the time the compaction was initiated will have been rolled up into the 'base' files.
[hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 13:17 /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500
-rw-r--r-- 3 mjohnson hdfs 72704 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 436159 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 219572 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00002
[hive@server1 ~]$
The end result of this example is that 305 files were consolidated into just 5 files. While 300 extra files will not noticeably impact NameNode performance, the consolidation will most likely improve query performance, as the Hive engine will have fewer files to scan when executing a query.
Bibliography
Hopefully, the example and source code supplied with this blog posting are sufficient to get you started with Hive Streaming and to help you avoid potential problems. In addition to this blog posting, some other resources which are useful references include:
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
http://hortonworks.com/blog/adding-acid-to-apache-hive/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/hive-013-feature-transactions.html
http://www.slideshare.net/Hadoop_Summit/adding-acid-transactions-inserts-updates-a
06-09-2016
04:07 PM
1 Kudo
Nice table -- what version of HDP did you base it on?
11-29-2016
12:36 AM
A couple of comments:
1. The section on setting up a dual homed network is correct, but misleading. Most people who set up dual-homed networks would expect to spread at least some of the load over the interfaces, but Hadoop code is just not network aware in that sense. So it is *much* better to use bonding/link aggregation for network redundancy.
2. In this day and age, don't even think about using 1Gb ports. Use at least 2 10Gb ports. Cloud providers are *today* installing 50Gb networks to their servers - 2x25Gb or 1x50Gb. You're wasting a LOT of CPU if you don't give them enough bandwidth.