Member since 09-24-2015

32 Posts
60 Kudos Received
4 Solutions
        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 1855 | 02-10-2017 07:33 PM |
|  | 2169 | 07-18-2016 02:14 PM |
|  | 5589 | 07-14-2016 06:09 PM |
|  | 20295 | 07-12-2016 07:59 PM |
			
    
	
		
		
04-20-2021 07:33 PM
Thanks for such a nice and detailed blog. I am looking for a solution to avoid duplicate records during Hive streaming. Can anybody please help me?
						
					

02-06-2017 06:59 PM
4 Kudos
Overview

Atlas provides powerful tagging capabilities which allow Data Analysts to identify all data sets containing specific types of data. The Atlas UI itself provides a powerful Tag-based search capability which requires no REST API interaction. However, for those of you who need to integrate Tag-based search with your data discovery and governance activities, this posting is for you. Within this posting are instructions on how you can use the Atlas REST API to retrieve entity data based on a Tag name.
Before getting too deep into the Atlas Tag search examples, it is important to recognize that Atlas Tags are basically a form of Atlas type. If you invoke the REST API command "/api/atlas/types", the summary output below shows the current set of user-defined Atlas Tags (CUSTOMER and SALES) interspersed among standard Atlas types such as 'hive_table', 'jms_topic', etc.:

 "count": 35,
 "requestId": "qtp1177377518-81 - c7d4a853-02a0-4a1e-9b50-f7375f6e5f08",
 "results": [
  "falcon_feed_replication",
  "falcon_process",
  "DataSet",
  "falcon_feed_creation",
  "file_action",
  "hive_order",
  "Process",
  "hive_table",
  "hive_db",
  ...
  "Infrastructure",
  "CUSTOMER",
  "Asset",
  "storm_spout",
  "SALES",
  "hive_column",
  ...
 ]
In the rest of this article we will expand on the Atlas types API to explore how we can perform two different kinds of Tag-based searches. Before going too far, note that the source code for the following examples is available through this repo.

Tag Search Example #1: Simple REST-based Tag search example

In our first Tag search example, our objective is to return a list of Atlas data entities which have the queried Tag name assigned. In this example, we are going to search our Atlas instance (on 'server1', port 21000) for all Atlas entities with a tag named CUSTOMER. You will want to replace CUSTOMER with an existing tag on your system.

Our Atlas DSL query to find the CUSTOMER tag using the 'curl' command is shown below:

curl -iv -u admin:admin -X GET http://server1:21000/api/atlas/discovery/search/dsl?query=CUSTOMER

The example above returns a list of the entity guids which have the Atlas Tag 'CUSTOMER' assigned, from the Atlas host 'server1' on port 21000. To run this query on your own cluster or on a sandbox, just substitute the Atlas server host URL, Atlas server port number, login information and your Tag name, and then invoke it as shown above with curl (or with SimpleAtlasTagSearch.py, the Python example in the repo referenced at the end of this article).
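If you prefer Python over curl, a minimal sketch of the same DSL tag search using the requests library might look like the one below. This is only a sketch in the spirit of the SimpleAtlasTagSearch.py script mentioned above, not a copy of it; the host, port, credentials and tag name are the example values used throughout this article.

import json
import requests

ATLAS_DOMAIN = "server1"   # example Atlas host from this article
ATLAS_PORT = "21000"       # default Atlas port
TAG_NAME = "CUSTOMER"      # replace with an existing tag on your system

# Call the Atlas DSL search endpoint with the tag name as the query.
url = "http://{0}:{1}/api/atlas/discovery/search/dsl?query={2}".format(
    ATLAS_DOMAIN, ATLAS_PORT, TAG_NAME)
response = requests.get(url, auth=("admin", "admin"))

# The entity guids are listed under the 'results' element of the response.
print json.dumps(json.loads(response.text), indent=4, sort_keys=True)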
An output from this REST API query on my cluster is shown below:

{
  "count": 2,
  "dataType": {
    "attributeDefinitions": [
      ...
    ],
    "typeDescription": null,
    "typeName": "__tempQueryResultStruct120"
  },
  "query": "CUSTOMER",
  "queryType": "dsl",
  "requestId": "qtp1177377518-81 - 624fc6b9-e3cc-4ab7-80ba-c6a57d6ef3fd",
  "results": [
    {
      "$typeName$": "__tempQueryResultStruct120",
      "instanceInfo": {
        "$typeName$": "__IdType",
        "guid": "806362dc-0709-47ca-af16-fac81184c130",
        "state": "ACTIVE",
        "typeName": "hive_table"
      },
      "traitDetails": null
    },
    {
      "$typeName$": "__tempQueryResultStruct120",
      "instanceInfo": {
        "$typeName$": "__IdType",
        "guid": "4138c963-b20d-4d10-b338-2c334202af43",
        "state": "ACTIVE",
        "typeName": "hive_table"
      },
      "traitDetails": null
    }
  ]
}
The results from this query can be thought of as having three sections:

- The results header, where you can find the results count
- The returned data types
- The results (a list of entity guids)

For our purposes we are really only interested in the list of entities, so all you need to do is extract the important information from the .results jsonpath object in the returned JSON. Looking at the results section, we observe two active (not deleted) hive_table entities with the CUSTOMER tag assigned, for example the entity with the guid '4138c963-b20d-4d10-b338-2c334202af43'. We can now use the entity search capabilities to retrieve the actual entities, as described in the next example within this article.

Example #2: Returning details on all entities based on Tag assignment
The beauty of Example #1 is that we can build an entity list using a single REST API call. However, in the real world we will want access to details about the tagged entities. To accomplish this, we will need a programming interface such as Python, Java, Scala, bash, or whatever your favorite tool is, to pull the GUIDs and then perform entity searches.

For the purposes of this posting, we will use Python to illustrate how to perform more powerful Atlas Tag searches. The example below performs two kinds of Atlas REST API queries to build a JSON object containing the details, not just the guids, for the entities with our Tag assigned.

import json
import requests

ATLAS_DOMAIN = "server1"    # substitute your Atlas host
ATLAS_PORT = "21000"        # substitute your Atlas port
TAG_NAME = "CUSTOMER"       # substitute your tag name

def atlasGET(restAPI):
    # Issue a GET against the Atlas REST API and return the parsed JSON.
    url = "http://" + ATLAS_DOMAIN + ":" + ATLAS_PORT + restAPI
    r = requests.get(url, auth=("admin", "admin"))
    return json.loads(r.text)

results = atlasGET("/api/atlas/discovery/search/dsl?query={0}".format(TAG_NAME))
entityGuidList = results['results']

# Fetch the full entity details for each guid returned by the tag search.
entityList = []
for entity in entityGuidList:
    guid = entity['instanceInfo']['guid']
    entityDetail = atlasGET("/api/atlas/entities/{0}".format(guid))
    entityList.append(entityDetail)

print json.dumps(entityList, indent=4, sort_keys=True)
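As a follow-on, here is a hedged sketch of how one might reduce that detailed output to a simple summary. The key path used below ('definition' -> 'values' -> 'qualifiedName') is an assumption about the shape of the Atlas v1 entity response; inspect the printed entityList above and adjust the keys to match what your cluster actually returns.

# Minimal sketch: list a qualified name for each tagged entity.
# NOTE: the 'definition'/'values'/'qualifiedName' keys are assumptions about
# the Atlas v1 entity payload; verify them against your own output.
for entityDetail in entityList:
    values = entityDetail.get('definition', {}).get('values', {})
    print values.get('qualifiedName', 'unknown')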
The output from this script is now available for more sophisticated data governance and data discovery projects.

Atlas Tag Based Search Limitations

As powerful as both the Atlas UI and the Atlas REST API Tag-based searches are, there are some limitations to be aware of:

- Atlas supports searching on only one Tag at a time.
- It is not possible to include other entity properties in Tag searches.
- The Atlas REST API used for Tag searches can only return a list of GUIDs.
- It is not possible to search on Tag attributes.
						
					

12-23-2016 05:01 PM
Overview

Data Governance is unique for each organization, and every organization needs to track a different set of properties for its data assets. Fortunately, Atlas provides the flexibility to add new data asset properties to support your organization's data governance requirements. The objective of this article is to describe the steps for using the Atlas REST API to add new Atlas properties to your Atlas Types.

Add a new Property for an existing Atlas Type

To simplify this article, we will focus on the 3 steps required to add, and enable for display, a custom property on the standard Atlas type 'hive_table'. Following these steps, you should be able to modify the 'hive_table' Atlas Type and add custom properties which are available to enter values for, view in the Atlas UI, and search on. To make the article easier to read, the JSON file is shown in small chunks. To view the full JSON file, as well as the other files used to research this article, check out this repo.

Step 1: Define the custom property JSON

The most important step of this process is properly defining the JSON used to update your Atlas Type. There are three parts to the JSON object we will pass to Atlas:

- The header - contains the type identifier and some other meta information required by Atlas
- The actual new property definition
- The required existing Atlas type properties

Defining the Header

Frankly, the header is just a set of standard JSON elements which get repeated every time you define a new property. The only change we need to make to the header block shown below for each example is to set the 'typeName' JSON element properly. In our case, as shown below, we want to add a property defined for all Hive tables, so we have set the typeName to 'hive_table'.

{"enumTypes": [],
 "structTypes": [],
 "traitTypes": [],
 "classTypes": [
   {"superTypes": ["DataSet"],
    "hierarchicalMetaTypeName": "org.apache.atlas.typesystem.types.ClassType",
    "typeName": "hive_table",
    "typeDescription": null,

Keep in mind that all of the JSON elements shown above pertain to the Atlas type which we plan to modify.

Define the new Atlas Property

For this example, we are adding a property called 'DataOwner' which we intend to contain the owner of the data from a governance perspective. For our purposes, we have the following requirements:

| Requirement | Attribute Property | Assignment |
|---|---|---|
| The property is searchable | isIndexable | true |
| The property will contain a string | dataTypeName | string |
| Not all Hive tables will have an owner | multiplicity | optional |
| A Data owner can be assigned to multiple Hive tables | isUnique | false |

Based on the above requirements, we end up with a property definition as shown below:

{"name": "DataOwner",
 "dataTypeName": "string",
 "multiplicity": "optional",
 "isComposite": false,
 "isUnique": false,
 "isIndexable": true,
 "reverseAttributeName": null},

As shown in the full JSON file, it is possible to define multiple properties at one time, so take your time and try to define all of the properties at once.
Make certain you include the existing properties

An annoying thing about the Atlas v1 REST API is the need to include some of the other key properties of the type in your JSON file. For this example, which was run on HDP 2.5.3, I had to define a bunch of existing properties, and every time you add a new custom property it is necessary to include those custom properties in your JSON as well. If you check out the JSON file used for this example, you will find a long list of properties which are required as of HDP 2.5.0.
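Because the update must carry all of the existing attribute definitions along with the new one, it can be easier to build the payload programmatically than to maintain it by hand. The sketch below is one hedged way to do that with Python and requests: it pulls the current 'hive_table' definition, appends the new 'DataOwner' attribute, and writes the merged payload to update_hive_table_type.json. The 'definition' key and the nested layout used here are assumptions about the Atlas v1 response; verify them against your own cluster before relying on this.

import json
import requests

ATLAS_URL = "http://server1:21000"   # example host/port from this article
AUTH = ("admin", "admin")

# Fetch the current hive_table type definition (Atlas v1 API).
# NOTE: the response layout (a 'definition' element wrapping the usual
# enumTypes/structTypes/traitTypes/classTypes structure) is an assumption;
# confirm it by inspecting the GET output on your cluster.
resp = requests.get(ATLAS_URL + "/api/atlas/types/hive_table", auth=AUTH)
typesDef = json.loads(resp.text)["definition"]

# Append our new custom attribute to the existing attributeDefinitions so
# that none of the required existing properties are lost.
newAttribute = {
    "name": "DataOwner",
    "dataTypeName": "string",
    "multiplicity": "optional",
    "isComposite": False,
    "isUnique": False,
    "isIndexable": True,
    "reverseAttributeName": None
}
typesDef["classTypes"][0]["attributeDefinitions"].append(newAttribute)

# Save the merged payload for the PUT in Step 2.
with open("update_hive_table_type.json", "w") as f:
    json.dump(typesDef, f, indent=4)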
Step 2: PUT the Atlas property update

We now have the full JSON request constructed with our new property requirements, so it is time to PUT the JSON file using the Atlas REST API v1. For the text of this article I am using 'curl' to make the example clearer, though in the associated repo Python is used to make life a little easier.

To execute the PUT REST request we will first need to collect the following data elements:

| Data Element | Where to find it |
|---|---|
| Atlas Admin User Id | This is a defined 'administrative' user for the Atlas system. It is the same user id which you use to log into Atlas. |
| Atlas Password | The password associated with the Atlas Admin User Id. |
| Atlas Server | The Atlas Metadata Server. This can be found by selecting the Atlas service from Ambari and then looking in the summary tab. |
| Atlas Port | It is normally 21000. Check the Ambari Atlas configs for the specific port in your cluster. |
| update_hive_table_type.json | This is the name of the JSON file containing our new Atlas property definition. |
curl -iv -d @update_hive_table_type.json --header "Content-Type: application/json" -u {Atlas Admin User Id}:{Atlas Password} -X PUT http://{Atlas Server}:{Atlas Port}/api/atlas/types

If all is successful, then we should see a result like the one shown below. The only thing you will need to verify in the result (other than the lack of any reported errors) is that the "name" element is the same as the Atlas type to which you are adding a new custom property.

{
  "requestId": "qtp1177377518-235-fcf1c6f4-5993-49ac-8f5b-cdaafd01f2c0",
  "types": [
    {
      "name": "hive_table"
    }
  ]
}
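If you would rather issue the update from Python than from curl, a minimal sketch of the equivalent PUT (against the same /api/atlas/types endpoint, using the update_hive_table_type.json payload described above) might look like this; the host, port and credentials are the example values used throughout this article.

import json
import requests

ATLAS_URL = "http://server1:21000"   # example host/port from this article
AUTH = ("admin", "admin")

# Load the JSON payload built in Step 1 and PUT it to the Atlas types API.
with open("update_hive_table_type.json") as f:
    payload = f.read()

resp = requests.put(ATLAS_URL + "/api/atlas/types",
                    data=payload,
                    headers={"Content-Type": "application/json"},
                    auth=AUTH)

# On success the response echoes the name of the type we just updated.
print resp.status_code
print json.dumps(json.loads(resp.text), indent=4, sort_keys=True)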
However, if you are like me, then you will probably make a couple of mistakes along the way. To help you identify the root cause of your errors, here is a short list of errors and how to resolve them.

Error #1: Missing a necessary Atlas property for the Type

An error like the one shown below occurs because the JSON with your new custom property is missing an existing property:

{
  "error": "hive_table can't be updated - Old Attribute stats:numRows is missing",
  "stackTrace": "org.apache.atlas.typesystem.types.TypeUpdateException: hive_table can't be updated - Old Attribute stats:numRows is missing\n\tat ...

The solution to this problem is to add that property, along with your custom property, to your JSON file. If you are uncertain of the exact definition for the property, then execute the Atlas REST API GET call shown below to list out the Atlas Type whose properties you are currently modifying:

curl -u {Atlas Admin User id}:{Atlas password} -X GET http://{Atlas Server}:{Atlas Port}/api/atlas/types

Error #2: Unknown datatype

An error like the one below occurred:

{
  "error": "Unknown datatype: XRAY",
  "stackTrace": "org.apache.atlas.typesystem.exception.TypeNotFoundException: Unknown ...

In this case, you have entered an incorrect Atlas data type. The allowed data types include:

- byte
- short
- int
- long
- float
- double
- biginteger
- bigdecimal
- date
- string
- {custom types}

The {custom types} option enables you to reference another Atlas type. So, for example, if you decide to create a 'SecurityRules' Atlas data type which itself contains a list of properties, you would just insert the SecurityRules type name as the property's data type.

Error #n: You added a new Atlas property for a type incorrectly and need to delete it

This is the reason why you ALWAYS want to modify Atlas Types and Properties in a sandbox or development region. DO NOT EXPERIMENT WITH CUSTOMIZING ATLAS TYPES IN PRODUCTION!!!!! If you ignore this standard approach found in most organizations' SDLC, your solution is to delete the Atlas service from within Ambari, re-add the service, and then re-add all your data. Not fun.

Step 3: Check out the results

As we see above, our new custom Atlas 'hive_table' property is now visible in the Atlas UI for all tables. As the property was just defined for all 'hive_table' data assets, the value is null. Your next step, which is covered in the article Modify Atlas Entity properties using REST API commands, is to assign a value to the new property.

Bibliography

- Atlas Rest API
- Atlas Technical User Guide
- Atlas REST API Search Techniques
- Modify Atlas Entity properties using REST API commands
						
					

09-16-2018 12:41 AM
This article is really very useful, but it has a silly yet confusing error (especially for HDP newbies): all occurrences of "Ranger user id" and "Ranger Admin Server" must be replaced by "Atlas User ID" and "Atlas Admin Server" respectively.
						
					

10-26-2016 01:12 AM
The article Modify Atlas Entity properties using REST API commands contains a full description of how to update both the comment and description entity properties for Atlas-managed hive_table types.
						
					

06-23-2017 11:54 AM
@mjohnson Thanks for the detailed explanation on updating entities. I have a question about the command you used to update the description of an entity. The command you used to update the description doesn't contain the actual string that needs to be replaced. Do we need to add it to the command while executing, something like the below?

http://server1:21000/api/atlas/entities/b78b5541-a205-4f9e-8b81-e20632a88ad5?property=description:"I get my answers from HCC"

Thanks
						
					

08-04-2016 10:37 PM
2 Kudos
 Hive Streaming Compaction 
This is the second part of the Hive Streaming article series. In this article we will review the issues around compacting Hive Streaming files.

One of the results of ingesting data through Hive streaming is the creation of many small 'delta' files. Left uncompacted, these files put you at risk of NameNode capacity problems. Fortunately, compaction functionality is part of Hive Streaming. The remainder of this article reviews design considerations as well as the commands necessary to enable and control compaction for your Hive tables.

Hive Compaction Design Considerations

The compaction process has a set of cleaner processes running in the background during the ingest process, looking for opportunities to compact the delta files based on the rules you specify.

The first thing to keep in mind is that there are two forms of compaction: 'minor' and 'major'. A 'minor' compaction just consolidates the delta files. It does not have to consolidate all of the delta files along with a large set of base bucket files, and is thus the least disruptive to system resources. A 'major' compaction consolidates all of the delta files just like a 'minor' compaction, and in addition it consolidates the delta files with the base to produce a very clean physical layout for the Hive table. However, major compactions can take minutes to hours and can consume a lot of disk, network, memory and CPU resources, so they should be invoked carefully.

To provide greater control over the compaction process and avoid impacting other workloads, compaction can, in addition to the compactor configuration options listed below, either be triggered automatically by the cleaner threads or initiated manually when system load is low.

The primary compaction configuration triggers to review when implementing or tuning your compaction processes are:

- hive.compactor.initiator.on
- hive.compactor.cleaner.run.interval
- hive.compactor.delta.num.threshold - Number of delta directories in a table or partition that will trigger a minor compaction.
- hive.compactor.delta.pct.threshold - Percentage (fractional) size of the delta files relative to the base that will trigger a major compaction. 1 = 100%, so the default 0.1 = 10%.
- hive.compactor.abortedtxn.threshold - Number of aborted transactions involving a given table or partition that will trigger a major compaction.
A Manual Hive Compaction Example

In our example we have turned off automatic major compaction, as it should only run during off-peak periods. We take a look at the delta files for our table in HDFS and see that there are over 300 delta files and 5 base files.
   [hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 13:17        /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:18        /apps/hive/warehouse/acidtest/delta_2113501_2123500
-rw-r--r-- 3 mjohnson hdfs 482990 2016-03-27 13:18   /apps/hive/warehouse/acidtest/delta_2113501_2123500/bucket_00002
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18     /apps/hive/warehouse/acidtest/delta_2113501_2123500/bucket_00002_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17        /apps/hive/warehouse/acidtest/delta_2123501_2133500
-rw-r--r-- 3 mjohnson hdfs 482784 2016-03-27 13:18   /apps/hive/warehouse/acidtest/delta_2123501_2133500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18     /apps/hive/warehouse/acidtest/delta_2123501_2133500/bucket_00001_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17        /apps/hive/warehouse/acidtest/delta_2133501_2143500
-rw-r--r-- 3 mjohnson hdfs 482110 2016-03-27 13:18   /apps/hive/warehouse/acidtest/delta_2133501_2143500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18     /apps/hive/warehouse/acidtest/delta_2133501_2143500/bucket_00001_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17        /apps/hive/warehouse/acidtest/delta_2143501_2153500
-rw-r--r-- 3 mjohnson hdfs 476285 2016-03-27 13:18   /apps/hive/warehouse/acidtest/delta_2143501_2153500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18     /apps/hive/warehouse/acidtest/delta_2143501_2153500/bucket_00000_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17        /apps/hive/warehouse/acidtest/delta_2153501_2163500   
A decision has been made to run the major compaction manually during the evening lull, so we execute the "ALTER TABLE {tablename} COMPACT 'major'" command to place the compaction job into the queue for processing. A compaction resource management queue was defined with a limited resource quota, so the compaction will not impact other jobs.
   hive> alter table acidtest compact 'major';
Compaction enqueued.
OK
Time taken: 0.037 seconds
hive> show compactions;
OK
Database Table Partition Type State Worker Start Time
default acidtest NULL MAJOR working server2.hdp-26 1459100244000
Time taken: 0.019 seconds, Fetched: 2 row(s)
hive> show compactions;
OK
Database Table Partition Type State Worker Start Time
Time taken: 0.016 seconds, Fetched: 1 row(s)
hive>;   
The outstanding table compaction jobs are visible by executing the command "SHOW COMPACTIONS", as illustrated in the example above; the 'major' compaction is also visible through the application history log. After the 'major' compaction has completed, all of the delta files available at the time the compaction was initiated will have been rolled up into the 'base' files.
   [hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 13:17       /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:37       /apps/hive/warehouse/acidtest/base_2213500
-rw-r--r-- 3 mjohnson hdfs 72704 2016-03-27 13:37   /apps/hive/warehouse/acidtest/base_2213500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 436159 2016-03-27 13:37  /apps/hive/warehouse/acidtest/base_2213500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 219572 2016-03-27 13:37  /apps/hive/warehouse/acidtest/base_2213500/bucket_00002
[hive@server1 ~]$   
The end result of this example is that 305 files were consolidated into just 5 files. While 300 files will not impact NameNode performance, the consolidation will most likely improve query performance, as the Hive engine will have fewer files to scan when executing a query.
 Bibliography 
Hopefully, the example and source code supplied with this blog posting are sufficient to get you started with Hive Streaming and to avoid potential problems. In addition to this blog posting, some other useful references include:
 
  https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions  
  http://hortonworks.com/blog/adding-acid-to-apache-hive/  
  https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/hive-013-feature-transactions.html  
  http://www.slideshare.net/Hadoop_Summit/adding-acid-transactions-inserts-updates-a  
 
						
					

06-09-2016 04:07 PM
1 Kudo
							 Nice table -- what version of HDP did you base it on? 
						
					

11-29-2016 12:36 AM
A couple of comments:

1. The section on setting up a dual-homed network is correct, but misleading. Most people who set up dual-homed networks would expect to spread at least some of the load over the interfaces, but Hadoop code is just not network-aware in that sense. So it is *much* better to use bonding/link aggregation for network redundancy.

2. In this day and age, don't even think about using 1Gb ports. Use at least 2x10Gb ports. Cloud providers are *today* installing 50Gb networks to their servers - 2x25Gb or 1x50Gb. You're wasting a LOT of CPU if you don't give them enough bandwidth.
						
					