Member since
09-24-2015
32
Posts
60
Kudos Received
4
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1451 | 02-10-2017 07:33 PM |
 | 1747 | 07-18-2016 02:14 PM |
 | 4536 | 07-14-2016 06:09 PM |
 | 18941 | 07-12-2016 07:59 PM |
04-20-2021
07:33 PM
Thanks for such a nice and detailed blog. I am looking for a way to avoid duplicate records during Hive streaming. Can anybody please help me?
02-06-2017
06:59 PM
4 Kudos
Overview
Atlas provides powerful tagging capabilities which enable Data Analysts to identify all data sets containing specific types of data. The Atlas UI itself provides a powerful Tag based search capability which requires no REST API interaction. However, for those of you who need to integrate Tag based search with your data discovery and governance activities, this posting is for you. It explains how you can use the Atlas REST API to retrieve entity data based on a Tag name.
Before getting too deep into the Atlas Tag search examples, it is important to recognize that Atlas Tags are basically a form of Atlas type. If you invoke the REST API command “/api/atlas/types”, the current set of user defined Atlas Tags (CUSTOMER and SALES) appears in the summary output, interspersed between standard Atlas types such as ‘hive_table’, ‘jms_topic’, etc., as shown below:
"count": 35,
"requestId": "qtp1177377518-81 - c7d4a853-02a0-4a1e-9b50-f7375f6e5f08",
"results": [
"falcon_feed_replication",
"falcon_process",
"DataSet",
"falcon_feed_creation",
"file_action",
"hive_order",
"Process",
"hive_table",
"hive_db",
…
"Infrastructure",
"CUSTOMER",
"Asset",
"storm_spout",
"SALES",
"hive_column",
…
]
In the rest of the article we will build on the Atlas types API to explore two different types of Tag based searches. Before going too far, it is important to note that the source code for the following examples is available through this repo.
Tag Search Example #1: Simple REST based Tag search
In our first Tag search example our objective is to return a list of Atlas data entities which have the query Tag name assigned. In this example, we are going to search our Atlas instance (‘server1’, port 21000) for all Atlas entities with a tag named CUSTOMER. You will want to replace CUSTOMER with an existing tag on your system.
Our Atlas DSL query to find the CUSTOMER tag using the ‘curl’ command is shown below:
curl -iv -u admin:admin -X GET "http://server1:21000/api/atlas/discovery/search/dsl?query=CUSTOMER"
The example above returns a list of the guids of the entities which have the Atlas Tag ‘CUSTOMER’ assigned on the Atlas host ‘server1’, port 21000. To run this query on your own cluster or on a sandbox, just substitute the Atlas server host, Atlas server port number, login information and your Tag name, and then invoke as shown above with curl (or with SimpleAtlasTagSearch.py, the Python example in the repo referenced at the end of this article).
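If you are substituting your own host, port and Tag name from a script rather than by hand, it helps to build the URL in one place and URL-encode the Tag name. The following is a minimal sketch under that assumption; the helper name `dsl_search_url` is hypothetical, and only the endpoint path comes from the curl example in this article.

```python
from urllib.parse import quote, urlencode


def dsl_search_url(host, port, tag_name):
    """Build the DSL search URL for a Tag name (hypothetical helper)."""
    # urlencode with quote_via=quote escapes spaces as %20 instead of '+'
    query = urlencode({"query": tag_name}, quote_via=quote)
    return "http://{0}:{1}/api/atlas/discovery/search/dsl?{2}".format(host, port, query)


print(dsl_search_url("server1", 21000, "CUSTOMER"))
```

The resulting string can be handed to curl or to any HTTP client.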
The output from this REST API query on my cluster is shown below:
{
    "count": 2,
    "dataType": {
        "attributeDefinitions": [
            …
        ],
        "typeDescription": null,
        "typeName": "__tempQueryResultStruct120"
    },
    "query": "CUSTOMER",
    "queryType": "dsl",
    "requestId": "qtp1177377518-81 - 624fc6b9-e3cc-4ab7-80ba-c6a57d6ef3fd",
    "results": [
        {
            "$typeName$": "__tempQueryResultStruct120",
            "instanceInfo": {
                "$typeName$": "__IdType",
                "guid": "806362dc-0709-47ca-af16-fac81184c130",
                "state": "ACTIVE",
                "typeName": "hive_table"
            },
            "traitDetails": null
        },
        {
            "$typeName$": "__tempQueryResultStruct120",
            "instanceInfo": {
                "$typeName$": "__IdType",
                "guid": "4138c963-b20d-4d10-b338-2c334202af43",
                "state": "ACTIVE",
                "typeName": "hive_table"
            },
            "traitDetails": null
        }
    ]
}
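To pull just the guids out of a response shaped like the one above, a few lines of Python are enough. This is a minimal sketch: `extract_guids` is a hypothetical helper name, and the sample dictionary is a trimmed copy of the output above, keeping only the fields the function reads.

```python
def extract_guids(search_response, active_only=True):
    """Return the entity guids found in a DSL search response."""
    guids = []
    for row in search_response.get("results", []):
        info = row.get("instanceInfo") or {}
        # Optionally skip entities that are not in the ACTIVE state
        if active_only and info.get("state") != "ACTIVE":
            continue
        if "guid" in info:
            guids.append(info["guid"])
    return guids


# Trimmed copy of the response shown above
sample_response = {
    "count": 2,
    "results": [
        {"instanceInfo": {"guid": "806362dc-0709-47ca-af16-fac81184c130",
                          "state": "ACTIVE", "typeName": "hive_table"}},
        {"instanceInfo": {"guid": "4138c963-b20d-4d10-b338-2c334202af43",
                          "state": "ACTIVE", "typeName": "hive_table"}},
    ],
}

print(extract_guids(sample_response))
```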
The results from this query can be thought of as having 3 sections:
Results header, where you can find the results count
Returned DataTypes
Results (list of entity guids)
For our purposes we are really only interested in the list of entities, so all you need to do is focus on extracting the important information from the .results jsonpath object in the returned json object. Looking at the results section, we observe that two entities have the CUSTOMER tag assigned, matching the count of 2 in the header. Both are active (not deleted) ‘hive_table’ entities; the second, for example, has the guid ‘4138c963-b20d-4d10-b338-2c334202af43’. We can now use the entity search capabilities to retrieve the actual entities, as described in the next example within this article.
Example #2: Returning details on all entities based on Tag assignment
The beauty of Example #1 is that we can build an entity list using a single REST API call. However, in the real world we will want access to details about the tagged entities. To accomplish this, we will need a programming interface such as Python, Java, Scala, bash or whatever your favorite tool is to pull the GUIDs and then perform entity searches.
For the purposes of this posting, we will use Python to illustrate how to perform more powerful Atlas Tag searches. The example below performs two kinds of Atlas REST API queries to build a json object containing the details, and not just the guids, of the entities with our Tag assigned.
import json
import requests

# Replace these with the values for your cluster
ATLAS_DOMAIN = "server1"
ATLAS_PORT = "21000"
TAG_NAME = "CUSTOMER"

def atlasGET(restAPI):
    url = "http://" + ATLAS_DOMAIN + ":" + ATLAS_PORT + restAPI
    r = requests.get(url, auth=("admin", "admin"))
    return json.loads(r.text)

# Query 1: find the guids of every entity carrying the tag
results = atlasGET("/api/atlas/discovery/search/dsl?query={0}".format(TAG_NAME))
entityGuidList = results['results']

# Query 2: fetch the full entity definition for each guid
entityList = []
for entity in entityGuidList:
    guid = entity['instanceInfo']['guid']
    entityDetail = atlasGET("/api/atlas/entities/{0}".format(guid))
    entityList.append(entityDetail)

print(json.dumps(entityList, indent=4, sort_keys=True))
The output from this script is now available for more sophisticated data governance and data discovery projects.
Atlas Tag Based Search Limitations
As powerful as both the Atlas UI and Atlas REST API Tag based searches are, there are some limitations to be aware of:
Atlas supports searching on only one Tag at a time.
It is not possible to include other entity properties in Tag searches.
The Atlas REST API used for Tag searches can only return a list of GUIDs.
It is not possible to search on Tag attributes.
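The one-Tag-at-a-time limitation can be worked around on the client side by running one search per Tag and intersecting the returned guids. The sketch below illustrates the idea only and is not an Atlas feature: `guids_with_all_tags` is a hypothetical helper, and the two sample responses are made-up data in the shape of the DSL search output (the SALES result is invented for illustration).

```python
def guids_with_all_tags(responses):
    """Intersect the guids from several single-Tag search responses."""
    guid_sets = [
        {row["instanceInfo"]["guid"] for row in response.get("results", [])}
        for response in responses
    ]
    if not guid_sets:
        return set()
    return set.intersection(*guid_sets)


# Illustrative responses, one per Tag search
customer_response = {"results": [
    {"instanceInfo": {"guid": "806362dc-0709-47ca-af16-fac81184c130"}},
    {"instanceInfo": {"guid": "4138c963-b20d-4d10-b338-2c334202af43"}},
]}
sales_response = {"results": [
    {"instanceInfo": {"guid": "4138c963-b20d-4d10-b338-2c334202af43"}},
]}

print(guids_with_all_tags([customer_response, sales_response]))
```

Each extra Tag costs one more REST call, so this approach suits small Tag sets better than broad discovery queries.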
12-23-2016
05:01 PM
Overview
Data Governance is unique for each organization, and every organization needs to track a different set of properties for their data assets. Fortunately, Atlas provides the flexibility to add new data asset properties to support your organization’s data governance requirements. The objective of this article is to describe the steps for using the Atlas REST API to add new properties to your Atlas Types.
Add a new Property for an existing Atlas Type
To simplify this article, we will focus on the 3 steps required to add a custom property to the standard Atlas type ‘hive_table’ and enable it for display. Following these steps, you should be able to modify the ‘hive_table’ Atlas Type and add custom properties which can be assigned values, viewed in the Atlas UI and searched.
To make the article easier to read, the JSON file is shown in small chunks. To view the full JSON file as well as other files used in the research for this article, check out this repo.
Step 1: Define the custom property JSON
The most important step of this process is properly defining the JSON used to update your Atlas Type.
There are three parts to the JSON object we will pass to Atlas:
The header, which contains the type identifier and some other meta information required by Atlas
The actual new property definition
The required existing Atlas type properties
Defining the Header
Frankly, the header is just standard JSON elements which get repeated every time you define a new property. The only change we need to make to the header block shown below for each example is to set the ‘typeName’ JSON element properly. In our case we want to add a property defined for all Hive tables, so we have set the typeName to ‘hive_table’.
{"enumTypes": [],"structTypes": [],"traitTypes": [],"classTypes": [
{"superTypes": ["DataSet"],"hierarchicalMetaTypeName": "org.apache.atlas.typesystem.types.ClassType","typeName": "hive_table","typeDescription": null,
Keep in mind that all the JSON elements shown above pertain to the Atlas type which we plan to modify.
Define the new Atlas Property
For this example, we are adding a property called ‘DataOwner’ which we intend to contain the owner of the data from a governance perspective. For our purposes, we have the following requirements:
Requirement | Attribute Property | Assignment |
---|---|---|
The property is searchable | isIndexable | True |
The property will contain a string | datatype | String |
Not all Hive tables will have an owner | Multiplicity | Optional |
A Data owner can be assigned to multiple Hive tables | isUnique | false |
Based on the above requirements, we end up with a property definition as shown below:
{"name": "DataOwner","dataTypeName": "string","multiplicity": "optional","isComposite": false,"isUnique": false,"isIndexable": true,"reverseAttributeName": null},
As shown in the file, it is possible to define multiple properties at one time, so take your time and try to define all of the properties at once.
Make certain you include the existing Properties
An annoying thing about the Atlas v1 REST API is the need to include some of the other key properties in your JSON file. For this example, which was running on HDP 2.5.3, I had to define a bunch of properties. And every time you add a new custom property, it is necessary to include the existing custom properties in your JSON as well. If you check out the JSON file used for this example, you will find a long list of properties which are required as of HDP 2.5.0.
Step 2: PUT the Atlas property update
We now have the full JSON request constructed with our new property requirements, so it is time to PUT the JSON file using the Atlas REST API v1. For the text of this article I am using ‘curl’ to make the example clearer, though in the associated repo Python is used to make life a little easier. To execute the PUT REST request we will first need to collect the following data elements:
Data Element | Where to find it |
---|---|
Atlas Admin User Id | This is a defined ‘administrative’ user for the Atlas System. It is the same user id which you use to log into Atlas. |
Atlas Password | The password associated with the Atlas Admin User Id |
Atlas Server | The Atlas Metadata Server. This can be found by selecting the Atlas server from Ambari and then looking in the summary tab. |
Atlas Port | It is normally 21000. Check the Ambari Atlas configs for the specific port in your cluster |
update_hive_table_type.json | This is the name of the JSON file containing our new Atlas property definition |
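With these data elements collected, the same PUT request can also be assembled in Python. The sketch below uses only the standard library and stops short of sending anything; `build_type_update_request` is a hypothetical helper, the endpoint and Content-Type header are taken from the curl command used in this step, and the admin/admin credentials and the tiny JSON body are placeholders.

```python
import base64
import urllib.request


def build_type_update_request(server, port, user, password, body):
    """Prepare (but do not send) the PUT request for an Atlas type update."""
    credentials = "{0}:{1}".format(user, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        url="http://{0}:{1}/api/atlas/types".format(server, port),
        data=body,  # bytes read from update_hive_table_type.json
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
    )


# Placeholder values; in practice read the body from the JSON file
request = build_type_update_request("server1", 21000, "admin", "admin", b"{}")
print(request.get_method(), request.full_url)

# To actually send it against a sandbox:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```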
curl -iv -d @update_hive_table_type.json --header "Content-Type: application/json" -u {Atlas Admin User Id}:{Atlas Password} -X PUT http://{Atlas Server}:{Atlas Port}/api/atlas/types
If all is successful, then we should see a result like the one shown below. The only thing you will need to verify in the result (other than the lack of any reported errors) is that the “name” element is the same as the Atlas type to which you are adding a new custom property.
{
"requestId": "qtp1177377518-235-fcf1c6f4-5993-49ac-8f5b-cdaafd01f2c0",
"types":
[ {
"name": "hive_table"
} ]}
However, if you are like me, then you will probably make a couple of mistakes along the way. To help you identify the root cause of your errors, here is a short list of errors and how to resolve them:
Error #1: Missing a necessary Atlas property for the Type
An error like the one shown below occurs because your JSON with the new custom property is missing an existing property.
{ "error": "hive_table can't be updated - Old Attribute stats:numRows is missing",
"stackTrace": "org.apache.atlas.typesystem.types.TypeUpdateException: hive_table can't be updated - Old Attribute stats:numRows is missing\n\tat
The solution to this problem is to add that property along with your custom property in your JSON file.
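One way to avoid this error is to start from the attribute definitions Atlas already holds for the type and append the new one, rather than hand-writing the full list. The sketch below assumes the existing definitions have already been fetched and parsed into a list of dicts. `build_type_update` is a hypothetical helper, the `attributeDefinitions` key is an assumption about where the attribute list sits in the v1 payload (the header excerpt in this article is cut off before that point), and the `tableType` sample entry is illustrative only.

```python
import json


def build_type_update(type_name, super_types, existing_attributes, new_attribute):
    """Combine existing attribute definitions with one new definition."""
    existing_names = {attr["name"] for attr in existing_attributes}
    if new_attribute["name"] in existing_names:
        raise ValueError("attribute already defined: " + new_attribute["name"])
    return {
        "enumTypes": [],
        "structTypes": [],
        "traitTypes": [],
        "classTypes": [{
            "superTypes": list(super_types),
            "hierarchicalMetaTypeName": "org.apache.atlas.typesystem.types.ClassType",
            "typeName": type_name,
            "typeDescription": None,
            # Assumed key name; existing attributes first, new one last
            "attributeDefinitions": list(existing_attributes) + [new_attribute],
        }],
    }


# Illustrative stand-in for the definitions fetched from Atlas
existing = [{"name": "tableType", "dataTypeName": "string", "multiplicity": "optional",
             "isComposite": False, "isUnique": False, "isIndexable": True,
             "reverseAttributeName": None}]
data_owner = {"name": "DataOwner", "dataTypeName": "string", "multiplicity": "optional",
              "isComposite": False, "isUnique": False, "isIndexable": True,
              "reverseAttributeName": None}

payload = build_type_update("hive_table", ["DataSet"], existing, data_owner)
print(json.dumps(payload, indent=4))
```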
If you are uncertain as to the exact definition of the missing property, then execute the Atlas REST API GET call shown below to list the properties of the Atlas Type you are currently modifying:
curl -u {Atlas Admin User Id}:{Atlas Password} -X GET http://{Atlas Server}:{Atlas Port}/api/atlas/types/hive_table
Error #2: Unknown datatype
An error occurred like the one below:
{ "error": "Unknown datatype: XRAY",
"stackTrace": "org.apache.atlas.typesystem.exception.TypeNotFoundException: Unknown
In this case, you have entered an incorrect Atlas Data Type. The allowed data types include:
byte
short
int
long
float
double
biginteger
bigdecimal
date
string
{custom types}
The {custom types} entry enables you to reference another Atlas type. So, for example, if you decide to create a ‘SecurityRules’ Atlas data type which itself contains a list of properties, you would just insert the SecurityRules type name as the property's data type.
Error #n: Incorrectly added a new Atlas property for a type and you need to delete it
This is the reason why you ALWAYS want to modify Atlas Types and Properties in a Sandbox developer region.
DO NOT EXPERIMENT WITH CUSTOMIZING ATLAS TYPES IN PRODUCTION!!!!! If you ignore this approach, which is standard in most organizations' SDLC, your solution is to delete the Atlas Service from within Ambari, re-add the service and then re-add all your data. Not fun.
Step 3: Check out the results
Our new custom Atlas ‘hive_table’ property is now visible in the Atlas UI for all tables. As the property was just defined for all ‘hive_table’ data assets, the value is null. Your next step, which is covered in the article Modify Atlas Entity properties using REST API commands, is to assign a value to the new property.
Bibliography
Atlas Rest API
Atlas Technical User Guide
Atlas REST API Search Techniques
Modify Atlas Entity properties using REST API commands
09-16-2018
12:41 AM
This article is really very useful but has a silly yet confusing (especially for HDP newbies) error: all occurrences of "Ranger user id" and "Ranger Admin Server" must be replaced by "Atlas User ID" and "Atlas Admin Server" respectively.
10-26-2016
01:12 AM
The article: Modify Atlas Entity properties using REST API commands contains a full description for how to update both the comment and description entity properties for Atlas managed hive_table types.
06-23-2017
11:54 AM
@mjohnson Thanks for the detailed explanation on updating entities. I have a question about the command you used to update the description of an entity. The command doesn't contain the actual string that needs to be replaced. Do we need to add it in the command while executing, something like the below?
http://server1:21000/api/atlas/entities/b78b5541-a205-4f9e-8b81-e20632a88ad5?property=description:"I get my answers from HCC"
Thanks
08-04-2016
10:37 PM
2 Kudos
Hive Streaming Compaction
This is the second part of the Hive Streaming Article series. In this article we will review the issues around compacting Hive Streaming files.
One of the results of ingesting data through Hive streaming is the creation of many small 'Delta' files. Left uncompacted you could run the risk of running into NameNode capacity problems. Fortunately, compaction functionality is part of Hive Streaming. The remainder of this Article reviews design considerations as well as commands necessary to enable and control compaction for your Hive tables.
Hive Compaction Design considerations
The Compaction process has a set of cleaner processes running in the
background during the ingest process looking for opportunities to
compact the delta files based on the rules you specify.
The first thing to keep in mind is that there are two forms of
Compaction: ‘minor’ and ‘major’. A ‘minor’ compaction will just
consolidate the delta files. This approach does not have to worry about
consolidating all of the delta files along with a large set of base
bucket files and is thus the least disruptive to the system resources.
A ‘major’ compaction consolidates all of the delta files just like the
‘minor’ compaction and, in addition, consolidates the delta files with
the base to produce a very clean physical layout for the Hive table.
However, major compactions can take minutes to hours and can consume a
lot of disk, network, memory and CPU resources, so they should be
invoked carefully.
To provide greater control over the compaction process and avoid
impacting other processes, in addition to the compactor configuration
options available, compaction can either be invoked automatically by
the cleaner threads or initiated manually when system load is low.
The primary compaction configuration triggers to review when
implementing or tuning your compaction processes are:
hive.compactor.initiator.on - Whether to run the compaction initiator
and cleaner threads on this metastore instance.
hive.compactor.cleaner.run.interval - Time in milliseconds between runs
of the cleaner thread.
hive.compactor.delta.num.threshold - Number of delta directories in
a table or partition that will trigger a minor compaction.
hive.compactor.delta.pct.threshold - Fractional size of
the delta files relative to the base that will trigger a
major compaction. 1 = 100%, so the default 0.1 = 10%.
hive.compactor.abortedtxn.threshold - Number of aborted transactions
involving a given table or partition that will trigger a major
compaction.
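To make the interplay of the three threshold settings concrete, here is a simplified sketch of the decision they describe. It mirrors only the rules as listed above and is not Hive's actual initiator code. The function name is hypothetical, and the default values of 10 deltas and 1000 aborted transactions are assumptions to be checked against your own hive-site.xml (the 0.1 default is the one stated above).

```python
def suggested_compaction(num_delta_dirs, delta_bytes, base_bytes, aborted_txns=0,
                         delta_num_threshold=10, delta_pct_threshold=0.1,
                         abortedtxn_threshold=1000):
    """Return 'major', 'minor' or None according to the threshold rules."""
    # hive.compactor.abortedtxn.threshold: many aborted transactions -> major
    if aborted_txns > abortedtxn_threshold:
        return "major"
    # hive.compactor.delta.pct.threshold: deltas large relative to base -> major
    if base_bytes > 0 and delta_bytes / base_bytes > delta_pct_threshold:
        return "major"
    # hive.compactor.delta.num.threshold: many delta directories -> minor
    if num_delta_dirs > delta_num_threshold:
        return "minor"
    return None


print(suggested_compaction(num_delta_dirs=300, delta_bytes=50, base_bytes=1000))
```

With 300 small delta directories whose total size is only 5% of the base, the sketch suggests a minor compaction; once the deltas grow past 10% of the base it suggests a major one.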
A Manual Hive Compaction example
In our example we have turned off automatic major compaction, as it should only run
during off-peak periods. We take a look at the delta files for our table
in HDFS and see that there are over 300 delta files and 5 base files.
[hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 13:17 /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500
-rw-r--r-- 3 mjohnson hdfs 482990 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500/bucket_00002
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500/bucket_00002_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2123501_2133500
-rw-r--r-- 3 mjohnson hdfs 482784 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2123501_2133500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2123501_2133500/bucket_00001_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2133501_2143500
-rw-r--r-- 3 mjohnson hdfs 482110 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2133501_2143500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2133501_2143500/bucket_00001_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2143501_2153500
-rw-r--r-- 3 mjohnson hdfs 476285 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2143501_2153500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2143501_2153500/bucket_00000_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2153501_2163500
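Counting delta directories by eye gets tedious with hundreds of entries. The following is a minimal sketch that tallies delta and base directories from the text of a recursive listing like the one above; `count_acid_dirs` is a hypothetical helper, and the sample text is a few lines copied from that listing.

```python
def count_acid_dirs(listing_text):
    """Count delta_* and base_* directories in 'hadoop fs -ls -R' output."""
    counts = {"delta": 0, "base": 0}
    for line in listing_text.splitlines():
        line = line.strip()
        # Directory entries start with 'd' in the permissions column
        if not line.startswith("d"):
            continue
        name = line.rsplit("/", 1)[-1]
        if name.startswith("delta_"):
            counts["delta"] += 1
        elif name.startswith("base_"):
            counts["base"] += 1
    return counts


sample_listing = """\
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 13:17 /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500
-rw-r--r-- 3 mjohnson hdfs 482990 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500/bucket_00002
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2123501_2133500
"""

print(count_acid_dirs(sample_listing))
```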
A decision has been made to run the major compaction manually
during the evening lull, so we execute the “ALTER TABLE {tablename} COMPACT
‘major’” command to place the compaction job into the queue for
processing. A compaction resource management queue was defined with a
limited resource quota, so the compaction will not impact other jobs.
hive> alter table acidtest compact 'major';
Compaction enqueued.
OK
Time taken: 0.037 seconds
hive> show compactions;
OK
Database Table Partition Type State Worker Start Time
default acidtest NULL MAJOR working server2.hdp-26 1459100244000
Time taken: 0.019 seconds, Fetched: 2 row(s)
hive> show compactions;
OK
Database Table Partition Type State Worker Start Time
Time taken: 0.016 seconds, Fetched: 1 row(s)
hive>;
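The Start Time column in the output above appears to be an epoch timestamp in milliseconds. A small sketch for turning it into a readable UTC time, assuming that interpretation (the function name is hypothetical):

```python
from datetime import datetime, timezone


def compaction_start_utc(start_time_ms):
    """Convert an epoch-milliseconds Start Time value to a UTC datetime."""
    return datetime.fromtimestamp(start_time_ms // 1000, tz=timezone.utc)


print(compaction_start_utc(1459100244000).isoformat())
```

For the value shown above this gives 17:37 UTC on 2016-03-27, which lines up with the 13:37 timestamps on the base directory in the HDFS listing if the cluster clock is four hours behind UTC.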
The outstanding table compaction jobs are visible by executing the
command “SHOW COMPACTIONS”, as illustrated in the example above. The
‘major’ compaction is also visible through the Applications history
log. After the ‘major’ compaction has completed, all of the delta files
available at the time the compaction was initiated will have been rolled up
into the ‘base’ files.
[hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 13:17 /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500
-rw-r--r-- 3 mjohnson hdfs 72704 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 436159 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 219572 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00002
[hive@server1 ~]$
The end result of this example is that 305 files were consolidated to just 5 files.
While 300 files will not impact NameNode performance, consolidating
them will most likely improve query performance, as the Hive engine will
have fewer files to scan to execute the query.
Bibliography
Hopefully, the example and source code supplied with this blog posting
are sufficient to get you started with Hive Streaming and avoid
potential problems. In addition to this blog posting some other
resources which are useful references include:
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
http://hortonworks.com/blog/adding-acid-to-apache-hive/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/hive-013-feature-transactions.html
http://www.slideshare.net/Hadoop_Summit/adding-acid-transactions-inserts-updates-a
06-09-2016
04:07 PM
1 Kudo
Nice table -- what version of HDP did you base it on?
11-29-2016
12:36 AM
A couple of comments: 1. The section on setting up a dual homed network is correct, but misleading. Most people who set up dual-homed networks would expect to spread at least some of the load over the interfaces, but Hadoop code is just not network aware in that sense. So it is *much* better to use bonding/link aggregation for network redundancy. 2. In this day and age, don't even think about using 1Gb ports. Use at least 2 10Gb ports. Cloud providers are *today* installing 50Gb networks to their servers - 2x25Gb or 1x50Gb. You're wasting a LOT of CPU if you don't give them enough bandwidth.