Member since
09-24-2015
32
Posts
60
Kudos Received
4
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
1458 | 02-10-2017 07:33 PM | |
1759 | 07-18-2016 02:14 PM | |
4565 | 07-14-2016 06:09 PM | |
18975 | 07-12-2016 07:59 PM |
02-10-2017
07:33 PM
1 Kudo
There are two ways to setup an existing taxonomy into Atlas; 1.) If you have Hive, then run the import-hive.sh utility. This utility scans all of the hive tables and creates the matching Atlas entries. For 2.5.3, you will find the utility at ./2.5.3.0-37/atlas/hook-bin/import-hive.sh 2.) You can also add entries and update properties using the REST API. One article describing this is available at: https://community.hortonworks.com/content/kbentry/74064/add-custom-properties-to-existing-atlas-types-in-s.html
... View more
02-06-2017
06:59 PM
4 Kudos
Overview
Atlas provides powerful
Tagging capabilities which Data Analysts to identify all data sets containing
specific types of data. The Atlas UI
itself provides a powerful Tag based search capability which require no REST
API interaction. However, for those of
you out there who need to integrate Tag based search with some of their data
discovery and governance activities, this posting is for you. Within this posting are some instructions
regarding how you can use the Atlas REST API to retrieve entity data based on a
TAG name.
Before getting too deep into the Atlas Tag search examples it is important to recognize that Atlas Tags are basically a form of an Atlas type. If you invoke the REST API command “/api/atlas/types”, in the summary output below interspersed between standard Atlas types such as ‘hive_table’, ‘jms_topic’, etc., will be the current set of user defined Atlas Tags (CUSTOMER & SALES) as shown below: "count": 35,
"requestId": "qtp1177377518-81 - c7d4a853-02a0-4a1e-9b50-f7375f6e5f08",
"results": [
"falcon_feed_replication",
"falcon_process",
"DataSet",
"falcon_feed_creation",
"file_action",
"hive_order",
"Process",
"hive_table",
"hive_db",
…
"Infrastructure",
"CUSTOMER",
"Asset",
"storm_spout",
"SALES",
"hive_column",
…
]
In the rest of the article we will expand on the Atlas types API to explore how we can perform two different types of TAG based searches. Before going too far it is important to note that the source code for the following examples are available through this repo. Tag Search Example #1: Simple REST based Tag Based Search example
In our first Tag search example our objective is to return a
list of Atlas Data Entities which have the query TAG name assigned. In this example, we are going to search our atlas
instance on (‘server1’ port 21000) for all Atlas entities with a tag named
CUSTOMER. You will want to replace
CUSTOMER with an existing tag on your system.
Our Atlas DSL query to find the CUSTOMER tag using the ‘curl’
command is as shown below: curl -iv -u admin:admin -X GET http://server1:21000/api/atlas/discovery/search/dsl?query=CUSTOMER
The example above returns a list of the entity guids which
have the Atlas Tag ‘CUSTOMER’ defined to the Atlas host ‘server1’ on port
21000. To run this query on your own
cluster or on a sandbox just substitute the Atlas Server Host URL, Atlas Server
Port number, login information and your Tag name and then invoke as shown above
with curl (or SimpleAtlasTagSearch.py in the Python example in the referenced
Repo at the end of this article).
An output from this REST API query on my cluster is shown
below: {
"count":
2,
"dataType": {
"attributeDefinitions": [
…
],
"typeDescription": null,
"typeName": "__tempQueryResultStruct120"
},
"query":
"CUSTOMER",
"queryType": "dsl",
"requestId": "qtp1177377518-81 -
624fc6b9-e3cc-4ab7-80ba-c6a57d6ef3fd",
"results":
[
{
"$typeName$": "__tempQueryResultStruct120",
"instanceInfo": {
"$typeName$": "__IdType",
"guid": "806362dc-0709-47ca-af16-fac81184c130",
"state": "ACTIVE",
"typeName":
"hive_table"
},
"traitDetails": null
},
{
"$typeName$": "__tempQueryResultStruct120",
"instanceInfo": {
"$typeName$": "__IdType",
"guid":
"4138c963-b20d-4d10-b338-2c334202af43",
"state": "ACTIVE",
"typeName": "hive_table"
},
"traitDetails": null
}
]
}
The results from this query can be thought of having 3
sections:
results header where you can find the results
count
Returned DataTypes
Results (list of entity guids)
For our purposes we are really only interested in the list
of entities, so all you need to do is focus on extracting the important
information from the .results jsonpath object in the return json object. Looking at the results section we observe
that only one entity has the CUSTOMER tag assigned. This entity located by the search has the
guid assigned of ‘4138c963-b20d-4d10-b338-2c334202af43’ we see is an active
entity (not deleted). We can now use the
entity search capabilities to retrieve the actual entity as described in the
next example within this article. Example #2: Returning details on all entities based on Tag assignment
The beauty of Example #1 is we can build an entity list
using a single REST API call. However,
for the real world we will want access to details about the assigned
entities. To accomplish this, we will
need a programming interface such as Python, Java, Scala, bash what your
favorite tool is, etc. to pull the GUIDs and then perform entity searches.
For the purposes of this posting, we will use Python to
illustrate how to perform more powerful Atlas Tag searches. The example below performs two Atlas REST API
queries to build a json object containing the details and not just guids for
the entities with our Tag assigned. def atlasGET( restAPI ) :<br>
url = "http://"+ATLAS_DOMAIN+":"+ATLAS_PORT+restAPI<br>
r= requests.get(url, auth=("admin", "admin"))return(json.loads(r.text));
results = atlasGET("/api/atlas/discovery/search/dsl?query={0}".format(TAG_NAME))
entityGuidList = results['results']
entityList = [] for entity in entityGuidList:
guid = entity['instanceInfo']['guid']
entityDetail = atlasGET("/api/atlas/entities/{0}".format(guid))
entityList.append(entityDetail);
print json.dumps(entityList, indent=4, sort_keys=True)
The output from this script is now available for more
sophisticated data governance and data discovery projects. Atlas Tag Based Search Limitations
As powerful as both the Atlas UI and Atlas REST API Tag
based searches are, there are some limitations to be aware:
Atlas supports only searching on one TAG at a
time.
It is impossible to include other entity
properties in the TAG searches
The Atlas REST API used for TAG searches can
only return a list of GUIDs.
It is not possible to search for TAG attributes
... View more
Labels:
12-23-2016
05:01 PM
Overview Data Governance is unique for each organization and every
organization needs to track a different set of properties for their data
assets. Fortunately, Atlas provides the
flexibility to add new data asset properties to support your organization’s
data governance requirements. The objective
for this article is to describe the steps utilizing the Atlas REST API to add
new Atlas properties to your Atlas Types. Add a new Property for an existing Atlas Type To simplify this article, we will focus in on the 3 steps
required to add and enable for display a custom property to the standard Atlas
property ‘hive_table’. Following these
steps, you should be able to modify the ‘hive_table’ Atlas Type and add custom
properties which are available to enter values, view in the Atlas UI and search. To make the article easier to read the JSON file is shown in
small chunks. To view the full JSON file
as well as other files used to research for this article, check out this repo. Step 1: Define the custom property JSON The most import step of this process is properly defining
the JSON used to update your Atlas Type.
There are three parts to the JSON object we will pass to Atlas; The header – contains the type identifier and
some other meta information required by Atlas The actual new property definition The required existing Atlas type properties Defining the Header Frankly, the header is just standard JSON elements which get
repeated every time you define a new property.
The only change we need to make to the header block shown below for each
example is to get the ‘typeName’ JSON element properly set. In our case as shown below we want to add a
property defined for all Hive tables so we have correctly defined the typeName
to be ‘hive_table’. {"enumTypes": [],"structTypes": [],"traitTypes": [],"classTypes": [
{"superTypes": ["DataSet"],"hierarchicalMetaTypeName": "org.apache.atlas.typesystem.types.ClassType","typeName": "hive_table","typeDescription": null, Keep in mind that all the JSON elements shown above pertain
to the Atlas type which we plan to modify. Define the new Atlas Property For this example, we are adding a property called ‘DataOwner’
which we intend to contain the owner of the data from a governance
perspective. For our purposes, we have
the following requirements:
Requirement
Attribute Property
Assignment
The property is searchable
isIndexable
True
The property will contain a string
datatype
String
Not all Hive tables will have an owner
Multiplicity
Optional
A Data owner can be assigned to multiple Hive tables
isUnique
false
Based on the above requirements, we end up with a property
definition as shown below: {"name": "DataOwner","dataTypeName": "string","multiplicity": "optional","isComposite": false,"isUnique": false,"isIndexable": true,"reverseAttributeName": null}, When defining Atlas properties, you can as shown in the file, it is possible to define multiple properties at one time, so take your
time and try and define all of the properties at once. Make certain you include an existing Properties An annoying thing about the Atlas v1 REST api is the need to
include some of the other key properties in your JSON file. For this example, which was running on HDP
2.5.3 I had to define a bunch of properties.
And every time you add a new custom property it is necessary to include
those custom properties in your JSON. If
you check out the file JSON file used for this example you will find a
long list of properties which are required as of HDP 2.5.0. Step 2: PUT the Atlas property update We now have the full JSON request constructed with
our new property requirements. So it is
time to PUT the JSON file using the ATLAS REST API v1. For the text of this article I am using ‘curl’
to make the example clearer, though for the associated repo python is
used to make life a little easier. To execute the PUT REST request we will first need to
collect the following data elements:
Data Element
Where to find it
Atlas Admin User Id
This is a defined ‘administrative’ user for the Atlas
System. It is the same user id which
you use to log into Atlas.
Atlas Password
The password associated with Atlas Admin User Id
Atlas Server
The Atlas Metadata Server.
This can be found by selecting the Atlas server from Ambari and then
looking in the summary tab.
Atlas Port
It is normally 21000.
Check the Ambari Atlas configs for the specific port in your cluster
Update_hive_table_type.json
This is the name of the JSON file containing our new Atlas
property definition
curl -ivH -d @update_hive_table_type.json
--header "Content-Type: application/json" -u {Atlas Admin User Id}:{Atlas Password} -X PUT http://{Atlas Server}:{Atlas Port}/api/atlas/types If all is successful, then we should see a result like that
which is shown below. The only thing you
will need to verify in the result (other than the lack of any reported errors)
is that then “name” element is the same as the Atlas type to which you are adding
a new custom property. {
"requestId": "qtp1177377518-235-fcf1c6f4-5993-49ac-8f5b-cdaafd01f2c0",
"types":
[ {
"name": "hive_table"
} ]} However, if you are like me, then you probably will make a
couple of mistakes along the way. To
help you identify root cause for your errors, here is a short list of errors
and how to resolve them: Error #1: Missing a necessary Atlas property for the Type An error encountered like shown below is because your JSON
with the new custom property is missing an existing property. { "error":
"hive_table can't be updated - Old Attribute stats:numRows is
missing",
"stackTrace":
"org.apache.atlas.typesystem.types.TypeUpdateException: hive_table can't
be updated - Old Attribute stats:numRows is missing\n\tat The solution to fix this problem is to add that property along
with your custom property in your JSON file.
If you are uncertain as to the exact definition for the property, then
execute the execute Atlas REST API GET call as shown below to list out the
Atlas Type you are currently modifying properties: curl -H –u
{Atlas Admin User id}:{Atlas password}-X GET http://{Atlas
Server}/api/atlas/types Error #2: Unknown datatype: An error occurred like the one below: { "error":
"Unknown datatype: XRAY",
"stackTrace":
"org.apache.atlas.typesystem.exception.TypeNotFoundException: Unknown In this case, you have entered an incorrect Atlas Data Type. The allowed for data types include: byte short int long float double biginteger bigdecimal date string {custom types} The {custom types} enables you to reference another Atlas
type. So for example you decide to
create a ‘SecurityRules’ Atlas data type which itself contains a list of
properties, you would just insert the SecurityRules type name as the property. Error #n: Added incorrectly a new Atlas property for a type and you need to
delete it This is the reason why you ALWAYS want to modify Atlas Types
and Properties in a Sandbox developer region.
DO NOT EXPERIMENT WITH CUSTOMING ATLAS TYPES IN PRODUCTION!!!!! If you
ignore this standard approach in most organizations SLDC, your solution is to
delete the Atlas Service from within Ambari, re-add the service and then re-add
all your data. Not fun. Step 3: Check out the results As we see above, our new custom Atlas ‘hive_table’ property
is now visible in the Atlas UI for all tables.
As the property was just defined for all ‘hive_table’ data assets the
value is null. Your next step which is
covered in the Article Modify Atlas Entity properties using REST API commands is to assign a value the new property. Bibliography Atlas Rest API Atlas Technical User Guide Atlas REST API Search Techniques Modify Atlas Entity properties using REST API commands
... View more
Labels:
10-26-2016
07:44 AM
9 Kudos
Overview The whole purpose of
that list is to be able to search through all the entities contained within
your data lake and identify specifically those entities which will lead your
analysis. A Common complaint with the idea of the data lake is that you load
too many files into the data lake and it becomes unwieldy. Atlas addresses this
problem by providing powerful search tools to identify all of the Data entities
located within the data lake. The purpose of this article is to explore the
four Entity search techniques available within Atlas. Atlas search options compared The following chart
summarizes the capabilities for each of the four atlas entity search techniques
explored in this article. The sub sections below in this article we’ll go into
more detail on how to set up the search techniques.
Attribute
Entity Search
Qualified Name
Search
DSL Search
Full Text Search
Can identify multiple entities
Yes
No
Yes
Yes
Supports filtering
No
Yes
Yes
Yes
Free text
No
No
No
Yes
Ability to search on sub-attributes
No
No
Yes
No
Primary Value
Listing all Entities of a given type
Retrieve a specific entity record
Complex searches of Entities
Locating records primarily based on name, Comment and
description fields
To simplify this article we are going to focus on just the ‘hive_table’
type, though keep in mind that for all the search examples covered here, you
can use any Atlas type. Preparations for the Search Examples To support the search
examples first were going to need to gather the following pieces of information:
Configuration Property
Description
Where to find it
Ranger user id
The user id used to log into the Ranger UI screen. This login will be for a Ranger
Administrative user with a default id of ‘admin’
Password
The matching password to the Ranger User id. For this article I will be using the Ranger
default password ‘admin’
Ranger Admin Server
The Administrative server for Ranger. For this article, I will be using the FQDN ‘server1.hdp’
and also assuming the default port 21000.
Go into Ambari, select the Ranger than QuickLinks. The host name for the Ranger UI is the
proper value for this parameter
GUID
A unique Entity identifier.
Found in many of this article’s search result responses.
Fully Qualified Table Name
A unique identifier for Hive tables.
Is created by concatenating the following: {database
name}.{table name}@{cluster Name}. So,
if you had a database called ‘transports’ and a table named ‘drivers’ on a
cluster named HDP, then the fully qualified table name would be ‘transports.drivers@HDP’.
For all examples contained within this article, to avoid
creating overly long result listings, you will often see a “…” marker in the
middle of a list. When you see this know
that my intent was to spare you excessive scrolling when reviewing the result
set. All of the examples are presented in this article using the
curl command as that is commonly available in Unix (Linux & Mac) based systems. For Python examples, you can find more
detailed examples through this articles matching GitHub
repository. Atlas Entity Search Example The Atlas Entity
Search technique is the simplest of all of those explored in this article. It’s entire purpose is to retrieve all Entities
of the specified type with no additional filtering enabled. To retrieve a JSON list containing all the entities you will
use the REST API command: curl -iv -u {Ranger userId:Password} -X GET http://{Ranger
Admin Server}:21000/api/atlas/entities?type=hive_table Now let’s take a look at an actual example: curl -iv -u admin:admin -X GET http://server1:21000/api/atlas/entities?type=hive_table In this example, we see as shown below the response contains
two parts; (1) A header set containing an element count and RequestID, and (2)
A list of GUIDs associated with all the Hive tables known to Atlas. { "count": 26,
"requestId": "qtp1783047508-5870 - 867273a6-10e1-4c65-8da2-08a06b89d005",
"results": [
"d3d637c5-df7e-4311-adc3-bcc6c4b81fb1",
"cdbbd999-f789-4d2e-9127-b2443209b3b7",
"848c05fa-f2d9-4482-8892-d4b4fc137ee6",
…
"03e38f24-577a-45ae-b67f-54a3e34f34ce",
"4945f76b-6403-483a-b9aa-3161ce3e4bd6",
"90d76c28-911e-425b-845f-5e1096eed3bb",
"9571fb0e-52f5-4c16-a8f1-d2cf5138824c" ],
"typeName": "hive_table"
} Notice in the example above we find that there are 26 hive
tables. Now unfortunately this sort of output by itself is not useful, so this
API should be considered as one component in the entity retrieval. To actually see the entity full details you
would want to run the REST call below using one of the GUID from the above response: curl -iv -u admin:admin -X GET http://{Ranger
Admin Server}:21000/api/atlas/entities/{GUID} Atlas Qualified Name Search Often we will only
want to look at one end of the record. In these cases, we can use the atlas
qualified names search to retrieve a single entity Record. In Atlas for Hive
tables the qualified name represents the concatenation of the database name the
table name and finally the cluster team. curl -iv -u {Ranger userId:Password} -X GET http://{Ranger
Admin Server}::21000/api/atlas/entities?type=hive_table&property=qualifiedName&value={Fully
Qualified Table Name} A result set from this query would cover all of the metadata
captured by Atlas for the Fully Qualified Table Name as shown below (sorry
showing it all so you will have to scroll): {
"definition": {
"id": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [
"TLC" ],
"traits": {
"TLC": {
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Struct",
"typeName": "TLC",
"values": {} } },
"typeName": "hive_table",
"values": {
"aliases": null, "columns":
[ {
"id": {
"id": "1690ccc2-d7be-45af-becb-c6b360a1a30f",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_column",
"values": {
"comment": null,
"description": null, "name":
"driverid",
"owner": "hive",
"qualifiedName": "default.drivers.driverid@HDP",
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0 },
"type": "varchar(15)" } }, {
"id": {
"id": "249a7ce3-6b19-418e-9094-7d8a30bc596f",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [
"CARRIER" ],
"traits": {
"CARRIER": {
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Struct",
"typeName": "CARRIER",
"values": {}
} },
"typeName": "hive_column",
"values": {
"comment": null,
"description": null,
"name": "companyid",
"owner": "hive",
"qualifiedName": "default.drivers.companyid@HDP",
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(15)" } }, {
"id": {
"id": "d3b9557a-5ad0-4585-a9af-e1fed24569fc",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_column",
"values": {
"comment": null,
"description": null, "name":
"customer",
"owner": "hive",
"qualifiedName": "default.drivers.customer@HDP",
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(40)" } }, {
"id": {
"id": "143479a3-be79-4f04-b649-4a09b5429ace",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_column",
"values": {
"comment": null,
"description": null,
"name": "drivername",
"owner": "hive",
"qualifiedName": "default.drivers.drivername@HDP", "table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(75)" } }, {
"id": {
"id": "6c3123a9-0d09-490b-840d-6cc012ab69e0",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE", "typeName":
"hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [], "traits": {},
"typeName": "hive_column",
"values": {
"comment": null,
"description": null,
"name": "yearsdriving", "owner": "hive",
"qualifiedName": "default.drivers.yearsdriving@HDP",
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "int" } }, {
"id": {
"id": "a419ed9f-df56-41cc-90bc-1c00a4d3c428",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_column",
"values": {
"comment": null,
"description": null,
"name": "riskscore",
"owner": "hive",
"qualifiedName": "default.drivers.riskscore@HDP", "table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(5)" } } ],
"comment": null,
"createTime": "2016-10-11T17:11:11.000Z",
"db": {
"id": "332189cc-d994-44c2-8f87-29a28a471434",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_db",
"version": 0 },
"description": "\"I get my answers from
HCC\"",
"lastAccessTime": "2016-10-11T17:11:11.000Z", "name": "drivers",
"owner": "hive",
"parameters": {
"COLUMN_STATS_ACCURATE":
"{\"BASIC_STATS\":\"true\"}",
"EXTERNAL": "TRUE",
"numFiles": "1",
"numRows": "4278",
"rawDataSize": "1967880",
"totalSize": "68597",
"transient_lastDdlTime": "1476205880" },
"partitionKeys": null,
"qualifiedName": "default.drivers@HDP", "retention":
0,
"sd": {
"id": {
"id": "36166469-1014-4645-98a6-9df34b37a145",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_storagedesc",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference", "traitNames": [],
"traits": {},
"typeName": "hive_storagedesc",
"values": {
"bucketCols": null,
"compressed": false,
"inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
"location":
"hdfs://server1.hdp:8020/apps/hive/warehouse/drivers",
"numBuckets": -1,
"outputFormat":
"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
"parameters": null,
"qualifiedName": "default.drivers@HDP_storage",
"serdeInfo": {
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Struct", "typeName":
"hive_serde",
"values": {
"name": null,
"parameters": {
"serialization.format": "1"
},
"serializationLib":
"org.apache.hadoop.hive.ql.io.orc.OrcSerde"
} },
"sortCols": null,
"storedAsSubDirectories": false,
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE", "typeName":
"hive_table",
"version": 0 } } },
"tableType": "EXTERNAL_TABLE",
"temporary": false,
"viewExpandedText": null, "viewOriginalText": null } },
"requestId": "qtp1783047508-5870 -
4f0cc268-7fdc-49ac-a585-74f486ee3786" You thing you will immediately notice is the output, like
most JSON outputs is hierarchical. For
qualified name searches as well as most of the others, it is only possible to
search on the top levels. The one
exception to this are Atlas Entity DSL searches which are covered in the next
section. Atlas Entity DSL search Example The Atlas Entity DSL search is by far the most powerful of
the options covered in this article. The DSL search enables selection of Entities
based on a combination of properties, Atlas types, as well as sub properties within
the entity definition. In addition, the DSL search allows to select only those
properties desired for display so you don’t have to output monstrous documents. DSL Search Example #1: SIMPLE DSL SEARCH BASED ON TABLENAME ONLY This example is similar to the fully qualified table name
example described earlier in this article. The primary difference, as shown
below is the ability to search only on the table name and allow the data base
and cluster name to stay as wildcards. curl -iv -u {Ranger userId:Password} -X GET http://{Ranger
Admin Server}:21000/api/atlas/discovery/search/dsl?query=hive_table+where+name='{TableName}’ To make the query above work it must be properly
encoded. As shown below the ‘+’
character is used as the spacing between keywords. { "count":
7,
"dataType": {
"attributeDefinitions": [ {
"dataTypeName": "hive_db",
"isComposite": false,
"isIndexable": false, "isUnique":
false,
"multiplicity": {
"isUnique": false,
"lower": 1,
"upper": 1 },
"name": "db",
"reverseAttributeName": null }, … }, "query":
"hive_table where name='drivers'",
"queryType": "dsl",
"requestId": "qtp1783047508-19 -
79fcb168-d047-4b70-9129-adebba09b323",
"results": [ { … } In the sample output above, we see that the sample query
identified seven entities where are name is equal to tablename and the atlas
type is a hive table. In addition, in the result header we see a list of each
of the attributes output as part of the result set. Starting with the JSON
element “query”, we see the query and the query type used. And then at the very
end of the results set is a section titled called results which will output the
detail for each entity found much as we saw earlier in the qualified cable name
search in one of the sections above. DLS Search Example #2: Limiting the number of rows output There are several DSL
search options which control things such as row limit, order by among others.
This section Will show how they can be used through the use of the row limit
option as shown below: curl -iv -u {Ranger userId:Password} -X GET http://{Ranger
Admin Server}:21000/api/atlas/discovery/search/dsl?query=hive_table+where+name='drivers'
limit 3 In the example above, we see the account of entities founded
and returned is equal to three which exactly matches the limit specified in the
query shown above. The contents of the attribute definitions as well as the
resulting Entity properties is the same as the other DSL searches covered in
this article. { "count":
3,
"dataType": {
"attributeDefinitions": [ {
"dataTypeName": "hive_db", "isComposite": false,
"isIndexable": false,
"isUnique": false,
"multiplicity": {
"isUnique": false,
"lower": 1,
"upper": 1 },
"name": "db",
"reverseAttributeName": null },… DSL Search Example #3: DSL SEARCH BASED ON TABLENAME ONLY DEMONSTRATING
DISPLAYING ONLY SELECT PROPERTIES As the prior examples
illustrate, The DSL search options can how put a lot of data, but much of it useless for your analyses. To constrain the
types of properties returned the atlas DSL search has the ‘select’ keyword to
limit the properties actually output. The following example illustrates how to
invoke the ‘select’ option. In the example below, the output is constrained to just the
properties “tableType, temporary, retention, qualifiedName, description, name, owner,
comment, createTime” and reduces the
response document output size: curl -iv -u admin:admin -X GET
http://server1:21000/api/atlas/discovery/search/dsl?query=hive_table+where+name='drivers'
select
tableType,temporary,retention,qualifiedName,description,name,owner,comment,createTime
limit 1 The result from the query above is shown in the listing
below: { "count":
1,
"dataType": {
"attributeDefinitions": [ {
"dataTypeName": "string",
"isComposite": false,
"isIndexable": false,
"isUnique": false,
"multiplicity": { "isUnique": false,
"lower": 0,
"upper": 1 },
"name": "tableType",
"reverseAttributeName": null }, … }, "query":
"hive_table where name='drivers' select tableType,temporary,retention,qualifiedName,description,name,owner,comment,createTime
limit 1",
"queryType": "dsl",
"requestId": "qtp1783047508-21 -
347d7796-cce3-4eb6-8a40-59cedc07433a",
"results": [ {
"$typeName$": "__tempQueryResultStruct48",
"comment": null,
"createTime": "2016-10-11T16:50:03.000Z",
"description": "\"Try this value again\"",
"name": "drivers",
"owner": "hive",
"qualifiedName": "default.drivers@HDP",
"retention": 0,
"tableType": "EXTERNAL_TABLE",
"temporary": false } ]} In the example query above, only one entity was identified
and as you can see in the results block only those property names specified
after the select option art display rather than the long hierarchically deep
output we’ve seen in prior examples. DSL Search Example #4: DSL SEARCH FOR ENTITIES CONTAINING COLUMN NAME As nice as limiting
output to just the first-order fields may seem, often it is necessary to query
on some arrays of sub-properties in the entity definition. One complex property often needed in data
discovery queries in the ‘column’ property.
If you want to search for all hive_tables where a specific field resides,
possibly one used often as a foreign key, then you should check out this
example: curl -iv -u admin:admin -X GET http://server1:21000/api/atlas/discovery/search/dsl?query=hive_table%2C+columns+where+name%3D%27tweet_id%27 In the DSL query above, you will see the encoded string %2C
(‘,’) following after the data type.
That comma is indicating that we want to look at the values assigned to
the property name immediately following…which in this case is the property
named ‘columns’. Specifically, the
example above is requesting all Hive tables in which there exists a column
named ‘tweet_id’. The use of the ’+’ to
separate keywords is just as it has been for the other dsl queries, however,
for the query of the sub-property it must be entirely encoded as you can see in
the example above. The result from the above example is: { "count":
1,
"dataType": {
"attributeDefinitions": [ {… ], "hierarchicalMetaTypeName":
"org.apache.atlas.typesystem.types.ClassType",
"superTypes": [
"DataSet" ],
"typeDescription": null,
"typeName": "hive_column" }, "query":
"hive_table, columns where name='tweet_id'",
"queryType": "dsl",
"requestId": "qtp1783047508-5868 -
3db60f55-8e7b-409b-9a6f-2abea20b8371",
"results": [ {
"$id$": {
"$typeName$": "hive_column",
"id": "b32fc0ab-2f66-4ad1-9728-ccd1d48dbf32",
"state": "ACTIVE",
"version": 0 },… As we see in the output above, only 1 entity was found with
a columname equal to ‘tweet_id’. You
will also note in the example above that the output is the exact same for all
of the hive_table types covered so far in this article. If you wanted to you could make this query
more powerful to return a limit of values and only include a select set of properties
in the results output. Atlas Full Text Search Example Our last type of query will free search the top level entity
properties for the query string passed. The
example below for example, is looking for any top level property which contains
the string ‘sku’. curl –iv -u admin:admin -X GET http://server1:21000/api/atlas/discovery/search/fulltext?query=sku The search results for this query option are similar to the
entity list option explored at the start of this article in that it only lists
the entitys’ GUID where the query is equal to the text supplied (‘sku’ in this example). One other interesting output property is the
score. The ‘score’ property is an
attempt to evaluate the search output quality as you find in many search
engines. The output below shows a sample
response from the query above: { "count":
4, "query":
"sku",
"queryType": "full-text",
"requestId": "qtp1783047508-5868 -
5270285e-5bf4-43e5-82ed-46dc8901c461",
"results": [ {
"guid": "d37cfeab-afbc-41d0-8dda-71da29043bc3",
"score": 0.7805047, "typeName":
"hive_process" }, {
"guid": "b9b7dae2-775b-4c95-82b6-9d0ab2097a2c",
"score": 0.65042055,
"typeName": "hive_process" }, {
"guid": "c22f07fd-7522-4f37-97eb-96e3a9bc16bc",
"score": 0.6008328,
"typeName": "hive_column" }, {
"guid": "f23097c3-d836-4543-99d6-b08f6cdcc97a",
"score": 0.6008328,
"typeName": "hive_column" } ]} Bibliography:
Atlas Search
Examples in Python Atlas REST Search API
... View more
Labels:
10-26-2016
01:12 AM
The article: Modify Atlas Entity properties using REST API commands contains a full description for how to update both the comment and description entity properties for Atlas managed hive_table types.
... View more
10-19-2016
01:22 PM
4 Kudos
Overview
This article reviews the steps necessary to update Hive entities within Atlas the Description and Comment fields. The 0.70 Atlas release will display and allow text searches on the ‘‘description’ field, but the Atlas UI does not at this time support the ability to manually enter those properties into a given data Asset.
Examined in this article includes:
Searching for a Hive_Table entity
Update a single property for the Hive_Table entity definition (“description”) The Problem:
In release 0.70, Atlas has the ability to monitor additions
as well as changes to Hive table and Hive columns. When Atlas identifies a new entry or change
the appropriate Metadata property is
updated for that entity. One very cool
aspect to Atlas is the ability to conduct either DSL or free text searches on
any properties set for the entity.
Anyone trying to identify datasets to support a specific analytic
activity will definitely appreciate the ability search through all of the
entities and quickly discover valuable data assets in the data lake without having
to relying on tribal knowledge.
For this Article we will update a specific table based on
its full qualified name and then assign a new description field to the table. The full source code for the examples covered
in this article on
GitHub. The code for this example is written in Python and there is a full set of instructions in the repository README.md file. Locating the Entity whose properties require updating
Now let’s assume that in our ‘HDP’ cluster within the ‘default’
database there exists a table named ‘drivers’.
For this table, our objective is to change the ‘description’ property
from its current value to a value of ‘I get my answers from HCC’. Entity property updates are made one at a
time, so our first step is to collect the Guid for our target table.
As this article is about the update of a property within an Hive_table Entity, we will limit the
search coverage to identifying a unique Hive_table. The query values for this example are:
Property
Value used in this article
Comments on how to change the provided values for your cluster.
Atas server FQDN
server1.hdp
Use your server's Atlas Metadataserver FQDN
entityType
hive_table
Can be any valid Atlas Type
database name
default
Specify your table's database name.
table name
drivers
This can be any Hive Table whose metadata is already in Atlas. The table name you provide must already exist on your specified cluster.
Cluster name
HDP
The name of your cluster
An Atlas entity can be any variety of types. The beauty of this architecture is the same
search steps are available whether seeking a table, a hive column, or some
other Atlas managed type. The format we
will use for this search example is:
HTTP://{Atlas server FQDN}:21000/api/atlas/entities?type={entitytype}&property=qualifiedName&value={databasename}.{table name}@{Cluster name}
So for our example, the exact REST query would be:
http://server1.hdp:21000/api/atlas/entities?type=hive_table&property=qualifiedName&value=default.drivers@HDP
The full result as shown below from this REST query will contain
the guid necessary for the update along with all of the hive_table’s metadata
information as shown below:
{
"definition": {
"id": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [
"TLC" ],
"traits": { "TLC": {
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Struct",
"typeName": "TLC",
"values": {} } },
"typeName": "hive_table", "values":
{
"aliases": null,
"columns": [ {
"id": {
"id": "1690ccc2-d7be-45af-becb-c6b360a1a30f",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_column",
"values": {
"comment": null,
"description": null,
"name": "driverid",
"owner": "hive",
"qualifiedName": "default.drivers.driverid@HDP", "table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(15)" } }, {
"id": {
"id": "249a7ce3-6b19-418e-9094-7d8a30bc596f",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE", "typeName":
"hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [ "CARRIER" ],
"traits": {
"CARRIER": {
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Struct",
"typeName": "CARRIER",
"values": {}
} },
"typeName": "hive_column",
"values": {
"comment": null,
"description": null,
"name": "companyid",
"owner": "hive",
"qualifiedName": "default.drivers.companyid@HDP",
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(15)" } }, {
"id": {
"id": "d3b9557a-5ad0-4585-a9af-e1fed24569fc",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_column", "values": {
"comment": null,
"description": null,
"name": "customer",
"owner": "hive",
"qualifiedName": "default.drivers.customer@HDP",
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(40)" } }, {
"id": {
"id": "143479a3-be79-4f04-b649-4a09b5429ace",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_column",
"values": {
"comment": null,
"description": null,
"name": "drivername",
"owner": "hive",
"qualifiedName": "default.drivers.drivername@HDP",
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(75)" } }, {
"id": {
"id": "6c3123a9-0d09-490b-840d-6cc012ab69e0",
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_column",
"values": {
"comment": null, "description": null,
"name": "yearsdriving",
"owner": "hive",
"qualifiedName": "default.drivers.yearsdriving@HDP",
"table": { "id":
"b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "int" } }, {
"id": {
"id": "a419ed9f-df56-41cc-90bc-1c00a4d3c428",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_column",
"version": 0 },
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {}, "typeName":
"hive_column",
"values": {
"comment": null,
"description": null,
"name": "riskscore",
"owner": "hive", "qualifiedName":
"default.drivers.riskscore@HDP",
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0
},
"type": "varchar(5)" } } ],
"comment": null,
"createTime": "2016-10-11T17:11:11.000Z",
"db": {
"id": "332189cc-d994-44c2-8f87-29a28a471434",
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_db",
"version": 0 }, "description":
"\"changeMe\"",
"lastAccessTime": "2016-10-11T17:11:11.000Z",
"name": "drivers",
"owner": "hive",
"parameters": {
"COLUMN_STATS_ACCURATE":
"{\"BASIC_STATS\":\"true\"}",
"EXTERNAL": "TRUE",
"numFiles": "1",
"numRows": "4278",
"rawDataSize": "1967880",
"totalSize": "68597",
"transient_lastDdlTime": "1476205880" },
"partitionKeys": null,
"qualifiedName": "default.drivers@HDP",
"retention": 0,
"sd": {
"id": {
"id": "36166469-1014-4645-98a6-9df34b37a145",
"jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_storagedesc",
"version": 0 }, "jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference",
"traitNames": [],
"traits": {},
"typeName": "hive_storagedesc",
"values": {
"bucketCols": null,
"compressed": false,
"inputFormat":
"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
"location":
"hdfs://server1.hdp:8020/apps/hive/warehouse/drivers",
"numBuckets": -1,
"outputFormat":
"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
"parameters": null,
"qualifiedName": "default.drivers@HDP_storage",
"serdeInfo": { "jsonClass":
"org.apache.atlas.typesystem.json.InstanceSerialization$_Struct",
"typeName": "hive_serde",
"values": {
"name": null,
"parameters": {
"serialization.format": "1"
},
"serializationLib":
"org.apache.hadoop.hive.ql.io.orc.OrcSerde"
} },
"sortCols": null,
"storedAsSubDirectories": false,
"table": {
"id": "b78b5541-a205-4f9e-8b81-e20632a88ad5",
"jsonClass": "org.apache.atlas.typesystem.json.InstanceSerialization$_Id",
"state": "ACTIVE",
"typeName": "hive_table",
"version": 0 } } },
"tableType": "EXTERNAL_TABLE",
"temporary": false,
"viewExpandedText": null,
"viewOriginalText": null } },
"requestId": "qtp511473681-34831 -
b088be5b-44e6-4a2c-bd4a-7beeb059cf4f"}
In the result set above, locate the "id" property value which is the GUID and the "description" property with the current value of "changeMe".
In this case we will use the REST query results definition.id.id
value of ‘b78b5541-a205-4f9e-8b81-e20632a88ad5’ to support our next REST query
to update the property value. We can
also see in the ‘description’ field which is highlighted in bold currently has
the value of “changeMe”. Updating an Entities Property value
Now that we have the GUID, it is time to update the ‘description’
property from ‘changeMe’ to ‘I get my answers from HCC’. The update entity property REST command requires the GUID from the prior search step. To update the property, we will use the POST entity Atlas REST Command rolling the url query format and include the string "I get my answers from HCC" in the post message payload: http://{Atlas
server FQDN}:21000/api/atlas/entities/{GUID from prior search operation}?property={atlas
property field name}
So to finish our example, with our payload containing the string "I get my answers from HCC", the actual query would be:
http://server1:21000/api/atlas/entities/b78b5541-a205-4f9e-8b81-e20632a88ad5?property=description
The result from the above command will be the current
Metadata definition for our drivers table in JSON format as shown below:
{…
"description": "\"I get my answers from
HCC\"",
"lastAccessTime": "2016-10-11T17:11:11.000Z",
"name": "drivers",
"owner": "hive",
"parameters": {
"COLUMN_STATS_ACCURATE":
"{\"BASIC_STATS\":\"true\"}",
"EXTERNAL": "TRUE",
"numFiles": "1",
"numRows": "4278",
"rawDataSize": "1967880",
"totalSize": "68597",
"transient_lastDdlTime": "1476205880"}
Now let's go take a look at the Atlas UI, and check on the description for the drivers table. As we see in the screen print below, the new description property value has been successfully changed:
Next Steps:
This article attempts to take a simple property change example to illustrate the techniques necessary to modify the Atlas Metadata for a given entity. After you have completely run through this example, so follow on activities to experiment with include:
Changing properties for different entity types such as Hive_column or any of the HBase types.
Attempt to change some of the other top level property fields.
Go through a list of Hive tables changing each 'description' property with the value from an external source. Resource Bibliography
Atlas Entity REST API:
Atlas Search grammar:
... View more
Labels:
08-08-2016
12:06 PM
@saswati sahu The link provided above (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing) by @Sindhu provides documentation for all aspects of standard open source (Hortonworks) Hive indexing. Can you let me know what are your specific concerns so we can try to more specifically address them?
... View more
08-04-2016
10:37 PM
2 Kudos
Hive Streaming Compaction
This is the second part of the Hive Streaming Article series. In this article we will review the issues around compacting Hive Streaming files.
One of the results of ingesting data through Hive streaming is the creation of many small 'Delta' files. Left uncompacted you could run the risk of running into NameNode capacity problems. Fortunately, compaction functionality is part of Hive Streaming. The remainder of this Article reviews design considerations as well as commands necessary to enable and control compaction for your Hive tables.
Hive Compaction Design considerations
The Compaction process has a set of cleaner processes running in the
background during the ingest process looking for opportunities to
compact the delta files based on the rules you specify.
The first thing to keep in mind is that there are two forms of
Compaction; ‘minor’ and ‘major’. A ‘minor’ compaction will just
consolidate the delta files. This approach does not have to worry about
consolidating all of the delta files along with a large set of base
bucket files and is thus the least disruptive to the system resources.
‘major’ compaction consolidates all of the delta files just like the
‘minor’ compaction and in addition it consolidates the delta files with
the base to produce a very clean physical layout for the hive table.
However, major compactions can take minutes to hours and can consume a
lot of disk, network, memory and CPU resources, so they should be
invoked carefully.
To provide greater control over the compaction process and avoid
impacting other processes in addition to the compactor configuration
options available, it is also possible to invoke compaction
automatically by the cleaner threads or manually initiated when system
load is low.
The primary compaction configuration triggers to review when
implementing or tuning your compaction processes are:
hive.compactor.initiator.on
hive.compactor.cleaner.run.interval
hive.compactor.delta.num.threshold - Number of delta directories in
a table or partition that will trigger a minor compaction.
hive.compactor.delta.pct.threshold - Percentage (fractional) size of
the delta files relative to the base that will trigger a
major compaction. 1 = 100%, so the default 0.1 = 10%.
hive.compactor.abortedtxn.threshold - Number of aborted transactions
involving a given table or partition that will trigger a major
compaction
A Hive Compaction Manual example
In our example we have turned off major compaction as it should only run
during off load periods. We take a look at the delta files for our table
in hdfs and see that there are over 300 delta files and 5 base files.
[hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 13:17 /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500
-rw-r--r-- 3 mjohnson hdfs 482990 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500/bucket_00002
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500/bucket_00002_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2123501_2133500
-rw-r--r-- 3 mjohnson hdfs 482784 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2123501_2133500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2123501_2133500/bucket_00001_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2133501_2143500
-rw-r--r-- 3 mjohnson hdfs 482110 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2133501_2143500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2133501_2143500/bucket_00001_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2143501_2153500
-rw-r--r-- 3 mjohnson hdfs 476285 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2143501_2153500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2143501_2153500/bucket_00000_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2153501_2163500
A decision has been been made to run the major compaction manually
during the even lull, so we execute the “ALTER TABLE {tablename} COMPACT
‘major’” command to place the compaction job into the queue for
processing. A compaction resource management queue was defined with a
limited quota resource, so the compaction will not impact other jobs.
hive> alter table acidtest compact 'major';
Compaction enqueued.
OK
Time taken: 0.037 seconds
hive> show compactions;
OK
Database Table Partition Type State Worker Start Time
default acidtest NULL MAJOR working server2.hdp-26 1459100244000
Time taken: 0.019 seconds, Fetched: 2 row(s)
hive> show compactions;
OK
Database Table Partition Type State Worker Start Time
Time taken: 0.016 seconds, Fetched: 1 row(s)
hive>;
The outstanding table compaction jobs are visible by executing the
command line “SHOW COMPACTIONS as illustrated in the example above. Or
the ‘major’ compaction is also visible through the Applications history
log. After the ‘major’ compaction has completed, all of the delta files
available at the time the compaction was initiated will have rolled up
into the ‘base’ tables.
[hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 13:17 /apps/hive/warehouse/acidtest_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500
-rw-r--r-- 3 mjohnson hdfs 72704 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 436159 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 219572 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00002
[hive@server1 ~]$
The end result of this example is that 305 consolidated to just 5 files.
While 300 files will not impact the NameNode performance, it will most
likely improve query performance as the Hive engine will have fewer
files to scan to execute the query.
Bibliography
Hopefully, the example and source code supplied with this blog posting
are sufficient to get you started with Hive Streaming and avoid
potential problems. In addition to this blog posting some other
resources which are useful references include:
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
http://hortonworks.com/blog/adding-acid-to-apache-hive/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/hive-013-feature-transactions.html
http://www.slideshare.net/Hadoop_Summit/adding-acid-transactions-inserts-updates-a
... View more
Labels:
08-04-2016
10:37 PM
7 Kudos
The Hive Streaming API enables the near real-time data ingestion into Hive. This two part posting reviews some of the design decisions necessary to produce a health Hive Streaming ingest process from which you can in
a near real-time execute queries on the ingested data. Implementing a real-time Hive Streaming example Implementing a Hive Streaming data feed requires we make tradeoffs
between the load on the NameNode versus the business SLAs for low
latency data queries. Hive Streaming is able to work in a near real-time
basis through the creation of a new ‘delta’ file on a bucketed Hive
table with every table commit. For example, a Hive Streaming client is
committing 1 row every second (it can actually commit at much faster
rates), then after 1 minute there would exist 60 new delta files added o
HDFS. After a day, there would be 86,400 new delta files and 15.5
million after 6 months just on your streamed data. With this many new
HDFS Hive delta files, eventually the Hadoop cluster would encounter
NameNode problems. When implementing your Hive Streaming solution it is important to keep
in mind the following table list 4 of the most significant
implementation concerns.
Implementation Concern
Design Decisions to address concern
Need to reduce the latency between ingestion and availability for queries
Decrease the number of rows per commit. Consequence – there will most likely be more delta files created which will impact the NameNode. Reduce the column count on streamed table to reduce the amount of data moved around during compaction.
NameNode reaching capacity
Increase the number of rows per commit. Consequence Reduction in the number of delta files within HDFS at any point in time, but there could be a longer lag before the data will be accessible to Hive Queries. Based on resources availability schedule regular major compactions. Consequence – If there exists a lot of data to compact and limited cluster sources, it could interfere with the performance of other jobs. Consider reducing the bucket and/or partition counts defined to the table in order to reduce the number of temporary delta files stored in HDFS in between compactions.
Major compactions will consume too much of the cluster’s resources
Assign the compaction to its own resource manager queue. Consider reducing the bucket and/or partition counts defined to the table in order to reduce the number of file moves required during compaction. Run compactions on a scheduled or manual basis during off-hours rather than allow automatic execution.
Need to maximize per thread throughput
Increase the number of rows per commit operations. This recommendation may seem contradictory, but it is true that data throughput is maximized by committing more rows at one time and data availability is maximized by decreasing the number of committed rows at one time.
A Hive streaming example with the Hive API Let’s apply the above Hive Streaming design considerations in a simple
example to write 10 columns of data at a rate of over 1 row per 100ms
and see how the above concepts can apply to reality. Hive Streaming Required Configuration Settings The following table represents a list of settings available in Hive 1.2
(HDP 2.4) and which are required to support the ability to select Hive
Streaming for your tables and which will support our example’s
requirements.
Configuration
Value
Description
Hive.support.concurrency
true
Hive.enforce.bucketing
true
Ensures that hive correctly assigns the streamed row into the correct Hive bucket.
Hive.exec.dynamic.partition.mode
nonstrict
Allows for the dynamic creation of partitions. Be careful that the streamed data ingested a low cardinality.
Hive.txn.manager
org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
The class to process the hive transactions necessary to support Hive streaming
Hive.compactor.initiator.on
True
???
Hive.compactor.worker.threads
1
Controlls the number of worker threads controlling the compaction process.
Hive.txn.timeout
Default 300 seconds
Sometimes it is necessary to increase the value to avoid a large number of aborted transactions when the cluster is heavily taxed.
Creating the destination table for the Hive stream Now that the system is configured to support Hive Streaming, it is time
to create a transactional Hive table. The Hive table requirements to
support Hive Streaming include:
Hive table must be stored in ORC format Hive table must be bucketed. If desired, the table may also support
partitioning along with the bucket definition. There must be sufficient disk temporary space to support compaction
operations Need to specify the table as ‘transactional’ CREATE TABLE acidtest (a INT, b1 STRING, b2 STRING,
b3 STRING,b4 STRING,b5 STRING,
b6 STRING,b7 STRING,b8 STRING,
b9 STRING)
CLUSTERED BY(a) INTO 4 BUCKETS
STORED AS ORC
tblproperties("transactional"="true")";
In our example above, we see a CREATE TABLE nearly identical to most of
the other Hive tables we have already created with one notable
exception; we are specifying a table property of “transactional” equals
to true. This property is necessary in order to tell Hive to utilize the
transactional features to support streaming in Hive. The BUCKET count of 4 was selected to keep the number of delta bucket
files to a minimum yet still support table JOINS to other tables such as
CUSTOMER_INFORMATION table with 8 buckets. As 4 is a multiple of the 8
buckets found in the JOINed table, we will also with this schema be able
to support real-time table BUCKET JOINs. In different experiments performed to research for this blog, it was
found that dropping the “transactional introduces inconsistent results.
The problem can be in some instances it will appear to work, so make
absolutely certain that the “transactional” property is defined to your
Hive streaming table. Establishing the Hive Connections Now that we have our destination Hive Streaming table created and
available to Hive, it is now time to start creating our Hive Streaming
client. For this blog posting to illustrate the concepts behind Hive
Streaming, we will use the base Hive Streaming API. But, be aware that
both Flume and Storm have components to support at a higher level the
Hive Streaming process. For the example the next sections will walk through the full code and
support files are available through in the
github
WriteHDFS.java available for your download. Our first step to create our streaming client is to define the
HiveEndPoint connection to the thrift service. HiveEndPoint("thrift://"+hostname+":9083", "HIVE_DATABASE","HIVE_TABLE_NAME", null); The hostname and port name to configure for our connection can be
retrieved from within Ambari by searching for the property value
‘hive.metastore.uris’, and should not be confused with the hiveserver2
hostname. Hive Streaming Writers Once the endpoint connection established we need to assign it to our
streaming writer and specify the columns in our destination table. At the time of this writing, there are 3 writer classes available:
DelimitedInputWriter (the writer used for this blog’s example) –
Outputs the column values as a delimited string where you can
specify the desired delimiter. StrictJsonWriter – Converts the JSON string passed to the writer
into an Object using JsonSerde. AbstractRecordWriter – A base class available to create your own
customer writers String[] colFlds = new String[]{"a", "b1", "b2", "b3", "b4","b5", "b6", "b7", "b8", "b9"};
writer = new DelimitedInputWriter(colFlds, ",", endPt); The DelimitedInputWriter which we are using for this example contains 3
parameters:
A String[] array containing a list of the column names defined
to the table. You need to make certain that these column names match
up with the CREATE TABLE columns or you could encounter some errors. The string delimiter between each field. In this example, we are
just doing a comma separated list. The HiveEndPoint connection reference Defining the Hive Transactions and writing to the stream Our next step is to open a collection of batch ids to use for the hive
streaming. Hive defaults to 1,000 open transactions at one time, though
this value may be specified by changing the hive property
‘hive.txn.max.open.batch’. Once there exists a pool of open
transactions, the ‘beginNextTransaction()’ operation is able to start a
transaction for Hive Streaming to write. Keep in mind that attempting to
call ‘beginNextTransaction()’ when there are no more remaining
transactions will produce an error, so it is useful to check ‘remaining
Transactions()’ is greater than 0 before trying to start a new
transaction. TransactionBatch txnBatch =
connection.fetchTransactionBatch(maxBatchGroups, writer);
txnBatch.beginNextTransaction();
for (int i = 0; i < writeRows; ++i) {
writeStream(i, connection, writer, txnBatch, **false**);
if (currentBatchSize < 1) {
System.out.println(">"+threadId
" Beginning Transaction Commit:"+i+" Transaction State:"
+txnBatch.getCurrentTransactionState());
writer.flush();
txnBatch.commit();
if (txnBatch.remainingTransactions() > 0) {
System.out.println(">"+threadId+" ->"
+i+" txnBatch transactions remaining:"
+ txnBatch.remainingTransactions());
txnBatch.beginNextTransaction();
currentBatchSize = maxBatchSize;
} else {
System.out.println(">"+threadId
+" Refreshing the transaction group count");
txnBatch = connection.fetchTransactionBatch(maxBatchGroups,writer);
txnBatch.beginNextTransaction();
currentBatchSize = maxBatchSize;
}
}
--currentBatchSize;
}
writer.flush();
txnBatch.commit();
txnBatch.close(); Some important implementation details to keep in mind include:
The data is not available for queries until after the flush()
and commit() operations have completed. When the fetchTransactionBatch operation executes the system opens
‘maxBatchGroups’ count transactions. In order to close them, make
certain to execute the txnBatch.close() at the end of Hive Stream
method as shown at the bottom of the example. Physical Data Changes as the Hive Streaming example executes To really understand how Hive Streaming works, it is useful to monitor
the effect each step of the flow on the HDFS physical storage layer.
There are two physical measures to look at as the Hive Streaming process
executes:
Count and size of the delta files Number of rows accessible to Hive Queries Looking at the physical data flow, there are 4 high level steps
described below which should clarify how Hive Streaming impacts the
processing flow. Step 1: After the table create and before the commit() operation: Right before the commit() operation and right after the flush, you will
start to see the delta files appear in the warehouse subdirectory for
your hive stream table. [hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 17:12 /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 17:12 /apps/hive/warehouse/acidtest/delta_3238501_3239500
-rw-r--r-- 3 mjohnson hdfs 2792 2016-03-27 17:12 /apps/hive/warehouse/acidtest/delta_3238501_3239500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 8 2016-03-27 17:12 /apps/hive/warehouse/acidtest/delta_3238501_3239500/bucket_00000_flush_length We can see that in bucket #0 2792 bytes of temporary data has been
written to HDFS. Though at this point in time because the commit() has
not run, you will see 0 record count upon executing a select count(*)
from actidtest . Step 2: Immediately after the ‘commit()’ instruction However, as soon as the commit has completed, then we find that the
‘select count(*)…’ now returns 10,0001 rows as being part of the
acidtest table. There are also 999 transactions remaining, so we do not
need to call the method connection.fetchTransactionBatch(maxBatchGroups, writer); yet. Step 3: ‘maxBatchGroups’ commit() operations have completed, and no
more open transactions Now that the maxBatchGroups (generally 1,000) commit() operations have
completed and there are no more remainingTransactions, it is now
necessary to pull an additional group of Transaction batches by calling
‘‘connection.fetchTransactionBatch( maxBatchGroups, writer);’ again.
If we execute the command ‘SHOW TRANSACTIONS;’ before and after calling
‘fetchTransactionBatch’ we would see that the total number of OPEN
transactions would increase by roughly the amount of the
‘maxBatchGroups’ variable value. As the stream continues to ingest data you will notice that the simple
delta file list displayed below now has duplicated references to the
same bucket (bucket #0). [hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 17:35 /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3296501_3297500
-rw-r--r-- 3 mjohnson hdfs 1521 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3296501_3297500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 8 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3296501_3297500/bucket_00000_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3297501_3298500
-rw-r--r-- 3 mjohnson hdfs 1545 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3297501_3298500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 8 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3297501_3298500/bucket_00001_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3298501_3299500
-rw-r--r-- 3 mjohnson hdfs 1518 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3298501_3299500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 8 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3298501_3299500/bucket_00000_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3299501_3300500
-rw-r--r-- 3 mjohnson hdfs 1519 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3299501_3300500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 8 2016-03-27 17:37 /apps/hive/warehouse/acidtest/delta_3299501_3300500/bucket_00000_flush_length
As we see in the file listing above, there are 3 delta files for the
transaction range of 3299501-3300500 and bucket #0. This is because
there were multiple commits assigning values to bucket #0 for the open
transaction range. If we increased the number of rows included in each
commit as mentioned earlier in the Implementation section of this blog
or played around with the maxBatchGroups value, we might actually see
fewer of these small temporary delta files. The good news is while the hive streaming commit operations are adding
new delta files, the compaction functionality is also running in the
back ground trying to consolidate all of these files for more efficient
Namenode operations and queries. Step 4: Completion of the Hive Stream process In many instances, the Hive Stream ingestion process never actually
ends. New delta files are continually getting created in HDFS by the
commit() step and the compaction processes are continually consolidating
the delta files. As each transaction gets committed, it gets closed from
the TRANSACTIONS list. In a healthy system, you will find that over time the number of
transactions reported by the command ‘SHOW TRANSACTIONS’ should be less
than the value you have assigned to ‘maxBatchGroups’. If the number is
greater than the ‘maxBatchGroups’ then it is possible that for one of
the transactional tables that the ingest process was stopped without
executing a ‘txnBatch.close();’ operation which returns the unused
transactions. Hive Streaming Compaction Hive Compaction Design considerations The Compaction process has a set of cleaner processes running in the
background during the ingest process looking for opportunities to
compact the delta files based on the rules you specify. The first thing to keep in mind is that there are two forms of
Compaction; ‘minor’ and ‘major’. A ‘minor’ compaction will just
consolidate the delta files. This approach does not have to worry about
consolidating all of the delta files along with a large set of base
bucket files and is thus the least disruptive to the system resources.
‘major’ compaction consolidates all of the delta files just like the
‘minor’ compaction and in addition it consolidates the delta files with
the base to produce a very clean physical layout for the hive table.
However, major compactions can take minutes to hours and can consume a
lot of disk, network, memory and CPU resources, so they should be
invoked carefully. To provide greater control over the compaction process and avoid
impacting other processes in addition to the compactor configuration
options available, it is also possible to invoke compaction
automatically by the cleaner threads or manually initiated when system
load is low. The primary compaction configuration triggers to review when
implementing or tuning your compaction processes are:
hive.compactor.initiator.on hive.compactor.cleaner.run.interval hive.compactor.delta.num.threshold - Number of delta directories in
a table or partition that will trigger a minor compaction. hive.compactor.delta.pct.threshold - Percentage (fractional) size of
the delta files relative to the base that will trigger a
major compaction. 1 = 100%, so the default 0.1 = 10%. hive.compactor.abortedtxn.threshold - Number of aborted transactions
involving a given table or partition that will trigger a major
compaction A Hive Compaction Manual example In our example we have turned off major compaction as it should only run
during off load periods. We take a look at the delta files for our table
in hdfs and see that there are over 300 delta files and 5 base files. [hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 13:17 /apps/hive/warehouse/acidtest/_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500
-rw-r--r-- 3 mjohnson hdfs 482990 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500/bucket_00002
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2113501_2123500/bucket_00002_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2123501_2133500
-rw-r--r-- 3 mjohnson hdfs 482784 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2123501_2133500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2123501_2133500/bucket_00001_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2133501_2143500
-rw-r--r-- 3 mjohnson hdfs 482110 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2133501_2143500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2133501_2143500/bucket_00001_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2143501_2153500
-rw-r--r-- 3 mjohnson hdfs 476285 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2143501_2153500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 1600 2016-03-27 13:18 /apps/hive/warehouse/acidtest/delta_2143501_2153500/bucket_00000_flush_length
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:17 /apps/hive/warehouse/acidtest/delta_2153501_2163500
A decision has been been made to run the major compaction manually
during the even lull, so we execute the “ALTER TABLE {tablename} COMPACT
‘major’” command to place the compaction job into the queue for
processing. A compaction resource management queue was defined with a
limited quota resource, so the compaction will not impact other jobs. hive> alter table acidtest compact 'major';
Compaction enqueued.
OK
Time taken: 0.037 seconds
hive> show compactions;
OK
Database Table Partition Type State Worker Start Time
default acidtest NULL MAJOR working server2.hdp-26 1459100244000
Time taken: 0.019 seconds, Fetched: 2 row(s)
hive> show compactions;
OK
Database Table Partition Type State Worker Start Time
Time taken: 0.016 seconds, Fetched: 1 row(s)
hive>; The outstanding table compaction jobs are visible by executing the
command line “SHOW COMPACTIONS as illustrated in the example above. Or
the ‘major’ compaction is also visible through the Applications history
log. After the ‘major’ compaction has completed, all of the delta files
available at the time the compaction was initiated will have rolled up
into the ‘base’ tables. [hive@server1 ~]$ hadoop fs -ls -R /apps/hive/warehouse/acidtest
-rw-r--r-- 3 mjohnson hdfs 4 2016-03-27 13:17 /apps/hive/warehouse/acidtest_orc_acid_version
drwxrwxrwx - mjohnson hdfs 0 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500
-rw-r--r-- 3 mjohnson hdfs 72704 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00000
-rw-r--r-- 3 mjohnson hdfs 436159 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00001
-rw-r--r-- 3 mjohnson hdfs 219572 2016-03-27 13:37 /apps/hive/warehouse/acidtest/base_2213500/bucket_00002
[hive@server1 ~]$ The end result of this example is that 305 consolidated to just 5 files.
While 300 files will not impact the NameNode performance, it will most
likely improve query performance as the Hive engine will have fewer
files to scan to execute the query. Next Steps After completing these examples, you should have a good idea on how to setup Hive streaming in your environment. For some additional information on Hive Streaming check out the Bibliography section at the bottom of this article. Bibliography Hopefully, the example and source code supplied with this blog posting
are sufficient to get you started with Hive Streaming and avoid
potential problems. In addition to this blog posting some other
resources which are useful references include:
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions http://hortonworks.com/blog/adding-acid-to-apache-hive/
... View more
Labels: