Member since
03-16-2016
707
Posts
1753
Kudos Received
203
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
5129 | 09-21-2018 09:54 PM | |
6495 | 03-31-2018 03:59 AM | |
1969 | 03-31-2018 03:55 AM | |
2179 | 03-31-2018 03:31 AM | |
4833 | 03-27-2018 03:46 PM |
07-11-2016
06:52 PM
3 Kudos
@mrizvi You are missing a comma after driverName right before get_json_object(json_staging.json, "$.eventTime") if this resolves this problem, please accept this response.
... View more
07-11-2016
06:23 PM
1 Kudo
Sanjay Radia recently presented a new concept to be introduced into HDFS (Hadoop 3) called Storage Containers. Storage Containers are an extensibility mechanism that will allow HDFS to manage object storage, such as S3. Watch
https://www.youtube.com/watch?v=SdmJHmpvp7E
and see "EVOLVING HDFS TO A GENERALIZED DISTRIBUTED STORAGE SUBSYSTEM".
... View more
07-11-2016
05:23 PM
@Sunile Manjee Until available as a feature, your best answer is the one that provides a workaround using -f InitFile and include in your initFile what you need to initialize. beeline -f InitFile jdbc:hive2://WHATEVER:10000
... View more
07-11-2016
05:16 PM
3 Kudos
The file needs to be owned by hive:hdfs or the user that has proper hive privileges. It can be located anywhere.
... View more
07-09-2016
01:30 AM
9 Kudos
Objective
The objective of this article is to provide an introduction to User-Defined Functions for spatial analysis in Hive. Credit
This article
is based on multiple ESRI resources referenced throughout the article. Thanks also to my colleague Artem Ervits for his contribution to the proof-of-concept
that lead to this article. Spatial Types
Spatial types are data types that
store geometry data and they have associated functions or methods that are used
to access and manipulate the data using Structured Query Language (SQL).
Unlike Oracle, DB2, SQL Server, or PostGreSQL,
Hive does not support yet a geometry native data type. The best approach is to
store spatial data as text. That will not impact the use of UDFs available for
Hive. ESRI Libraries
The Esri Geometry API for Java includes
geometry objects (e.g. points, lines, and polygons), spatial operations (e.g.
intersects, buffer), and spatial indexing. By deploying the library (as a jar)
within Hadoop, you are able to build custom MapReduce applications using Java
to complete analysis on your spatial data. This can be used as a standalone
library, or combined with
Spatial Framework
for Hadoop
to create a SQL like experience.
The
Spatial
Framework for Hadoop
includes among others, the Hive
Spatial
library with
User-Defined Functions and SerDes for spatial analysis in Hive. By enabling this library in Hive,
you are able to construct queries using Hive Query Language (HQL), which is
very similar to SQL. This allows you to avoid complicated MapReduce algorithms
and stick to a more familiar workflow. The API used by the
Hive UDF’s could be used by developers building geometry functions for
3rd-party applications using Storm, Spark, HBase etc.
A good
description of ESRI Java Geometry Library here:
https://github.com/Esri/geometry-api-java/wiki
UDF Documentation: https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation
The following chart demonstrates the
hierarchy of the ST_Geometry data type and its subclasses.
ST_Geometry's
subclasses are divided into two categories: the base geometry subclasses and
the homogeneous collection subclasses. The base geometries include ST_Point,
ST_LineString, and ST_Polygon, while the homogeneous collections include
ST_MultiPoint, ST_MultiLineString, and ST_MultiPolygon. As the names imply, the
homogeneous collections are collections of base geometries. In addition to
sharing base geometry properties, homogeneous collections have some of their
own properties.
Each
subclass stores the type of geometry implied by its name; for instance,
ST_MultiPoint stores multipoints. A list of the subclasses and their
descriptions are in the following table:
Subtype
Description
ST_Point
A
zero-dimensional geometry that occupies a single location in coordinate space
Has
a single x,y coordinate value, is always simple, and has a NULL boundary
ST_LineString
A
one-dimensional object stored as a sequence of points defining a linear
interpolated path
ST_LineStrings
have length.
The
ST_LineString is simple if it does not intersect its interior.
The
endpoints (the boundary) of a closed ST_LineString occupy the same point in
space.
An
ST_LineString is a ring if it is both closed and simple.
The
endpoints normally form the boundary of an ST_LineString unless the
ST_LineString is closed, in which case the boundary is NULL.
The
interior of an ST_LineString is the connected path that lies between the
endpoints, unless it is closed, in which case the interior is continuous.
ST_Polygon
A
two-dimensional surface stored as a sequence of points defining its exterior
bounding ring and zero or more interior rings
ST_Polygon
has area and is always simple.
The
exterior and any interior rings define the boundary of an ST_Polygon, and the
space enclosed between the rings defines the ST_Polygon's interior.
The
rings of an ST_Polygon can intersect at a tangent point but never cross.
ST_MultiPoint
A
collection of ST_Points
Has
a dimension of 0
An
ST_MultiPoint is simple if none of its elements occupy the same coordinate
space.
The
boundary of an ST_MultiPoint is NULL.
ST_MultiLineString
A
collection of ST_LineStrings
ST_MultiLineStrings
have length.
ST_MultiLineStrings
are simple if they only intersect at the endpoints of the ST_LineString
elements.
ST_MultiLineStrings
are nonsimple if the interiors of the ST_LineString elements intersect.
The
boundary of an ST_MultiLineString is the nonintersected endpoints of the
ST_LineString elements.
The
ST_MultiLineString is closed if all of its ST_LineString elements are closed.
The
boundary of an ST_MultiLineString is NULL if all the endpoints of all the
elements are intersected.
ST_MultiPolygon
A
collection of polygons
ST_MultiPolygons
have area.
The
boundary of an ST_MultiPolygon is the cumulative length of its elements'
exterior and interior rings.
The
interior of an ST_MultiPolygon is defined as the cumulative interiors of its
element ST_Polygons.
The
boundary of an ST_MultiPolygon's elements can only intersect at a tangent
point.
Note
that each subclass inherits the properties of the ST_Geometry superclass but
also has properties of its own. Functions that operate on the ST_Geometry data
type accept any of the subclass entity types. However, some functions have been
defined at the subclass level and only accept certain subclasses. For example,
the ST_GeometryN function only takes ST_MultiLinestring, ST_MultiPoint, or
ST_MultiPolygon subtype values as input.
To
discover the subclass of an ST_Geometry, you can use the ST_GeometryType function.
The ST_GeometryType function takes an ST_Geometry and returns the instantiated
subclass in the form of a character string. To find out how many base geometry
elements are contained in a homogeneous collection, you can use the
ST_NumGeometries function, which takes a homogeneous collection and returns the
number of base geometry elements it contains. Libraries
For this demo, use the jar files located on
the esri.zip file at
https://github.com/cstanca1/hdp-hive-spatial.
We built them from the latest source code modifying build.xml and pom.xml
for both,
https://github.com/Esri/spatial-framework-for-hadoop
and the dependency library https://github.com/Esri/geometry-api-java,
to reference Hive 1.2 and Hadoop 2.7, to
match HDP 2.4.2 content. If you use Hive 1.1, you can download the binaries
from
https://github.com/Esri/spatial-framework-for-hadoop/releases Demo Steps
Demo steps
are also described at:
https://github.com/cstanca1/hdp-hive-spatial
1) Clone the repo locally: git clone https://github.com/cstanca1/hdp-hive-spatial.git
2) Extract the 3 jars files
from esri.zip and copy them to hdfs. Grant ownership on these jar files to
hive:hdfs.
3) Connect to your hive
database using your preferred CLI and execute the content of
spatial-geometry-demo-ddl-dml.txt to create tables and populate with sample
data.
4) Before executing any
query, add ESRI jar libraries to your session execution path, for the session
or globally. add jar hdfs://YourHDFSClientNode:8020/esri/esri-geometry-api.jar;
add jar hdfs://YourHDFSClientNode:8020/esri/spatial-sdk-hive-1.1.1-SNAPSHOT.jar;
add jar hdfs://YourHDFSClientNode:8020/esri/spatial-sdk-json-1.1.1-SNAPSHOT.jar;
5) Define temporary
functions for better SQL-like experience. create temporary function st_geomfromtext as 'com.esri.hadoop.hive.ST_GeomFromText';
create temporary function st_geometrytype as 'com.esri.hadoop.hive.ST_GeometryType';
create temporary function st_point as 'com.esri.hadoop.hive.ST_Point';
create temporary function st_asjson as 'com.esri.hadoop.hive.ST_AsJson';
create temporary function st_asbinary as 'com.esri.hadoop.hive.ST_AsBinary';
create temporary function st_astext as 'com.esri.hadoop.hive.ST_AsText';
create temporary function st_intersects as 'com.esri.hadoop.hive.ST_Intersects';
create temporary function st_x as 'com.esri.hadoop.hive.ST_X';
create temporary function st_y as 'com.esri.hadoop.hive.ST_Y';
create temporary function st_srid as 'com.esri.hadoop.hive.ST_SRID';
create temporary function st_linestring as 'com.esri.hadoop.hive.ST_LineString';
create temporary function st_pointn as 'com.esri.hadoop.hive.ST_PointN';
create temporary function st_startpoint as 'com.esri.hadoop.hive.ST_StartPoint';
create temporary function st_endpoint as 'com.esri.hadoop.hive.ST_EndPoint';
create temporary function st_numpoints as 'com.esri.hadoop.hive.ST_NumPoints';
6) Execute the various
spatial queries included in spatial-geometry-demo-queries.txt file also listed
below:
Counts by geometry type --
assumes that other than ST_POINT values are possible: select st_geometrytype(st_geomfromtext(shape)), count(shape)
from demo_shape_point
group by st_geometrytype(st_geomfromtext(shape));
Counts by geometry type --
assumes that other than ST_LINESTRING values are possible: select st_geometrytype(st_geomfromtext(shape)), count(shape)
from demo_shape_linestring
group by st_geometrytype(st_geomfromtext(shape));
Counts by geometry type --
assumes that other than ST_POLYGON values are possible: select st_geometrytype(st_geomfromtext(shape)), count(shape)
from demo_shape_polygon
group by st_geometrytype(st_geomfromtext(shape));
X and Y coordinates of the
point: select st_x(st_point(shape)) AS X, st_y(st_point(shape)) AS Y
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Extract geometry from text
shape: select st_geomfromtext(shape)
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Geometry type: select st_geometrytype(st_geomfromtext(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Point geometry as a binary -
implicitly: select st_point(shape)
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Point geometry as a binary -
explicitly: select st_asbinary(st_geomfromtext(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Point geometry as Json: select st_asjson(st_geomfromtext(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Point geometry as a text: select st_astext(st_point(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
SRID for a point: select st_srid(st_point(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Line as text: select st_astext(st_linestring(shape))
from demo_shape_linestring
where st_geometrytype(st_geomfromtext(shape)) = "ST_LINESTRING"
limit 1;
n point of a line: select st_astext(st_point(st_astext(st_pointn(st_linestring(shape), 2))))
from demo_shape_linestring
where st_geometrytype(st_geomfromtext(shape)) = "ST_LINESTRING"
limit 1;
Start and end points of a
line: select
st_astext(st_startpoint(st_linestring(shape))) AS StartPoint,
st_astext(st_endpoint(st_linestring(shape))) AS EndPoint
from demo_shape_linestring
where st_geometrytype(st_geomfromtext(shape)) = "ST_LINESTRING"
limit 1;
Number of points in a
polygon: select shape, st_numpoints(st_geomfromtext(shape)) as NumPoints
from demo_shape_polygon
where st_geometrytype(st_geomfromtext(shape))= "ST_POLYGON"
limit 1;
Lines intersection - usually
you would have two tables - this is just an example with a table with one row: select st_intersects(a.s1, b.s2)
from
(select st_point(shape) AS `s1` from demo_shape_point limit 1) a join
(select st_point(shape) AS `s2` from demo_shape_point limit 1) b
limit 1;
Lines intersection - usually
you would have two tables - this is just an example with a table with one row: select st_intersects(a.s1, b.s2)
from
(select st_linestring(0,0, 1,1) AS `s1` from demo_shape_linestring limit 1) a join
(select st_linestring(1,1, 0,0) AS `s2` from demo_shape_linestring limit 1) b
limit 1; Conclusion
The Esri Geometry API for Java and the Spatial
Framework for Hadoop
could be used by developers building geometry functions for various
geo-spatial applications using Hive, Storm, Spark, HBase, anything from the big
data ecosystem.
... View more
Labels:
07-06-2016
10:47 PM
3 Kudos
@david serafini OpenJDK and Oracle JDK have almost the same code, but Oracle JDK has more classes and more bugs fixed and is thoroughly tested and stable. There is no incentive yet to use OpenJDK over Oracle JDK at least not until Oracle starts to charge for it. The decision of the business is to use the most stable and scalable. This is pretty much the rule for any business selling or using enterprise level software.
... View more
07-06-2016
09:54 PM
1 Kudo
@Mark Davis You may have done all the following, but let's make sure and let me know your findings. All hosts in your system must be configured for DNS and Reverse DNS. These quick links, "oddly", need a reverse DNS. For example, test if you can nslookup by IP and get the proper FQDN and nslookup by FQDN and get the proper IP address. If this does not work, your quicklinks will not work. You would have to manually enter the IP address on the quick link URL to bring the expected function. If you are unable to configure DNS and Reverse DNS, you must edit the hosts file on every host in your cluster to contain the address of each of your hosts and to set the Fully Qualified Domain Name hostname of each of those hosts. The following instructions cover basic hostname network setup for generic Linux hosts: https://ambari.apache.org/1.2.4/installing-hadoop-using-ambari/content/ambari-chap1-5-4.html
... View more
07-06-2016
09:32 PM
@Alex McLintock You are correct. This is a local reference to a remote repo. Here is a reference for an actual local repo with no Internet access, but it gets complicated with Vagrant: https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_Installing_HDP_AMB/content/_setting_up_a_local_repository_with_no_internet_access.html
... View more
06-23-2016
02:12 PM
@@Altul Gupta I actually recommended to NOT use Golden Gate but databus. Pls read my earlier response. Vote if helpful.
... View more
06-21-2016
08:55 PM
2 Kudos
@Muthukumar S Assuming that you run a secure cluster, if the DataNodes are started as the root user, JSVC downgrades the processing using setuid to hdfs. However, the ulimit is based on the ulimit of the root user, and the default ulimit values assigned to the root user for the maximum number of open files and processes may be too low for a secure cluster. From what you provided above, you increased the limits for abiuser user. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_installing_manually_book/content/ref-729d1fb0-6d1b-459f-a18a-b5eba4540ab5.1.html If this is helpful, please vote this response.
... View more