Community Articles

Find and share helpful community-sourced technical articles.
Announcements
Celebrating as our community reaches 100,000 members! Thank you!
Labels (1)
avatar
Super Guru

Objective

The objective of this article is to provide an introduction to User-Defined Functions for spatial analysis in Hive.

Credit

This article is based on multiple ESRI resources referenced throughout the article. Thanks also to my colleague Artem Ervits for his contribution to the proof-of-concept that lead to this article.

Spatial Types

Spatial types are data types that store geometry data and they have associated functions or methods that are used to access and manipulate the data using Structured Query Language (SQL).

Unlike Oracle, DB2, SQL Server, or PostGreSQL, Hive does not support yet a geometry native data type. The best approach is to store spatial data as text. That will not impact the use of UDFs available for Hive.

ESRI Libraries

The Esri Geometry API for Java includes geometry objects (e.g. points, lines, and polygons), spatial operations (e.g. intersects, buffer), and spatial indexing. By deploying the library (as a jar) within Hadoop, you are able to build custom MapReduce applications using Java to complete analysis on your spatial data. This can be used as a standalone library, or combined with Spatial Framework for Hadoop to create a SQL like experience.

The Spatial Framework for Hadoop includes among others, the Hive Spatial library with User-Defined Functions and SerDes for spatial analysis in Hive. By enabling this library in Hive, you are able to construct queries using Hive Query Language (HQL), which is very similar to SQL. This allows you to avoid complicated MapReduce algorithms and stick to a more familiar workflow. The API used by the Hive UDF’s could be used by developers building geometry functions for 3rd-party applications using Storm, Spark, HBase etc.

A good description of ESRI Java Geometry Library here: https://github.com/Esri/geometry-api-java/wiki

UDF Documentation: https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation

The following chart demonstrates the hierarchy of the ST_Geometry data type and its subclasses.

5665-screen-shot-2016-07-08-at-64805-pm.png

ST_Geometry's subclasses are divided into two categories: the base geometry subclasses and the homogeneous collection subclasses. The base geometries include ST_Point, ST_LineString, and ST_Polygon, while the homogeneous collections include ST_MultiPoint, ST_MultiLineString, and ST_MultiPolygon. As the names imply, the homogeneous collections are collections of base geometries. In addition to sharing base geometry properties, homogeneous collections have some of their own properties.

Each subclass stores the type of geometry implied by its name; for instance, ST_MultiPoint stores multipoints. A list of the subclasses and their descriptions are in the following table:

Subtype Description
ST_Point A zero-dimensional geometry that occupies a single location in coordinate space

Has a single x,y coordinate value, is always simple, and has a NULL boundary

ST_LineString A one-dimensional object stored as a sequence of points defining a linear interpolated path

ST_LineStrings have length.

The ST_LineString is simple if it does not intersect its interior.

The endpoints (the boundary) of a closed ST_LineString occupy the same point in space.

An ST_LineString is a ring if it is both closed and simple.

The endpoints normally form the boundary of an ST_LineString unless the ST_LineString is closed, in which case the boundary is NULL.

The interior of an ST_LineString is the connected path that lies between the endpoints, unless it is closed, in which case the interior is continuous.

ST_Polygon A two-dimensional surface stored as a sequence of points defining its exterior bounding ring and zero or more interior rings

ST_Polygon has area and is always simple.

The exterior and any interior rings define the boundary of an ST_Polygon, and the space enclosed between the rings defines the ST_Polygon's interior.

The rings of an ST_Polygon can intersect at a tangent point but never cross.

ST_MultiPoint A collection of ST_Points

Has a dimension of 0

An ST_MultiPoint is simple if none of its elements occupy the same coordinate space.

The boundary of an ST_MultiPoint is NULL.

ST_MultiLineString A collection of ST_LineStrings

ST_MultiLineStrings have length.

ST_MultiLineStrings are simple if they only intersect at the endpoints of the ST_LineString elements.

ST_MultiLineStrings are nonsimple if the interiors of the ST_LineString elements intersect.

The boundary of an ST_MultiLineString is the nonintersected endpoints of the ST_LineString elements.

The ST_MultiLineString is closed if all of its ST_LineString elements are closed.

The boundary of an ST_MultiLineString is NULL if all the endpoints of all the elements are intersected.

ST_MultiPolygon A collection of polygons

ST_MultiPolygons have area.

The boundary of an ST_MultiPolygon is the cumulative length of its elements' exterior and interior rings.

The interior of an ST_MultiPolygon is defined as the cumulative interiors of its element ST_Polygons.

The boundary of an ST_MultiPolygon's elements can only intersect at a tangent point.

Note that each subclass inherits the properties of the ST_Geometry superclass but also has properties of its own. Functions that operate on the ST_Geometry data type accept any of the subclass entity types. However, some functions have been defined at the subclass level and only accept certain subclasses. For example, the ST_GeometryN function only takes ST_MultiLinestring, ST_MultiPoint, or ST_MultiPolygon subtype values as input.

To discover the subclass of an ST_Geometry, you can use the ST_GeometryType function. The ST_GeometryType function takes an ST_Geometry and returns the instantiated subclass in the form of a character string. To find out how many base geometry elements are contained in a homogeneous collection, you can use the ST_NumGeometries function, which takes a homogeneous collection and returns the number of base geometry elements it contains.

Libraries

For this demo, use the jar files located on the esri.zip file at https://github.com/cstanca1/hdp-hive-spatial. We built them from the latest source code modifying build.xml and pom.xml for both, https://github.com/Esri/spatial-framework-for-hadoop and the dependency library https://github.com/Esri/geometry-api-java, to reference Hive 1.2 and Hadoop 2.7, to match HDP 2.4.2 content. If you use Hive 1.1, you can download the binaries from https://github.com/Esri/spatial-framework-for-hadoop/releases

Demo Steps

Demo steps are also described at: https://github.com/cstanca1/hdp-hive-spatial

1) Clone the repo locally:

git clone https://github.com/cstanca1/hdp-hive-spatial.git

2) Extract the 3 jars files from esri.zip and copy them to hdfs. Grant ownership on these jar files to hive:hdfs.

3) Connect to your hive database using your preferred CLI and execute the content of spatial-geometry-demo-ddl-dml.txt to create tables and populate with sample data.

4) Before executing any query, add ESRI jar libraries to your session execution path, for the session or globally.

add jar hdfs://YourHDFSClientNode:8020/esri/esri-geometry-api.jar;
add jar hdfs://YourHDFSClientNode:8020/esri/spatial-sdk-hive-1.1.1-SNAPSHOT.jar;
add jar hdfs://YourHDFSClientNode:8020/esri/spatial-sdk-json-1.1.1-SNAPSHOT.jar;

5) Define temporary functions for better SQL-like experience.

create temporary function st_geomfromtext as 'com.esri.hadoop.hive.ST_GeomFromText';
create temporary function st_geometrytype as 'com.esri.hadoop.hive.ST_GeometryType';
create temporary function st_point as 'com.esri.hadoop.hive.ST_Point';
create temporary function st_asjson as 'com.esri.hadoop.hive.ST_AsJson';
create temporary function st_asbinary as 'com.esri.hadoop.hive.ST_AsBinary';
create temporary function st_astext as 'com.esri.hadoop.hive.ST_AsText';
create temporary function st_intersects as 'com.esri.hadoop.hive.ST_Intersects';
create temporary function st_x as 'com.esri.hadoop.hive.ST_X';
create temporary function st_y as 'com.esri.hadoop.hive.ST_Y';
create temporary function st_srid as 'com.esri.hadoop.hive.ST_SRID';
create temporary function st_linestring as 'com.esri.hadoop.hive.ST_LineString';
create temporary function st_pointn as 'com.esri.hadoop.hive.ST_PointN';
create temporary function st_startpoint as 'com.esri.hadoop.hive.ST_StartPoint';
create temporary function st_endpoint as 'com.esri.hadoop.hive.ST_EndPoint';
create temporary function st_numpoints as 'com.esri.hadoop.hive.ST_NumPoints';

6) Execute the various spatial queries included in spatial-geometry-demo-queries.txt file also listed below:

Counts by geometry type -- assumes that other than ST_POINT values are possible:

select st_geometrytype(st_geomfromtext(shape)), count(shape)
from demo_shape_point
group by st_geometrytype(st_geomfromtext(shape));

Counts by geometry type -- assumes that other than ST_LINESTRING values are possible:

select st_geometrytype(st_geomfromtext(shape)), count(shape)
from demo_shape_linestring
group by st_geometrytype(st_geomfromtext(shape));

Counts by geometry type -- assumes that other than ST_POLYGON values are possible:

select st_geometrytype(st_geomfromtext(shape)), count(shape)
from demo_shape_polygon
group by st_geometrytype(st_geomfromtext(shape));

X and Y coordinates of the point:

select st_x(st_point(shape)) AS X, st_y(st_point(shape)) AS Y
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;

Extract geometry from text shape:

select st_geomfromtext(shape)
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;

Geometry type:

select st_geometrytype(st_geomfromtext(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;

Point geometry as a binary - implicitly:

select st_point(shape)
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;

Point geometry as a binary - explicitly:

select st_asbinary(st_geomfromtext(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;

Point geometry as Json:

select st_asjson(st_geomfromtext(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;

Point geometry as a text:

select st_astext(st_point(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;

SRID for a point:

select st_srid(st_point(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;

Line as text:

select st_astext(st_linestring(shape))
from demo_shape_linestring
where st_geometrytype(st_geomfromtext(shape)) = "ST_LINESTRING"
limit 1;

n point of a line:

select st_astext(st_point(st_astext(st_pointn(st_linestring(shape), 2))))
from demo_shape_linestring
where st_geometrytype(st_geomfromtext(shape)) = "ST_LINESTRING"
limit 1;

Start and end points of a line:

select 
    st_astext(st_startpoint(st_linestring(shape))) AS StartPoint, 
    st_astext(st_endpoint(st_linestring(shape))) AS EndPoint
from demo_shape_linestring
where st_geometrytype(st_geomfromtext(shape)) = "ST_LINESTRING"
limit 1;

Number of points in a polygon:

select shape, st_numpoints(st_geomfromtext(shape)) as NumPoints
from demo_shape_polygon
where st_geometrytype(st_geomfromtext(shape))= "ST_POLYGON"
limit 1;

Lines intersection - usually you would have two tables - this is just an example with a table with one row:

select st_intersects(a.s1, b.s2)
from
    (select st_point(shape) AS  `s1` from demo_shape_point limit 1) a join
    (select st_point(shape) AS  `s2` from demo_shape_point limit 1) b
limit 1;

Lines intersection - usually you would have two tables - this is just an example with a table with one row:

select st_intersects(a.s1, b.s2)
from
    (select st_linestring(0,0, 1,1) AS  `s1` from demo_shape_linestring limit 1) a join
    (select st_linestring(1,1, 0,0) AS  `s2` from demo_shape_linestring limit 1) b
limit 1;

Conclusion

The Esri Geometry API for Java and the Spatial Framework for Hadoop could be used by developers building geometry functions for various geo-spatial applications using Hive, Storm, Spark, HBase, anything from the big data ecosystem.

18,896 Views
Comments

Multi-Polygon example:

SELECT st_astext(ST_MultiPolygon('multipolygon (((0 0, 0 1, 1 0, 0 0)), ((2 2, 2 3, 3 2, 2 2)))')) from YourTable LIMIT 1;