Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203
07-29-2016
09:19 AM
9 Kudos
Introduction

This article is a complement to "Geo-spatial Queries with Hive using ESRI Geometry and Spatial Framework for Hadoop" and includes a few more findings, mainly documenting the differences between the ST_Geometry functions supported in Hive and those in commercial spatial packages for Oracle, SQL Server, or Netezza. The Hive UDFs are modeled after existing implementations of ST_Geometry. Some functions exist only in Hive's implementation; a few behave differently or do not exist.

Additional ST_Geometry Functions in Hive

- ST_GeomFromJSON - create a geometry from Esri JSON
- ST_AsJSON - return the Esri JSON representation of a geometry
- ST_GeomFromGeoJSON - create a geometry from GeoJSON
- ST_AsGeoJSON - return the GeoJSON representation of a geometry (requires geometry-api-java version 1.1)
- ST_PointZ - constructor for 3D points from a direct position
- ST_SetSRID - set the spatial-reference ID for a geometry constructed without one
- ST_GeodesicLengthWGS84 - geodesic length in meters rather than in angular units
- ST_Bin, ST_BinEnvelope - aggregate into a regular grid (tutorial here)
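A minimal sketch of two of these functions together (assuming temporary functions have been declared for them, as shown in the Tips section below):

select ST_GeodesicLengthWGS84(ST_SetSRID(ST_LineString(0.0,0.0, 0.3,0.4), 4326));
-- geodesic length in meters (roughly 55 km) of a 0.5-degree WGS84 segment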
ST_Geometry Functions with Different Behavior

- Overloaded constructors - These overloaded constructors differ from other ST_Geometry implementations in how the caller can specify the spatial-reference ID. The default SRID is planar when the SRID is not specified. Hive does not accept an SRID as a second argument - wrap the constructor with ST_SetSRID or use ST_GeomFromText instead (see the sketch at the end of this section). Applies to ST_Point, ST_LineString, ST_Polygon, ST_MultiPoint, ST_MultiLineString, and ST_MultiPolygon.
- ST_PointN - The return value differs in the case of an index out of range - Hive returns null.
- ST_AsText - The OGC WKT standard dictates that a multipoint is represented as MULTIPOINT ((1 2), (3 4)); however, some existing WKT parsers accept only MULTIPOINT (1 2, 3 4). ST_AsText outputs the former, compliant format, with the nested parentheses.
- ST_Envelope - In the case of a point or a vertical or horizontal line, ST_Envelope may either apply a tolerance or return an empty envelope.
- ST_Intersection - In the case where the two geometries intersect in a lower dimension, ST_Intersection may drop the lower-dimension intersections or output a closed linestring.
ST_Intersection(ST_Polygon(2,0, 3,1, 2,1), ST_Polygon(1,1, 4,1, 4,4, 1,4)) -- MULTIPOLYGON EMPTY or LINESTRING (2 1, 3 1, 2 1)
- ST_Union - may drop lower-dimension members of the union.
ST_Union(ST_LineString(2,3, 4,5), ST_Point(1,1)) -- MULTILINESTRING ((2 3, 4 5))
- ST_SymmetricDiff - Hive-spatial follows the naming in the Esri implementation of ST_Geometry. For the OGC naming, use an alias:
create temporary function ST_SymDifference as 'com.esri.hadoop.hive.ST_SymmetricDiff';
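A minimal sketch of the SRID-wrapping pattern and the compliant multipoint output described above (again assuming the functions have been declared):

select ST_SRID(ST_SetSRID(ST_Point(1.0, 2.0), 4326));  -- 4326
select ST_AsText(ST_MultiPoint(1,2, 3,4));             -- MULTIPOINT ((1 2), (3 4))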
ST_Geometry Functions Not Supported in Hive

- ST_GeomCollection, ST_NumGeometries, ST_GeometryN - collections of varying geometry types are not supported; Hive supports arrays.
- ST_Geometry - no constructor of this name - use one of the other constructors, e.g. ST_GeomFromText (see the sketch at the end of this list).
- ST_Curve, ST_Surface, ST_MultiCurve, ST_MultiSurface - curve and surface constructors are not supported.
- ST_PointOnSurface - not supported in Hive.
- ST_GeoSize - not supported in Hive.
- ST_Transform - not supported in Hive.
- ST_Equalsrs - not supported in Hive.
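Since there is no generic ST_Geometry constructor, ST_GeomFromText covers the same ground - a minimal sketch:

select ST_AsText(ST_GeomFromText('linestring (10 10, 20 20)'));  -- LINESTRING (10 10, 20 20)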
Key Resources

Complete list of functions supported in Hive: https://github.com/Esri/spatial-framework-for-hadoop/blob/master/hive/function-ddl.sql
UDF documentation: https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation

Tips

To share functions globally across sessions, create them without the "temporary" option. This has the advantage that you do not need to declare the functions for every session.
create function ST_AsBinary as 'com.esri.hadoop.hive.ST_AsBinary';
You can also include the jar file in the create function statement, which makes it easier to create a permanent declaration. For example, for the definition of the ST_Point function you would write the following SQL statement:
create function ST_Point as 'com.esri.hadoop.hive.ST_Point' using jar 'hdfs://YourHDFSClientNode:8020/esri/spatial-sdk-hive-1.1.1-SNAPSHOT.jar';
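To verify that a permanent function was registered, a quick check (describe function is standard HiveQL):

describe function ST_Point;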
Final Notes

As discussed with ESRI recently, there are no plans to open source all spatial functions currently available for traditional RDBMSs like Oracle, SQL Server, or Netezza, as those are commercially licensed packages. The best option to compensate for the 5-10% of missing functions is to contribute to ESRI's open source repository: https://github.com/Esri/spatial-framework-for-hadoop. ESRI does not provide a commercial library for Hive that includes all spatial functions. Thanks to Mansour Raad, Big Data Expert at ESRI, for his candid and valuable input. Check Mansour's latest article: Geospatial Big Data - the next big trend in analytics
07-14-2016
02:21 AM
Multi-Polygon example:
SELECT st_astext(ST_MultiPolygon('multipolygon (((0 0, 0 1, 1 0, 0 0)), ((2 2, 2 3, 3 2, 2 2)))'))
from YourTable LIMIT 1;
07-09-2016
01:30 AM
9 Kudos
Objective

The objective of this article is to provide an introduction to User-Defined Functions for spatial analysis in Hive.

Credit

This article is based on multiple ESRI resources referenced throughout the article. Thanks also to my colleague Artem Ervits for his contribution to the proof-of-concept that led to this article.

Spatial Types
Spatial types are data types that store geometry data. They have associated functions or methods that are used to access and manipulate the data using Structured Query Language (SQL). Unlike Oracle, DB2, SQL Server, or PostgreSQL, Hive does not yet support a native geometry data type. The best approach is to store spatial data as text; this does not impact the use of the UDFs available for Hive.
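For instance, a minimal sketch of the text-based approach - demo_shape_point mirrors the sample table used later in this article (the actual DDL lives in the demo repository), and the insert syntax assumes Hive 0.14 or later:

create table demo_shape_point (shape string);
insert into table demo_shape_point values ('point (1.5 2.5)');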
ESRI Libraries

The Esri Geometry API for Java includes geometry objects (e.g. points, lines, and polygons), spatial operations (e.g. intersects, buffer), and spatial indexing. By deploying the library (as a jar) within Hadoop, you are able to build custom MapReduce applications using Java to complete analysis on your spatial data. It can be used as a standalone library, or combined with the Spatial Framework for Hadoop to create a SQL-like experience.

The Spatial Framework for Hadoop includes, among others, the Hive Spatial library with User-Defined Functions and SerDes for spatial analysis in Hive. By enabling this library in Hive, you are able to construct queries using Hive Query Language (HQL), which is very similar to SQL. This allows you to avoid complicated MapReduce algorithms and stick to a more familiar workflow. The API used by the Hive UDFs could also be used by developers building geometry functions for 3rd-party applications using Storm, Spark, HBase, etc.
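A minimal sketch of what enabling the library in Hive looks like in practice (the jar paths are hypothetical; the class names are from the Spatial Framework for Hadoop):

add jar hdfs:///esri/esri-geometry-api.jar;
add jar hdfs:///esri/spatial-sdk-hive-1.1.1-SNAPSHOT.jar;
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
create temporary function ST_Polygon as 'com.esri.hadoop.hive.ST_Polygon';
create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
select ST_Contains(ST_Polygon(1,1, 1,4, 4,4, 4,1), ST_Point(2, 3));  -- true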
A good description of the ESRI Java Geometry Library: https://github.com/Esri/geometry-api-java/wiki
UDF Documentation: https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation
The hierarchy of the ST_Geometry data type and its subclasses is described below. ST_Geometry's subclasses are divided into two categories: the base geometry subclasses and the homogeneous collection subclasses. The base geometries include ST_Point, ST_LineString, and ST_Polygon, while the homogeneous collections include ST_MultiPoint, ST_MultiLineString, and ST_MultiPolygon. As the names imply, the homogeneous collections are collections of base geometries. In addition to sharing base geometry properties, homogeneous collections have some of their own properties.

Each subclass stores the type of geometry implied by its name; for instance, ST_MultiPoint stores multipoints. A list of the subclasses and their descriptions follows:
- ST_Point - A zero-dimensional geometry that occupies a single location in coordinate space. It has a single x,y coordinate value, is always simple, and has a NULL boundary.

- ST_LineString - A one-dimensional object stored as a sequence of points defining a linearly interpolated path. ST_LineStrings have length. An ST_LineString is simple if it does not intersect its own interior. The endpoints (the boundary) of a closed ST_LineString occupy the same point in space. An ST_LineString is a ring if it is both closed and simple. The endpoints normally form the boundary of an ST_LineString unless the ST_LineString is closed, in which case the boundary is NULL. The interior of an ST_LineString is the connected path that lies between the endpoints, unless it is closed, in which case the interior is continuous.

- ST_Polygon - A two-dimensional surface stored as a sequence of points defining its exterior bounding ring and zero or more interior rings. An ST_Polygon has area and is always simple. The exterior and any interior rings define the boundary of an ST_Polygon, and the space enclosed between the rings defines the ST_Polygon's interior. The rings of an ST_Polygon can intersect at a tangent point but never cross.

- ST_MultiPoint - A collection of ST_Points. It has a dimension of 0. An ST_MultiPoint is simple if none of its elements occupy the same coordinate space. The boundary of an ST_MultiPoint is NULL.

- ST_MultiLineString - A collection of ST_LineStrings. ST_MultiLineStrings have length. They are simple if they intersect only at the endpoints of their ST_LineString elements, and nonsimple if the interiors of the ST_LineString elements intersect. The boundary of an ST_MultiLineString is the set of non-intersected endpoints of its ST_LineString elements; it is NULL if all the endpoints of all the elements are intersected. An ST_MultiLineString is closed if all of its ST_LineString elements are closed.

- ST_MultiPolygon - A collection of polygons. ST_MultiPolygons have area. The boundary of an ST_MultiPolygon is the cumulative length of its elements' exterior and interior rings, and its interior is defined as the cumulative interiors of its element ST_Polygons. The boundaries of an ST_MultiPolygon's elements can only intersect at a tangent point.
Note that each subclass inherits the properties of the ST_Geometry superclass but also has properties of its own. Functions that operate on the ST_Geometry data type accept any of the subclass entity types. However, some functions are defined at the subclass level and only accept certain subclasses. For example, the ST_GeometryN function only takes ST_MultiLineString, ST_MultiPoint, or ST_MultiPolygon subtype values as input.

To discover the subclass of an ST_Geometry, you can use the ST_GeometryType function, which takes an ST_Geometry and returns the instantiated subclass in the form of a character string. To find out how many base geometry elements are contained in a homogeneous collection, you can use the ST_NumGeometries function, which takes a homogeneous collection and returns the number of base geometry elements it contains.
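A minimal sketch of ST_GeometryType (st_geometrytype and st_geomfromtext are declared in the demo steps below):

select st_geometrytype(st_geomfromtext('multipoint ((1 2), (3 4))'));  -- ST_MULTIPOINT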
Libraries

For this demo, use the jar files located in the esri.zip file at https://github.com/cstanca1/hdp-hive-spatial. We built them from the latest source code, modifying build.xml and pom.xml for both https://github.com/Esri/spatial-framework-for-hadoop and the dependency library https://github.com/Esri/geometry-api-java to reference Hive 1.2 and Hadoop 2.7, to match HDP 2.4.2 content. If you use Hive 1.1, you can download the binaries from https://github.com/Esri/spatial-framework-for-hadoop/releases

Demo Steps

Demo steps are also described at: https://github.com/cstanca1/hdp-hive-spatial
1) Clone the repo locally:
git clone https://github.com/cstanca1/hdp-hive-spatial.git
2) Extract the 3 jar files from esri.zip and copy them to HDFS. Grant ownership on these jar files to hive:hdfs.
3) Connect to your Hive database using your preferred CLI and execute the content of spatial-geometry-demo-ddl-dml.txt to create the tables and populate them with sample data.
4) Before executing any query, add the ESRI jar libraries to your execution path, for the session or globally:
add jar hdfs://YourHDFSClientNode:8020/esri/esri-geometry-api.jar;
add jar hdfs://YourHDFSClientNode:8020/esri/spatial-sdk-hive-1.1.1-SNAPSHOT.jar;
add jar hdfs://YourHDFSClientNode:8020/esri/spatial-sdk-json-1.1.1-SNAPSHOT.jar;
5) Define temporary functions for a better SQL-like experience:
create temporary function st_geomfromtext as 'com.esri.hadoop.hive.ST_GeomFromText';
create temporary function st_geometrytype as 'com.esri.hadoop.hive.ST_GeometryType';
create temporary function st_point as 'com.esri.hadoop.hive.ST_Point';
create temporary function st_asjson as 'com.esri.hadoop.hive.ST_AsJson';
create temporary function st_asbinary as 'com.esri.hadoop.hive.ST_AsBinary';
create temporary function st_astext as 'com.esri.hadoop.hive.ST_AsText';
create temporary function st_intersects as 'com.esri.hadoop.hive.ST_Intersects';
create temporary function st_x as 'com.esri.hadoop.hive.ST_X';
create temporary function st_y as 'com.esri.hadoop.hive.ST_Y';
create temporary function st_srid as 'com.esri.hadoop.hive.ST_SRID';
create temporary function st_linestring as 'com.esri.hadoop.hive.ST_LineString';
create temporary function st_pointn as 'com.esri.hadoop.hive.ST_PointN';
create temporary function st_startpoint as 'com.esri.hadoop.hive.ST_StartPoint';
create temporary function st_endpoint as 'com.esri.hadoop.hive.ST_EndPoint';
create temporary function st_numpoints as 'com.esri.hadoop.hive.ST_NumPoints';
6) Execute the various spatial queries included in the spatial-geometry-demo-queries.txt file, also listed below:
Counts by geometry type -- assumes that values other than ST_POINT are possible:
select st_geometrytype(st_geomfromtext(shape)), count(shape)
from demo_shape_point
group by st_geometrytype(st_geomfromtext(shape));
Counts by geometry type -- assumes that values other than ST_LINESTRING are possible:
select st_geometrytype(st_geomfromtext(shape)), count(shape)
from demo_shape_linestring
group by st_geometrytype(st_geomfromtext(shape));
Counts by geometry type -- assumes that values other than ST_POLYGON are possible:
select st_geometrytype(st_geomfromtext(shape)), count(shape)
from demo_shape_polygon
group by st_geometrytype(st_geomfromtext(shape));
X and Y coordinates of the point:
select st_x(st_point(shape)) AS X, st_y(st_point(shape)) AS Y
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Extract geometry from text shape:
select st_geomfromtext(shape)
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Geometry type:
select st_geometrytype(st_geomfromtext(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Point geometry as binary - implicitly:
select st_point(shape)
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Point geometry as binary - explicitly:
select st_asbinary(st_geomfromtext(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Point geometry as JSON:
select st_asjson(st_geomfromtext(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Point geometry as text:
select st_astext(st_point(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
SRID for a point:
select st_srid(st_point(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Line as text:
select st_astext(st_linestring(shape))
from demo_shape_linestring
where st_geometrytype(st_geomfromtext(shape)) = "ST_LINESTRING"
limit 1;
Nth point of a line (here, the 2nd point):
select st_astext(st_pointn(st_linestring(shape), 2))
from demo_shape_linestring
where st_geometrytype(st_geomfromtext(shape)) = "ST_LINESTRING"
limit 1;
Start and end points of a line:
select
st_astext(st_startpoint(st_linestring(shape))) AS StartPoint,
st_astext(st_endpoint(st_linestring(shape))) AS EndPoint
from demo_shape_linestring
where st_geometrytype(st_geomfromtext(shape)) = "ST_LINESTRING"
limit 1;
Number of points in a polygon:
select shape, st_numpoints(st_geomfromtext(shape)) as NumPoints
from demo_shape_polygon
where st_geometrytype(st_geomfromtext(shape))= "ST_POLYGON"
limit 1;
Points intersection - usually you would have two tables; this is just an example with one-row subqueries:
select st_intersects(a.s1, b.s2)
from
(select st_point(shape) AS `s1` from demo_shape_point limit 1) a join
(select st_point(shape) AS `s2` from demo_shape_point limit 1) b
limit 1;
Lines intersection - usually you would have two tables; this is just an example with one-row subqueries:
select st_intersects(a.s1, b.s2)
from
(select st_linestring(0,0, 1,1) AS `s1` from demo_shape_linestring limit 1) a join
(select st_linestring(1,1, 0,0) AS `s2` from demo_shape_linestring limit 1) b
limit 1;

Conclusion

The Esri Geometry API for Java and the Spatial Framework for Hadoop can be used by developers building geometry functions for various geo-spatial applications using Hive, Storm, Spark, HBase, or anything else from the big data ecosystem.
07-06-2016
09:32 PM
@Alex McLintock You are correct. This is a local reference to a remote repo. Here is a reference for an actual local repo with no Internet access, but it gets complicated with Vagrant: https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_Installing_HDP_AMB/content/_setting_up_a_local_repository_with_no_internet_access.html
06-13-2016
02:41 PM
15 Kudos
Objective

Deploy a 4-node HDP 2.4.2 cluster with Apache Ambari 2.2.2, Vagrant, and VirtualBox on an OS X host. This is helpful for development and proofs of concept.

Scope

This approach has been tested on an OS X host, but it should work on all supported Vagrant and VirtualBox environments.

Pre-requisites

- Minimum 9 GB of RAM for the HDP 2.4.2 cluster
- Download and install Vagrant for your host OS: https://www.vagrantup.com/downloads.html
- Download and install VirtualBox for your host OS: https://www.virtualbox.org/wiki/Downloads
- Download and install a git client for your host
- Open a command shell, change to the folder where you plan to clone the github repository, and clone the following git repository:
git clone https://github.com/cstanca1/hdp2_4_2-vagrant.git

Create and Start VMs

Change directory to hdp2_4_2-vagrant, the folder that includes the Vagrantfile, and create a /data folder:
mkdir data
This /data folder will be needed for the guest VMs to share with the host. Vagrant (via the Vagrantfile) is configured to use CentOS 6.7 as the base box and includes the pre-requisites for installing HDP. 4 VMs will be created: 1 Ambari server (ambari1), 1 Hadoop master (master1) and 2 slaves (slave1, slave2).
vagrant up ambari1
Install and Set Up Ambari Server

Set a Local Reference to a Remote Ambari Repo

vagrant ssh ambari1
sudo su -
cd /etc/yum.repos.d
wget http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.2.2.0/ambari.repo

Set Up SSH Access and Start the Other 3 VMs

At the same path as the files downloaded from the repository, add your id_rsa and id_rsa.pub keys (see https://wiki.centos.org/HowTos/Network/SecuringSSH section 7 for instructions on CentOS). You could perform these steps on the ambari1 VM and copy these two files to your /vagrant_data folder, which shares data between guest and host. Only after you copy those two files, start the other three VMs:
vagrant up master1
vagrant up slave1
vagrant up slave2

Install Ambari Server

yum install ambari-server

Setup Ambari Server

Run the setup command to configure your Ambari Server, database, JDK, LDAP, and other options:
ambari-server setup

Start Ambari Server

ambari-server start

Deploy Cluster using Ambari Web UI

Open up a web browser and go to: http://ambari1:8080
Log in with username admin and password admin and follow the on-screen instructions, using the hosts created and selecting the services of interest. For more details, see "Automated Install" at: https://docs.hortonworks.com/HDPDocuments/Ambari/Ambari-2.2.2.0/index.html
06-11-2016
11:03 PM
@Armando Segnini Thank you so much for your review. Your findings were spot-on. I had a few typos and omitted a mv command. Excellent catches.
06-01-2016
01:47 AM
17 Kudos
Introduction

Ambari 2.2.2 provides a better understanding of cluster health and performance metrics through advanced visualizations and pre-built dashboards, isolating critical metrics for core cluster services such as Kafka, reducing the time to troubleshoot problems, and improving the level of service for cluster tenants. The Ambari Metrics System (AMS) also has a new API to discover metrics. Only HDFS, YARN and HBase default dashboards are included, along with System metrics. Ambari 2.2.2 comes with built-in Grafana integration. A Kafka default Grafana dashboard is targeted for Ambari 2.4. Until then, you can build a custom dashboard to see over 30 Kafka BrokerTopicMetrics using Grafana.

Tutorial

1. Install Ambari 2.2.2 or upgrade to Ambari 2.2.2. On upgrade, to enable Grafana, follow the instructions from Upgrade to Ambari 2.2.2.
2. Automated install of HDP 2.4 with Ambari 2.2.2, including Kafka 0.9.0.2.4. Grafana is added automatically on a new install.
3. Explore the Ambari Kafka metrics, which can be accessed at: http://<ams-host>:6188/ws/v1/timeline/metrics/metadata/
4. Add Grafana to Ambari Metrics.
5. Access Grafana to see the out-of-box dashboards. Port 3000 is the default for the Grafana UI. This is view-only; built-in Grafana does not allow creation of new dashboards.
6. A few Kafka metrics are displayed by default; however, to see more Kafka metrics, click on the Kafka link in the left nav, then click the big "+" sign to add a new widget for one or many metrics. Select a widget from the Widget Browser window: click on "Create Widget", select a widget type, say "Graph", then "Add Metric" > Kafka > All Kafka Brokers, add a metric, for example kafka.server.BrokerTopicMetrics.BytesInPerSec.1MinuteRate, and select the aggregation type. Follow the on-screen instructions and save. The new widget will be added to the Kafka metrics dashboard. Repeat the steps for all Kafka Broker Topic Metrics, as necessary. Unfortunately, you can't add a widget for a specific topic or a specific broker, at least not yet.
7. To create a custom dashboard adding the desired Kafka topic and broker metrics, a separate Grafana installation is needed. This could be on your development machine. Follow the instructions to build a custom dashboard. After it is complete, deploy it on the server.

Conclusion

The Ambari 2.2.2 metrics dashboard can provide reasonable insight into all broker metrics. Combined with Burrow and the promise of more Kafka metrics in the next versions of Ambari, the near future seems promising for Kafka monitoring.
05-27-2016
09:34 PM
@Vladimir Zlatkin Updated section "NUMA optimization" to include a link to OS CPU optimizations for RHEL: "Spark application performance can be improved not only by configuring various Spark parameters and JVM options, but also by using operating-system-side optimizations, e.g. CPU affinity, NUMA policy, hardware performance policy, etc., to take advantage of the most recent NUMA-capable hardware." The referenced section is vast. We could narrow some of the settings down to best choices. This could be a follow-up article if part 1 generates real interest.
05-24-2016
06:49 PM
Section 2.7. You probably meant /lib instead of /bin. That's where nar files are deployed.
05-24-2016
03:19 PM
In production, for remote access, you would have to deal with firewall issues; however, for special cases when troubleshooting high-severity issues is needed, Ops folks may agree to perform the needed changes. Additionally, you need to start the JVM with something like -Djava.rmi.server.hostname=<host-ip> in order to be able to truly access the JVM remotely (from a different host); this forces the RMI service to use the host IP instead of 127.0.0.1. By the nature of the Hadoop beast, most of the tools in the ecosystem have multiple JVMs, and some of them are volatile, existing just to perform a task. Getting a lot of value out of jvisualvm could be quite difficult, but it might prove useful in some boundary scenarios.