Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203
07-29-2016
09:19 AM
9 Kudos
Introduction

This article is a complement to "Geo-spatial Queries with Hive using ESRI Geometry and Spatial Framework for Hadoop" and includes a few more findings, mainly documenting the differences between the ST_Geometry functions supported in Hive and those in commercial spatial packages for Oracle, SQL Server, or Netezza. The Hive UDFs are modeled after existing implementations of ST_Geometry. Some functions exist only in Hive's implementation; a few behave differently or do not exist.

Additional ST_Geometry Functions in Hive

- ST_GeomFromJSON - create a geometry from Esri JSON
- ST_AsJSON - return the Esri JSON representation of a geometry
- ST_GeomFromGeoJSON - create a geometry from GeoJSON
- ST_AsGeoJSON - return the GeoJSON representation of a geometry (requires geometry-api-java version 1.1)
- ST_PointZ - constructor for 3D points from a direct position
- ST_SetSRID - set the spatial-reference ID for a geometry constructed without one
- ST_GeodesicLengthWGS84 - geodesic length in meters rather than in angular units
- ST_Bin, ST_BinEnvelope - aggregate into a regular grid (tutorial here)
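A minimal sketch of two of these functions together (assuming temporary functions have been declared for them, as shown in the Tips section below):

select ST_GeodesicLengthWGS84(ST_SetSRID(ST_LineString(0.0,0.0, 0.3,0.4), 4326));
-- geodesic length in meters (roughly 55 km) of a 0.5-degree WGS84 segment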
ST_Geometry Functions with Different Behavior

- Overloaded constructors - These overloaded constructors differ from other ST_Geometry implementations in how the caller can specify the spatial-reference ID. The default SRID is planar when the SRID is not specified. Hive does not accept an SRID as a second argument - wrap the constructor with ST_SetSRID or use ST_GeomFromText instead (see the sketch at the end of this section). Applies to ST_Point, ST_LineString, ST_Polygon, ST_MultiPoint, ST_MultiLineString, and ST_MultiPolygon.
- ST_PointN - The return value differs in the case of an index out of range - Hive returns null.
- ST_AsText - The OGC WKT standard dictates that a multipoint is represented as MULTIPOINT ((1 2), (3 4)); however, some existing WKT parsers accept only MULTIPOINT (1 2, 3 4). ST_AsText outputs the former, compliant format, with the nested parentheses.
- ST_Envelope - In the case of a point or a vertical or horizontal line, ST_Envelope may either apply a tolerance or return an empty envelope.
- ST_Intersection - In the case where the two geometries intersect in a lower dimension, ST_Intersection may drop the lower-dimension intersections or output a closed linestring.
ST_Intersection(ST_Polygon(2,0, 3,1, 2,1), ST_Polygon(1,1, 4,1, 4,4, 1,4)) -- MULTIPOLYGON EMPTY or LINESTRING (2 1, 3 1, 2 1)
- ST_Union - may drop lower-dimension members of the union.
ST_Union(ST_LineString(2,3, 4,5), ST_Point(1,1)) -- MULTILINESTRING ((2 3, 4 5))
- ST_SymmetricDiff - Hive-spatial follows the naming in the Esri implementation of ST_Geometry. For the OGC naming, use an alias:
create temporary function ST_SymDifference as 'com.esri.hadoop.hive.ST_SymmetricDiff';
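A minimal sketch of the SRID-wrapping pattern and the compliant multipoint output described above (again assuming the functions have been declared):

select ST_SRID(ST_SetSRID(ST_Point(1.0, 2.0), 4326));  -- 4326
select ST_AsText(ST_MultiPoint(1,2, 3,4));             -- MULTIPOINT ((1 2), (3 4))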
ST_Geometry Functions Not Supported in Hive

- ST_GeomCollection, ST_NumGeometries, ST_GeometryN - collections of varying geometry types are not supported; Hive supports arrays.
- ST_Geometry - no constructor of this name - use one of the other constructors, e.g. ST_GeomFromText (see the sketch at the end of this list).
- ST_Curve, ST_Surface, ST_MultiCurve, ST_MultiSurface - curve and surface constructors are not supported.
- ST_PointOnSurface - not supported in Hive.
- ST_GeoSize - not supported in Hive.
- ST_Transform - not supported in Hive.
- ST_Equalsrs - not supported in Hive.
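Since there is no generic ST_Geometry constructor, ST_GeomFromText covers the same ground - a minimal sketch:

select ST_AsText(ST_GeomFromText('linestring (10 10, 20 20)'));  -- LINESTRING (10 10, 20 20)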
Key Resources

Complete list of functions supported in Hive: https://github.com/Esri/spatial-framework-for-hadoop/blob/master/hive/function-ddl.sql
UDF documentation: https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation

Tips

To share functions globally across sessions, create them without the "temporary" option. This has the advantage that you do not need to declare the functions for every session.
create function ST_AsBinary as 'com.esri.hadoop.hive.ST_AsBinary';
You can also include the jar file in the create function statement, which makes it easier to create a permanent declaration. For example, for the definition of the ST_Point function you would write the following SQL statement:
create function ST_Point as 'com.esri.hadoop.hive.ST_Point' using jar 'hdfs://YourHDFSClientNode:8020/esri/spatial-sdk-hive-1.1.1-SNAPSHOT.jar';
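To verify that a permanent function was registered, a quick check (describe function is standard HiveQL):

describe function ST_Point;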
Final Notes

As discussed with ESRI recently, there are no plans to open source all spatial functions currently available for traditional RDBMSs like Oracle, SQL Server, or Netezza, as those are commercially licensed packages. The best option to compensate for the 5-10% of missing functions is to contribute to ESRI's open source repository: https://github.com/Esri/spatial-framework-for-hadoop. ESRI does not provide a commercial library for Hive that includes all spatial functions. Thanks to Mansour Raad, Big Data Expert at ESRI, for his candid and valuable input. Check Mansour's latest article: Geospatial Big Data - the next big trend in analytics
07-14-2016
02:21 AM
Multi-Polygon example:
SELECT st_astext(ST_MultiPolygon('multipolygon (((0 0, 0 1, 1 0, 0 0)), ((2 2, 2 3, 3 2, 2 2)))'))
from YourTable LIMIT 1;
07-09-2016
01:30 AM
9 Kudos
Objective

The objective of this article is to provide an introduction to User-Defined Functions for spatial analysis in Hive.

Credit

This article is based on multiple ESRI resources referenced throughout the article. Thanks also to my colleague Artem Ervits for his contribution to the proof-of-concept that led to this article.

Spatial Types
Spatial types are data types that store geometry data. They have associated functions or methods that are used to access and manipulate the data using Structured Query Language (SQL). Unlike Oracle, DB2, SQL Server, or PostgreSQL, Hive does not yet support a native geometry data type. The best approach is to store spatial data as text; this does not impact the use of the UDFs available for Hive.
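For instance, a minimal sketch of the text-based approach - demo_shape_point mirrors the sample table used later in this article (the actual DDL lives in the demo repository), and the insert syntax assumes Hive 0.14 or later:

create table demo_shape_point (shape string);
insert into table demo_shape_point values ('point (1.5 2.5)');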
ESRI Libraries

The Esri Geometry API for Java includes geometry objects (e.g. points, lines, and polygons), spatial operations (e.g. intersects, buffer), and spatial indexing. By deploying the library (as a jar) within Hadoop, you are able to build custom MapReduce applications using Java to complete analysis on your spatial data. It can be used as a standalone library, or combined with the Spatial Framework for Hadoop to create a SQL-like experience.

The Spatial Framework for Hadoop includes, among others, the Hive Spatial library with User-Defined Functions and SerDes for spatial analysis in Hive. By enabling this library in Hive, you are able to construct queries using Hive Query Language (HQL), which is very similar to SQL. This allows you to avoid complicated MapReduce algorithms and stick to a more familiar workflow. The API used by the Hive UDFs could also be used by developers building geometry functions for 3rd-party applications using Storm, Spark, HBase, etc.
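A minimal sketch of what enabling the library in Hive looks like in practice (the jar paths are hypothetical; the class names are from the Spatial Framework for Hadoop):

add jar hdfs:///esri/esri-geometry-api.jar;
add jar hdfs:///esri/spatial-sdk-hive-1.1.1-SNAPSHOT.jar;
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
create temporary function ST_Polygon as 'com.esri.hadoop.hive.ST_Polygon';
create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
select ST_Contains(ST_Polygon(1,1, 1,4, 4,4, 4,1), ST_Point(2, 3));  -- true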
A good description of the ESRI Java Geometry Library: https://github.com/Esri/geometry-api-java/wiki
UDF Documentation: https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation
The hierarchy of the ST_Geometry data type and its subclasses is described below. ST_Geometry's subclasses are divided into two categories: the base geometry subclasses and the homogeneous collection subclasses. The base geometries include ST_Point, ST_LineString, and ST_Polygon, while the homogeneous collections include ST_MultiPoint, ST_MultiLineString, and ST_MultiPolygon. As the names imply, the homogeneous collections are collections of base geometries. In addition to sharing base geometry properties, homogeneous collections have some of their own properties.

Each subclass stores the type of geometry implied by its name; for instance, ST_MultiPoint stores multipoints. A list of the subclasses and their descriptions follows:
- ST_Point - A zero-dimensional geometry that occupies a single location in coordinate space. It has a single x,y coordinate value, is always simple, and has a NULL boundary.

- ST_LineString - A one-dimensional object stored as a sequence of points defining a linearly interpolated path. ST_LineStrings have length. An ST_LineString is simple if it does not intersect its own interior. The endpoints (the boundary) of a closed ST_LineString occupy the same point in space. An ST_LineString is a ring if it is both closed and simple. The endpoints normally form the boundary of an ST_LineString unless the ST_LineString is closed, in which case the boundary is NULL. The interior of an ST_LineString is the connected path that lies between the endpoints, unless it is closed, in which case the interior is continuous.

- ST_Polygon - A two-dimensional surface stored as a sequence of points defining its exterior bounding ring and zero or more interior rings. An ST_Polygon has area and is always simple. The exterior and any interior rings define the boundary of an ST_Polygon, and the space enclosed between the rings defines the ST_Polygon's interior. The rings of an ST_Polygon can intersect at a tangent point but never cross.

- ST_MultiPoint - A collection of ST_Points. It has a dimension of 0. An ST_MultiPoint is simple if none of its elements occupy the same coordinate space. The boundary of an ST_MultiPoint is NULL.

- ST_MultiLineString - A collection of ST_LineStrings. ST_MultiLineStrings have length. They are simple if they intersect only at the endpoints of their ST_LineString elements, and nonsimple if the interiors of the ST_LineString elements intersect. The boundary of an ST_MultiLineString is the set of non-intersected endpoints of its ST_LineString elements; it is NULL if all the endpoints of all the elements are intersected. An ST_MultiLineString is closed if all of its ST_LineString elements are closed.

- ST_MultiPolygon - A collection of polygons. ST_MultiPolygons have area. The boundary of an ST_MultiPolygon is the cumulative length of its elements' exterior and interior rings, and its interior is defined as the cumulative interiors of its element ST_Polygons. The boundaries of an ST_MultiPolygon's elements can only intersect at a tangent point.
Note that each subclass inherits the properties of the ST_Geometry superclass but also has properties of its own. Functions that operate on the ST_Geometry data type accept any of the subclass entity types. However, some functions are defined at the subclass level and only accept certain subclasses. For example, the ST_GeometryN function only takes ST_MultiLineString, ST_MultiPoint, or ST_MultiPolygon subtype values as input.

To discover the subclass of an ST_Geometry, you can use the ST_GeometryType function, which takes an ST_Geometry and returns the instantiated subclass in the form of a character string. To find out how many base geometry elements are contained in a homogeneous collection, you can use the ST_NumGeometries function, which takes a homogeneous collection and returns the number of base geometry elements it contains.
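A minimal sketch of ST_GeometryType (st_geometrytype and st_geomfromtext are declared in the demo steps below):

select st_geometrytype(st_geomfromtext('multipoint ((1 2), (3 4))'));  -- ST_MULTIPOINT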
Libraries

For this demo, use the jar files located in the esri.zip file at https://github.com/cstanca1/hdp-hive-spatial. We built them from the latest source code, modifying build.xml and pom.xml for both https://github.com/Esri/spatial-framework-for-hadoop and the dependency library https://github.com/Esri/geometry-api-java to reference Hive 1.2 and Hadoop 2.7, to match HDP 2.4.2 content. If you use Hive 1.1, you can download the binaries from https://github.com/Esri/spatial-framework-for-hadoop/releases

Demo Steps

Demo steps are also described at: https://github.com/cstanca1/hdp-hive-spatial
1) Clone the repo locally:
git clone https://github.com/cstanca1/hdp-hive-spatial.git
2) Extract the 3 jar files from esri.zip and copy them to HDFS. Grant ownership on these jar files to hive:hdfs.
3) Connect to your Hive database using your preferred CLI and execute the content of spatial-geometry-demo-ddl-dml.txt to create the tables and populate them with sample data.
4) Before executing any query, add the ESRI jar libraries to your execution path, for the session or globally:
add jar hdfs://YourHDFSClientNode:8020/esri/esri-geometry-api.jar;
add jar hdfs://YourHDFSClientNode:8020/esri/spatial-sdk-hive-1.1.1-SNAPSHOT.jar;
add jar hdfs://YourHDFSClientNode:8020/esri/spatial-sdk-json-1.1.1-SNAPSHOT.jar;
5) Define temporary functions for a better SQL-like experience:
create temporary function st_geomfromtext as 'com.esri.hadoop.hive.ST_GeomFromText';
create temporary function st_geometrytype as 'com.esri.hadoop.hive.ST_GeometryType';
create temporary function st_point as 'com.esri.hadoop.hive.ST_Point';
create temporary function st_asjson as 'com.esri.hadoop.hive.ST_AsJson';
create temporary function st_asbinary as 'com.esri.hadoop.hive.ST_AsBinary';
create temporary function st_astext as 'com.esri.hadoop.hive.ST_AsText';
create temporary function st_intersects as 'com.esri.hadoop.hive.ST_Intersects';
create temporary function st_x as 'com.esri.hadoop.hive.ST_X';
create temporary function st_y as 'com.esri.hadoop.hive.ST_Y';
create temporary function st_srid as 'com.esri.hadoop.hive.ST_SRID';
create temporary function st_linestring as 'com.esri.hadoop.hive.ST_LineString';
create temporary function st_pointn as 'com.esri.hadoop.hive.ST_PointN';
create temporary function st_startpoint as 'com.esri.hadoop.hive.ST_StartPoint';
create temporary function st_endpoint as 'com.esri.hadoop.hive.ST_EndPoint';
create temporary function st_numpoints as 'com.esri.hadoop.hive.ST_NumPoints';
6) Execute the various spatial queries included in the spatial-geometry-demo-queries.txt file, also listed below:
Counts by geometry type -- assumes that values other than ST_POINT are possible:
select st_geometrytype(st_geomfromtext(shape)), count(shape)
from demo_shape_point
group by st_geometrytype(st_geomfromtext(shape));
Counts by geometry type -- assumes that values other than ST_LINESTRING are possible:
select st_geometrytype(st_geomfromtext(shape)), count(shape)
from demo_shape_linestring
group by st_geometrytype(st_geomfromtext(shape));
Counts by geometry type -- assumes that values other than ST_POLYGON are possible:
select st_geometrytype(st_geomfromtext(shape)), count(shape)
from demo_shape_polygon
group by st_geometrytype(st_geomfromtext(shape));
X and Y coordinates of the point:
select st_x(st_point(shape)) AS X, st_y(st_point(shape)) AS Y
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Extract geometry from text shape:
select st_geomfromtext(shape)
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Geometry type:
select st_geometrytype(st_geomfromtext(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Point geometry as binary - implicitly:
select st_point(shape)
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Point geometry as binary - explicitly:
select st_asbinary(st_geomfromtext(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Point geometry as JSON:
select st_asjson(st_geomfromtext(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Point geometry as text:
select st_astext(st_point(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
SRID for a point:
select st_srid(st_point(shape))
from demo_shape_point
where st_geometrytype(st_geomfromtext(shape)) = "ST_POINT"
limit 1;
Line as text:
select st_astext(st_linestring(shape))
from demo_shape_linestring
where st_geometrytype(st_geomfromtext(shape)) = "ST_LINESTRING"
limit 1;
Nth point of a line (here, the 2nd point):
select st_astext(st_pointn(st_linestring(shape), 2))
from demo_shape_linestring
where st_geometrytype(st_geomfromtext(shape)) = "ST_LINESTRING"
limit 1;
Start and end points of a line:
select
st_astext(st_startpoint(st_linestring(shape))) AS StartPoint,
st_astext(st_endpoint(st_linestring(shape))) AS EndPoint
from demo_shape_linestring
where st_geometrytype(st_geomfromtext(shape)) = "ST_LINESTRING"
limit 1;
Number of points in a polygon:
select shape, st_numpoints(st_geomfromtext(shape)) as NumPoints
from demo_shape_polygon
where st_geometrytype(st_geomfromtext(shape))= "ST_POLYGON"
limit 1;
Points intersection - usually you would have two tables; this is just an example with one-row subqueries:
select st_intersects(a.s1, b.s2)
from
(select st_point(shape) AS `s1` from demo_shape_point limit 1) a join
(select st_point(shape) AS `s2` from demo_shape_point limit 1) b
limit 1;
Lines intersection - usually you would have two tables; this is just an example with one-row subqueries:
select st_intersects(a.s1, b.s2)
from
(select st_linestring(0,0, 1,1) AS `s1` from demo_shape_linestring limit 1) a join
(select st_linestring(1,1, 0,0) AS `s2` from demo_shape_linestring limit 1) b
limit 1;

Conclusion

The Esri Geometry API for Java and the Spatial Framework for Hadoop can be used by developers building geometry functions for various geo-spatial applications using Hive, Storm, Spark, HBase, or anything else from the big data ecosystem.
07-06-2016
09:32 PM
@Alex McLintock You are correct. This is a local reference to a remote repo. Here is a reference for an actual local repo with no Internet access, but it gets complicated with Vagrant: https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_Installing_HDP_AMB/content/_setting_up_a_local_repository_with_no_internet_access.html
06-13-2016
02:41 PM
15 Kudos
Objective

Deploy a 4-node HDP 2.4.2 cluster with Apache Ambari 2.2.2, Vagrant, and VirtualBox on an OS X host. This is helpful for development and proofs of concept.

Scope

This approach has been tested on an OS X host, but it should work on all supported Vagrant and VirtualBox environments.

Pre-requisites

- Minimum 9 GB of RAM for the HDP 2.4.2 cluster
- Download and install Vagrant for your host OS: https://www.vagrantup.com/downloads.html
- Download and install VirtualBox for your host OS: https://www.virtualbox.org/wiki/Downloads
- Download and install a git client for your host
- Open a command shell, change to the folder where you plan to clone the github repository, and clone the following git repository:
git clone https://github.com/cstanca1/hdp2_4_2-vagrant.git

Create and Start VMs

Change directory to hdp2_4_2-vagrant, the folder that includes the Vagrantfile, and create a /data folder:
mkdir data
This /data folder will be needed for the guest VMs to share with the host. Vagrant (via the Vagrantfile) is configured to use CentOS 6.7 as the base box and includes the pre-requisites for installing HDP. 4 VMs will be created: 1 Ambari server (ambari1), 1 Hadoop master (master1) and 2 slaves (slave1, slave2).
vagrant up ambari1
Install and Set Up Ambari Server

Set a Local Reference to a Remote Ambari Repo

vagrant ssh ambari1
sudo su -
cd /etc/yum.repos.d
wget http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.2.2.0/ambari.repo

Set Up SSH Access and Start the Other 3 VMs

At the same path as the files downloaded from the repository, add your id_rsa and id_rsa.pub keys (see https://wiki.centos.org/HowTos/Network/SecuringSSH section 7 for instructions on CentOS). You could perform these steps on the ambari1 VM and copy these two files to your /vagrant_data folder, which shares data between guest and host. Only after you copy those two files, start the other three VMs:
vagrant up master1
vagrant up slave1
vagrant up slave2

Install Ambari Server

yum install ambari-server

Setup Ambari Server

Run the setup command to configure your Ambari Server, database, JDK, LDAP, and other options:
ambari-server setup

Start Ambari Server

ambari-server start

Deploy Cluster using Ambari Web UI

Open up a web browser and go to: http://ambari1:8080
Log in with username admin and password admin and follow the on-screen instructions, using the hosts created and selecting the services of interest. For more details, see "Automated Install" at: https://docs.hortonworks.com/HDPDocuments/Ambari/Ambari-2.2.2.0/index.html
06-11-2016
11:03 PM
@Armando Segnini Thank you so much for your review. Your findings were spot-on. I had a few typos and omitted a mv command. Excellent catches.
06-01-2016
01:47 AM
17 Kudos
Introduction

Ambari 2.2.2 provides a better understanding of cluster health and performance metrics through advanced visualizations and pre-built dashboards, isolating critical metrics for core cluster services such as Kafka, reducing the time to troubleshoot problems, and improving the level of service for cluster tenants. The Ambari Metrics System (AMS) also has a new API to discover metrics. Only HDFS, YARN and HBase default dashboards are included, along with System metrics. Ambari 2.2.2 comes with built-in Grafana integration. A Kafka default Grafana dashboard is targeted for Ambari 2.4. Until then, you can build a custom dashboard to see over 30 Kafka BrokerTopicMetrics using Grafana.

Tutorial

1. Install Ambari 2.2.2 or upgrade to Ambari 2.2.2. On upgrade, to enable Grafana, follow the instructions from Upgrade to Ambari 2.2.2.
2. Automated install of HDP 2.4 with Ambari 2.2.2, including Kafka 0.9.0.2.4. Grafana is added automatically on a new install.
3. Explore the Ambari Kafka metrics, which can be accessed at: http://<ams-host>:6188/ws/v1/timeline/metrics/metadata/
4. Add Grafana to Ambari Metrics.
5. Access Grafana to see the out-of-box dashboards. Port 3000 is the default for the Grafana UI. This is view-only; built-in Grafana does not allow creation of new dashboards.
6. A few Kafka metrics are displayed by default; however, to see more Kafka metrics, click on the Kafka link in the left nav, then click the big "+" sign to add a new widget for one or many metrics. Select a widget from the Widget Browser window: click on "Create Widget", select a widget type, say "Graph", then "Add Metric" > Kafka > All Kafka Brokers, add a metric, for example kafka.server.BrokerTopicMetrics.BytesInPerSec.1MinuteRate, and select the aggregation type. Follow the on-screen instructions and save. The new widget will be added to the Kafka metrics dashboard. Repeat the steps for all Kafka Broker Topic Metrics, as necessary. Unfortunately, you can't add a widget for a specific topic or a specific broker, at least not yet.
7. To create a custom dashboard adding the desired Kafka topic and broker metrics, a separate Grafana installation is needed. This could be on your development machine. Follow the instructions to build a custom dashboard. After it is complete, deploy it on the server.

Conclusion

The Ambari 2.2.2 metrics dashboard can provide reasonable insight into all broker metrics. Combined with Burrow and the promise of more Kafka metrics in the next versions of Ambari, the near future seems promising for Kafka monitoring.
05-27-2016
09:34 PM
@Vladimir Zlatkin Updated section "NUMA optimization" to include a link to OS CPU optimizations for RHEL: "Spark application performance can be improved not only by configuring various Spark parameters and JVM options, but also by using operating-system-side optimizations, e.g. CPU affinity, NUMA policy, hardware performance policy, etc., to take advantage of the most recent NUMA-capable hardware." The referenced section is vast. We could narrow some of the settings down to best choices. This could be a follow-up article if part 1 generates real interest.
05-24-2016
06:49 PM
Section 2.7. You probably meant /lib instead of /bin. That's where nar files are deployed.
05-24-2016
03:19 PM
In production, for remote access, you would have to deal with firewall issues; however, for special cases when troubleshooting high-severity issues is needed, Ops folks may agree to perform the needed changes. Additionally, you need to start the JVM with something like -Djava.rmi.server.hostname=<host-ip> in order to be able to truly access the JVM remotely (from a different host); this forces the RMI service to use the host IP instead of 127.0.0.1. By the nature of the Hadoop beast, most of the tools in the ecosystem have multiple JVMs, and some of them are volatile, existing just to perform a task. Getting a lot of value out of jvisualvm could be quite difficult, but it might prove useful in some boundary scenarios.