Member since: 09-14-2015
Posts: 79
Kudos Received: 91
Solutions: 22
12-09-2015
04:32 PM
Falcon can be used to mirror/replicate Hive tables to another cluster: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_data_governance/content/section_mirroring_data_falcon.html
12-09-2015
04:30 PM
4 Kudos
I am unaware of an option to export a whole database, but you could implement something quickly via a bash script. A rough sketch would be: #!/bin/bash
# list every table in the database, then export each one to HDFS
tables=$(hive -e "USE database_name; SHOW TABLES;")
for table in $tables
do
  # adjust the export target path as needed
  hive -e "USE database_name; EXPORT TABLE $table TO '/tmp/export/$table';"
done
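On the destination cluster, a matching loop could re-create the tables with Hive's IMPORT command. This is only a minimal sketch under the same assumptions as above; the /tmp/export path and database name are illustrative:
#!/bin/bash
# import every table directory previously produced by the EXPORT loop
for dir in $(hdfs dfs -ls /tmp/export | awk '/^d/ {print $NF}')
do
  table=$(basename "$dir")
  hive -e "USE database_name; IMPORT TABLE $table FROM '$dir';"
done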
12-08-2015
07:06 PM
2 Kudos
Environment
Best practices dictate that, where possible, a Hadoop cluster should be maintained behind a firewall to minimize any potential security vulnerabilities that may arise from exposed ports and web interfaces. A common approach to enabling user access in this situation is to open up SSH to a set of gateway/edge nodes. This ensures that users must authenticate before accessing any pieces of the Hadoop ecosystem and implicitly encrypts all data sent between the client and the cluster. This is a common setup for vanilla cloud-based installations.
The problem with this setup is that, by default, all access is limited to the CLI on the gateway machines. Users outside of the cluster firewall cannot access valuable features such as web UIs and JDBC/ODBC connections. There are a few options to securely enable these capabilities:
1. Enable Kerberos+SPNEGO and Knox, then open up the appropriate ports in the firewall.
2. Implement firewall rules to expose specific ports and hosts to a subset of known client IPs.
3. Leverage SSH tunneling to route traffic over an SSH connection and into the cluster.
This article focuses on #3. The best solution will vary on a case-by-case basis, but SSH tunneling is the simplest and requires no intervention by Ops staff once SSH is enabled.
Accessing Web UIs via a SOCKS Proxy
You can use SSH to open a local port that connects to a remote environment and behaves like a SOCKS proxy. Once this tunnel is established, you can configure your web browser to use the proxy, and all web traffic will be routed over the tunnel and into the cluster environment (behind the firewall, where the environment is open). The following command will open a tunnel to the machine gateway.hdp.cluster, which has SSH enabled:
ssh -D 8080 -f -C -q -N username@gateway.hdp.cluster
The parameters map to the following:
-D the local port to listen on
-f send this ssh operation into the background after password prompts
-C use compression
-q quiet mode -- suppress warnings and diagnostic messages
-N do not execute a remote command or wait for the user to provide any commands
Once the tunnel is established, open your web browser and navigate to the "Network Settings" tab. Under the proxy settings, enable the SOCKS proxy and enter localhost and port 8080. Now all web traffic from your browser will be routed over the tunnel and appear as if it is coming from gateway.hdp.cluster. You should be able to load web UIs that are behind the firewall, such as Ambari or the NameNode UI.
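As a quick check that the proxy is working, you can point a command-line client at it before reconfiguring your browser. A minimal sketch, assuming the NameNode UI is on its default port 50070; the hostname namenode.hdp.cluster is illustrative and will differ per cluster:
# route a request through the local SOCKS proxy opened by the ssh -D command above
curl --socks5-hostname localhost:8080 http://namenode.hdp.cluster:50070/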
Establishing an ODBC/JDBC Connection via an SSH Tunnel
For an ODBC/JDBC connection, the behavior we want is a bit different from the previous section. We want to map a local port to a port on a remote machine within the firewall, specifically the HiveServer2 port. We can do that as follows:
ssh -L 10000:hiveserver2.hdp.cluster:10000 username@gateway.hdp.cluster
Now an application on the client can connect to localhost on port 10000 and, to the application, it will appear as if it is connecting directly to hiveserver2.hdp.cluster on port 10000. Under the covers, data is actually going over the SSH tunnel to gateway.hdp.cluster and then being routed to port 10000 on the hiveserver2.hdp.cluster node. To configure the ODBC/JDBC connection on the client, simply use localhost and port 10000 in place of the HiveServer2 host in the JDBC/ODBC connection parameters.
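For example, with the tunnel above in place, a Beeline client on your laptop could connect through it. This is just a sketch; the username and the default database are placeholders, and any authentication parameters your cluster requires would still apply:
# connect to the remote HiveServer2 through the local end of the tunnel
beeline -u "jdbc:hive2://localhost:10000/default" -n username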
12-03-2015
06:21 PM
1 Kudo
I've attached a very rough template of an SFTP pipeline that does what you are looking for @Guilherme Braccialli. You could replace the initial GetSFTP processor with a GetFile processor and get essentially the same functionality. It polls a directory for *.DONE files every 5 seconds. When it picks them up, it pushes them through the pipeline, encrypting/compressing them and dropping them off in HDFS and another SFTP directory. Setting the "Keep Source File" property to false in the GetFile and GetSFTP processors deletes each file after it is picked up so it isn't captured multiple times.
12-03-2015
06:13 PM
7 Kudos
Overview
The objective of this article is to introduce the ESRI Spatial Framework for Hadoop and demonstrate how to use it with HDP 2.3.2. The following is a high-level view of how we will accomplish this:
- Set up the ESRI Spatial Framework for Hadoop on the HDP 2.3.2 Sandbox
- ETL data from the Open Baltimore Data Portal
- Execute simple geospatial queries to correlate crime data in Baltimore with spatial neighborhood data
Prior to continuing the tutorial, please ensure you have downloaded and started the HDP Sandbox.
ESRI Spatial Framework for Hadoop
The ESRI Spatial Framework for Hadoop is a collection of Hive UDFs that allow users to perform complex spatial analysis directly in Hive. The framework has built-in support for representing geometric shapes (point, polygon, etc.) as well as functions that operate on these shapes. For example, users can perform a binary test for overlap between a pair of polygons or compute the geometry of their intersection. This framework provides a powerful method for stitching together datasets with geospatial features that otherwise may not have been able to be correlated.
Setting up the framework on HDP
The framework itself is open source, so we need to clone the repositories and build them in order to use it. This can be done as follows:
git clone https://github.com/Esri/geometry-api-java.git
cd geometry-api-java
mvn clean install
cd ..
git clone https://github.com/Esri/spatial-framework-for-hadoop.git
cd spatial-framework-for-hadoop
mvn clean package
The relevant libraries that we will use later are geometry-api-java/target/esri-geometry-api-1.2.1.jar and spatial-framework-for-hadoop/hive/target/spatial-sdk-hive-1.1.1-SNAPSHOT.jar. Please note their locations for later use.
Data
The data we will use is sourced from the Open Baltimore Data Portal. I have linked (in the next section) data that I have already obtained and prepped, but I include the steps in this section for completeness. There is no need to perform these steps if you use the attached files, in which case you can proceed directly to the Importing Data into Hive section.
Neighborhood Census Data
The data relating to the 2010 census can be exported as an ESRI Shapefile from here. This includes a breakdown of demographics and, more importantly, shape data for the polygon that represents each neighborhood. The exported zip archive contains several files, but the *.shp file is the one we are after. The spatial framework supports a couple of JSON formats, so we need to convert the Shapefile before we can use it. On your Sandbox, you can convert this Shapefile to GeoJSON format with the following steps:
yum install gdal
ogr2ogr -f "Geojson" -t_srs "WGS84" 2010_census.json 2010_Census_Profile_by_Neighborhood_Statistical_Areas.shp
Now, delete the first 3 lines and the last 2 lines of the file so it starts with:
{ "type": "Feature", "properties": ...
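If you prefer to do that trimming from the shell, something like the following should work. This is a minimal sketch, assuming GNU sed and head as found on the CentOS-based Sandbox:
# drop the wrapper lines that ogr2ogr adds around the per-feature records
sed -i '1,3d' 2010_census.json            # remove the first 3 lines
head -n -2 2010_census.json > tmp.json    # remove the last 2 lines
mv tmp.json 2010_census.json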
This is the file that is attached here as 2010_census.json. We will load it into Hive in the next step.
Crimes Data
The crime data can be exported as a CSV file from here. There is no prep needed for this file. It is also attached as Crime_Data.csv.
Importing Data into Hive
This section will walk through how to import the attached 2010_census.json and Crime_Data.csv files into HDFS and overlay a schema on them via Hive. The data will remain as-is on disk, and the schema will be applied when the table is read. Transfer the two files to the Sandbox via SCP or your preferred method, then put them into the HDFS tmp directory:
hdfs dfs -put 2010_census.json /tmp
hdfs dfs -put Crime_Data.csv /tmp
Note that the 2.3.2 Sandbox does not have a directory in HDFS for the root user, and this can cause Hive queries to fail. To remedy this, issue the following commands as root:
su hdfs
hdfs dfs -mkdir /user/root
hdfs dfs -chown root:hdfs /user/root
exit
You will need the most recent version of the Hive JSON SerDe as well. You can get it as follows:
wget https://github.com/sheetaldolas/Hive-JSON-Serde/a...
unzip json-serde-1.1.9.8.zip
From Hive, you can now create the corresponding crimes and census tables and ingest them into a more optimized ORC format.
Census Data
We can overlay a schema on the census JSON data using the Hive JSON SerDe. The SerDe is responsible for telling Hive how to read and write in JSON format. Notice that the schema maps directly to the JSON structure you see in 2010_census.json.
add jar Hive-JSON-Serde-json-serde-1.1.9.8/dist/json-serde-1.1.9.2-Hive13-jar-with-dependencies.jar;
CREATE TABLE census_text (
type string,
properties map<string,string>,
geometry string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/2010_census.json' OVERWRITE INTO TABLE census_text;
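At this point, an optional quick sanity check can confirm the SerDe is parsing the file; the property names here are the same ones used in the ORC ingest below:
-- peek at a few parsed records from the raw JSON-backed table
SELECT properties['name'], properties['population']
FROM census_text
LIMIT 5;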
Next, we will create an optimized ORC-backed table and import the census data. This process will leverage the JSON SerDe to apply a schema when the JSON is read, and then we will extract specific properties from the JSON and promote them to first-class fields in the ORC table. This will allow us to leverage the predicate pushdown features of ORC and produce more efficient queries when these fields are used as conditions. Notice that the geometry is left as a plain string -- the spatial framework will read it at query time.
set hive.execution.engine=mr;
-- A simple schema with several features promoted to fields in the table.
CREATE TABLE census_orc (
name string,
population double,
male double,
female double,
age_0_4 double,
age_5_11 double,
age_12_14 double,
age_15_17 double,
age_18_24 double,
age_25_34 double,
age_35_44 double,
age_45_64 double,
age_65_ovr double,
vacant double,
occupied double,
geometry string
)
STORED AS ORC;
-- We are casting many features from string to a more appropriate type of double as we
-- ingest into the ORC table.
INSERT INTO TABLE census_orc select properties['name'],
cast(properties['population'] as double),
cast(properties['Male'] as double),
cast(properties['Female'] as double),
cast(properties['AGE0_4'] as double),
cast(properties['AGE5_11'] as double),
cast(properties['AGE12_14'] as double),
cast(properties['AGE15_17'] as double),
cast(properties['AGE18_24'] as double),
cast(properties['AGE25_34'] as double),
cast(properties['AGE35_44'] as double),
cast(properties['AGE45_65'] as double),
cast(properties['AGE65ovr'] as double),
cast(properties['Vacant'] as double),
cast(properties['Occupied'] as double),
geometry from census_text where geometry != 'NULL';
Crime Data
CREATE TABLE crimes_text(crime_date string,
code string,
location string,
description string,
weapon string,
post string,
district string,
neighborhood string,
coordinates string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
TBLPROPERTIES ("skip.header.line.count"="1");
LOAD DATA INPATH '/tmp/Crime_Data.csv' OVERWRITE INTO TABLE crimes_text;
-- Basic schema for an ORC-backed table for crimes
CREATE TABLE crimes_orc(crime_date string,
code string, location string, description string, weapon string, post string, district string, neighborhood string, latitude double, longitude double)
STORED AS ORC;
-- Load the plain CSV data into an optimized ORC table. The coordinates field is split
-- and the individual latitude and longitude values are extracted into separate fields.
INSERT INTO TABLE crimes_orc
SELECT crime_date, code, location, description, weapon, post, district, neighborhood, cast(substr(split(coordinates, ',')[0], 2) as double), cast(split(split(coordinates, ',')[1], '\\)')[0] as double)
FROM crimes_text;
Example Queries
Now that we have done all of the work to set up the framework and get the data prepped and ready, we can finally run a couple of sample queries. We will add the ESRI jars and create some simple aliases for the spatial functions we need as follows:
add jar geometry-api-java/target/esri-geometry-api-1.2.1.jar;
add jar spatial-framework-for-hadoop/hive/target/spatial-sdk-hive-1.1.1-SNAPSHOT.jar;
create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
create temporary function ST_GeomFromGeoJson as 'com.esri.hadoop.hive.ST_GeomFromGeoJson';
One quick query we can run is to map crimes to a neighborhood via the ST_Contains relationship test and count the total per neighborhood:
set hive.vectorized.execution.enabled = false;
set hive.execution.engine=tez;
select census_orc.name, count(*) cnt FROM
census_orc JOIN crimes_orc WHERE crimes_orc.description="HOMICIDE" and
ST_Contains(ST_GeomFromGeoJSON(census_orc.geometry), ST_Point(crimes_orc.longitude, crimes_orc.latitude))
GROUP BY census_orc.name ORDER BY cnt desc LIMIT 10;
Notice that this induces a full cross product to be computed between the two tables. This is a very compute- and time-intensive operation (taking over four minutes on my Sandbox). It is more efficient to narrow the data down to only those specific elements in which we are interested, which is why we have restricted the search to the most severe crimes, where the description field is labeled 'HOMICIDE'. This query runs in 266 seconds on my Sandbox, which is directly related to our efforts to trim the data with the available features and avoid unnecessary operations (especially expensive spatial operations). The output:
Coldstream Homestead Montebello 44
Belair-Edison 38
Sandtown-Winchester 37
Central Park Heights 33
Frankford 32
Oliver 30
East Baltimore Midway 27
Broadway East 26
Upton 24
Brooklyn 23
Time taken: 266.383 seconds, Fetched: 10 row(s)
Conclusion
At this point we have set up the ESRI Spatial Framework for Hadoop and executed some simple queries. I encourage you to explore the API a bit more and see if you can discover anything within the data. I also encourage you to explore the other data sets provided by the city of Baltimore. There are many more data sets available that can be correlated to these and lead to interesting results (for example, CCTV locations). Thanks to @David Kaiser for spending the time with me and providing several pointers and tips to get me up to speed in this space.
11-20-2015
10:50 PM
An extra comment with an attachment because I was limited to 2 attachments in my original answer 🙂 Note that this was derived from an instance of HDP 2.2.4 with Ranger 0.4 but should translate over to Ranger 0.5 on HDP 2.3.
11-20-2015
10:49 PM
2 Kudos
Hi @rgarcia, I started putting together the schema myself but never completed it. I think I have sufficiently documented what you need, though. See the attachments for a description of a few of the tables. It is worth noting that there are two databases in a default installation. The ranger_audit database contains all of the audit information regarding the HDP-based repositories (e.g., who touched what HDFS file, when, and whether it was allowed). The ranger database contains several tables pertaining to the metadata about each repo and the policies around them. The audit information about who changed the policies and when is also stored here. I also have a couple of example queries below to help understand them, specifically the ranger database, which you need if you want to understand and audit how the policies themselves are changed.
The following query will gather all of the resources that the user 'hive' has been explicitly allowed to access:
SELECT * FROM
(select x_resource.policy_name, x_user.user_name, x_resource.res_name, x_resource.res_cols, x_resource.res_dbs, x_resource.res_col_fams from x_resource
join x_perm_map
on x_perm_map.res_id = x_resource.id
join x_user
on x_user.id = x_perm_map.user_id) as policies
WHERE policies.user_name = 'hive';
The next query will do the same for the group 'hadoop':
SELECT * FROM
(select x_resource.policy_name, x_group.group_name, x_resource.res_name, x_resource.res_cols, x_resource.res_dbs, x_resource.res_col_fams from x_resource
join x_perm_map
on x_perm_map.res_id = x_resource.id
join x_group
on x_group.id = x_perm_map.group_id) as policies
WHERE policies.group_name = 'hadoop';
If you are looking for all users that have admin privileges somewhere within Hive, HDFS, or HBase, you can find them as follows:
SELECT * FROM
(select x_resource.policy_name, x_user.user_name, x_resource.res_name, x_perm_map.perm_type from x_resource
join x_perm_map
on x_perm_map.res_id = x_resource.id
join x_user
on x_user.id = x_perm_map.user_id) as policies
WHERE policies.perm_type=6;
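As one more illustration built from the same tables used above (x_resource, x_perm_map), the following sketch counts how many distinct users are explicitly granted some permission on each policy; it relies only on the columns already shown in these queries:
SELECT x_resource.policy_name, COUNT(DISTINCT x_perm_map.user_id) AS user_count
FROM x_resource
JOIN x_perm_map ON x_perm_map.res_id = x_resource.id
GROUP BY x_resource.policy_name
ORDER BY user_count DESC;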
11-20-2015
05:25 PM
2 Kudos
One option that comes to mind is to leverage a custom InputFormat. HDFS doesn't care about where it breaks a file, so the InputFormat helps ensure that there are no awkward breaks between blocks when reading files. With this approach, you can define your own notion of a record, whether it be a line of text (TextInputFormat) or a window that encapsulates multiple records. You can then use this custom InputFormat to read the data into an MR job, or you can use it to develop your own custom Pig loader to work with your data in Pig. I am not personally aware of any libraries that have been built to address time-series specifically.
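To make the idea concrete, here is a minimal, hypothetical sketch using the new mapreduce API; the class name is illustrative, and a real implementation would replace the delegated LineRecordReader with a reader that groups lines into time windows:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class TimeWindowInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Keep each file in a single split so a logical time window is never
        // broken across block/split boundaries.
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        // Placeholder: delegate to the stock line reader. A real time-series
        // reader would buffer lines here and emit one record per time window.
        return new LineRecordReader();
    }
}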
11-19-2015
03:41 AM
7 Kudos
The way I have done this is to first register a temp table in Spark and then leverage the sql method of the HiveContext to create a new table in Hive using the data from the temp table. For example, if I have a DataFrame df and a HiveContext hc, the general process is: df.registerTempTable("my_temp_table")
hc.sql("CREATE TABLE new_table_name STORED AS ORC AS SELECT * from my_temp_table")
11-17-2015
10:23 PM
2 Kudos
Here are some resources that I have found useful when interacting with Kerberos-enabled services from Storm:
Storm-HDFS with Kerberos: https://github.com/apache/storm/tree/master/external/storm-hdfs
Storm-HBase with Kerberos: https://github.com/apache/storm/tree/master/external/storm-hbase
Storm-Hive with Kerberos: https://github.com/apache/storm/tree/master/external/storm-hive