Member since
09-14-2015
79
Posts
91
Kudos Received
22
Solutions
04-23-2016
04:58 AM
7 Kudos
After completing this tutorial you will understand how to:
leverage Spark to infer a schema on a CSV dataset and persist it to Hive without explicitly declaring the DDL
deploy the Spark Thrift Server on the
Hortonworks Sandbox
connect and ODBC tool (Tableau) to the Spark Thrift Server via
the Hive ODBC driver, leveraging caching for ad-hoc visualization
Assumption 1: It is assumed that you have downloaded and deployed the Hortonworks sandbox, installed the Hive ODBC driver on your host machine, and installed
Tableau (or your preferred ODBC-based reporting tool).
Assumption 2: Please ensure that your host machine's /etc/hosts file has the appropriate entry mapping sandbox.hortonworks.com to the IP of your sandbox (e.g., 172.16.35.171 sandbox.hortonworks.com sandbox). Deploying the Spark Thrift Server
Within Ambari, click on the Hosts tab and then
select the sandbox.hortonworks.com node from the list.
Now you can click “Add” and choose Spark Thrift
Server from the list to deploy a thrift server.
After installing, start the thrift server via
the service menu.
Loading
the Data
The code blocks below are each intended to be executed in their own Zeppelin notebook cells. Each cell begins with a '%' indicating the interpreter to be used.
Open Zeppelin and create a new notebook: http://sandbox.hortonworks.com:9995
Download and take a peek at the first few lines
of the data:
%sh
wget https://dl.dropboxusercontent.com/u/3136860/Crime_Data.csv
hdfs dfs -put Crime_Data.csv /tmp
head Crime_Data.csv
Load the CSV reader dependency:
%dep
z.load("com.databricks:spark-csv_2.10:1.4.0")
Read the CSV file and infer the schema:
%pyspark
sqlContext = HiveContext(sc)
data = sqlContext.read.load("/tmp/Crime_Data.csv", format="com.databricks.spark.csv", header="true", inferSchema="true")
data.printSchema()
Persist the data to Hive:
%pyspark
data.registerAsTable("staging")
sqlContext.sql("CREATE TABLE crimes STORED AS ORC AS SELECT * FROM staging")
Verify the data is present and able to be
queried:
%sql
select Description, count(*) cnt from crimes
group by Description order by cnt desc
Connecting
Tableau via ODBC
Connect using the Hortonworks Hadoop Hive
connector:
Run the “Initial SQL” to cache the crimes table:
Verify the table is cached in the Thrift Server
UI: http://sandbox.hortonworks.com:4040/storage/
Select the default schema and drag the crimes
table into the tables area
Go to the worksheet and start exploring the
data using the cached table!
... View more
Labels:
12-08-2015
07:06 PM
2 Kudos
Environment Best practices dictate that, where possible, a
Hadoop cluster should be maintained behind a firewall to minimize any
potential security vulnerabilities that may arise from exposed ports and
web interfaces. A common approach to enabling user access in this
situation is to open up SSH into a set of gateway/edge nodes. This
ensures that users must authenticate prior to accessing any pieces of the
Hadoop ecosystem and implicitly encrypts all data sent between the
client and the cluster. This is a common setup for vanilla cloud-based
installations.
The problem with this setup is that, by
default, all access is limited to the CLI on the gateway machines. Users
outside of the cluster firewall cannot access valuable features such as
web UIs, and JDBC/ODBC connections. There are a few options to securely
enable these capabilities: Enable Kerberos+SPNEGO and Knox. Then open up the appropriate ports in the firewall.
Implement firewall rules to expose specific ports and hosts to a subset of known client IPs. Leverage SSH tunneling to route traffic over an SSH connection and into the cluster. This
article focuses on #3. The best solution will vary on a case-by-case
basis but SSH tunneling is the simplest and requires no intervention by
OPs staff once SSH is enabled. Accessing Web UIs via a SOCKS Proxy You
can use SSH to open a local port that connects to a remote environment
and behaves like a SOCKS proxy. Once this tunnel is established, you can
configure your web browser to use the proxy and all web traffic will be
routed over the tunnel and into the cluster environment (behind the
firewall where the environment is open). The following command will open
a tunnel to the machine gateway.hdp.cluster which has SSH enabled: ssh -D 8080 -f -C -q -N username@gateway.hdp.cluster Parameters map to the following: -D the local port to listen on -f send this ssh operation into the background after password prompts
-C use compression -q quiet mode --> suppress warnings and diagnostic messages
-N do not execute remote command or wait for the user to provide any commands Once
the tunnel is established, you can open your web browser navigate to
the "Network Settings" tab. Under the proxy settings, enable the SOCKS
proxy and enter localhost and port 8080. Now all web traffic from your
browser will be routed over the tunnel and appear as if it is coming
from gateway.hdp.cluster. You should be able to load web UIs that are
behind the firewall such as Ambari or the Namenode UI. Establishing an ODBC/JDBC connection vi SSH Tunnel For
an ODBC/JDBC connection, the behavior we want is a bit different than
the previous sections. We want to map a local port to a port on a remote
machine within the firewall, specifically the HiveServer2 port. We can
do that as follows: ssh -L 10000:hiverserver2.hdp.cluster:10000 username@gateway.hdp.cluster
Now, an application on the client can connect to localhost on port 10000
and, to the application, it will appear as if it is connecting directly
to hiveserver2.hdp.local on port 10000. Under the covers data is
actually going over the SSH tunnel to gateway.hdp.cluster and then being
routed to port 10000 on the hiveserver2.hdp.cluster node. To
configure the ODBC/JDBC connection on the client simply use localhost
and port 10000 in place of the hiveserver2 host as part of the JDBC/ODBC
connection parameters.
... View more
Labels:
12-18-2015
07:21 AM
1 Kudo
If you are exploring GeoSpatial processing with Spark you should check out Magellan.
... View more