Created on 04-23-2016 04:58 AM - edited 08-17-2019 12:43 PM
After completing this tutorial you will understand how to:
leverage Spark to infer a schema on a CSV dataset and persist it to Hive without explicitly declaring the DDL
deploy the Spark Thrift Server on the Hortonworks Sandbox
connect an ODBC tool (Tableau) to the Spark Thrift Server via the Hive ODBC driver, leveraging caching for ad-hoc visualization
Assumption 1: It is assumed that you have downloaded and deployed the Hortonworks sandbox, installed the Hive ODBC driver on your host machine, and installed Tableau (or your preferred ODBC-based reporting tool).
Assumption 2: Please ensure that your host machine's /etc/hosts file has the appropriate entry mapping sandbox.hortonworks.com to the IP of your sandbox (e.g., 172.16.35.171 sandbox.hortonworks.com sandbox).
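If you want to confirm the mapping before continuing, the following is a minimal sketch of a check, not part of the original tutorial. The helper name hosts_entry_ip is hypothetical, and the sample text simply reuses the example entry above. It parses hosts-file text and returns the IP mapped to a given hostname.

```python
def hosts_entry_ip(hosts_text, hostname):
    """Return the IP that hosts-file text maps to hostname, or None if absent.

    hosts_entry_ip is a hypothetical helper written for this illustration.
    """
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments and whitespace
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        if hostname in names:
            return ip
    return None


# Sample text reusing the example entry from Assumption 2.
sample = "127.0.0.1 localhost\n172.16.35.171 sandbox.hortonworks.com sandbox\n"
print(hosts_entry_ip(sample, "sandbox.hortonworks.com"))  # prints 172.16.35.171
```

To check your real file, read it first (for example with open("/etc/hosts").read()) and pass the text to the same function. The result should be the IP of your sandbox, which may differ from the example address.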
Deploying the Spark Thrift Server
Within Ambari, click on the Hosts tab and then select the sandbox.hortonworks.com node from the list.
Now you can click “Add” and choose Spark Thrift Server from the list to deploy a thrift server.
After installing, start the thrift server via the service menu.
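Once the service reports as started, a quick way to confirm that something is listening is to attempt a TCP connection from your host machine. The sketch below is an illustration only. The helper port_is_open is hypothetical, and the port 10015 in the commented example is an assumption; use whatever thrift port is shown in the Spark configuration in Ambari.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    port_is_open is a hypothetical helper written for this illustration.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


# Example call (the port value is an assumption, check your Ambari configuration):
# port_is_open("sandbox.hortonworks.com", 10015)
```

A True result only shows that the port accepts connections. It does not prove that the thrift server will accept queries, so the ODBC connection test later in the tutorial is still the real check.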
Loading the Data
The code blocks below are each intended to be executed in their own Zeppelin notebook cells. Each cell begins with a '%' indicating the interpreter to be used.