After completing this tutorial you will understand how to:

  • leverage Spark to infer a schema on a CSV dataset and persist it to Hive without explicitly declaring the DDL
  • deploy the Spark Thrift Server on the Hortonworks Sandbox
  • connect an ODBC tool (Tableau) to the Spark Thrift Server via the Hive ODBC driver, leveraging caching for ad-hoc visualization

Assumption 1: You have downloaded and deployed the Hortonworks Sandbox, installed the Hive ODBC driver on your host machine, and installed Tableau (or your preferred ODBC-based reporting tool).

Assumption 2: Your host machine's /etc/hosts file has an entry mapping the sandbox hostname (e.g., sandbox) to the IP of your sandbox.
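
One way to confirm that the mapping from Assumption 2 is in place is to scan the hosts file for the sandbox hostname. The sketch below is a minimal illustration and not part of the original tutorial: the helper name find_host_ip and the sample address 192.168.56.101 are hypothetical, and your sandbox IP will differ.

```python
def find_host_ip(hosts_text, hostname):
    """Return the IP mapped to `hostname` in hosts-file text, or None."""
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        ip, *names = entry.split()
        if hostname in names:
            return ip
    return None


# Hypothetical sample contents; on a real host, read /etc/hosts instead.
SAMPLE_HOSTS = """
127.0.0.1       localhost
192.168.56.101  sandbox sandbox.hortonworks.com  # hypothetical sandbox IP
"""

print(find_host_ip(SAMPLE_HOSTS, "sandbox"))
```

On a real machine, pass the contents of /etc/hosts as hosts_text; a result of None suggests the entry is missing.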

Deploying the Spark Thrift Server

  • Within Ambari, click on the Hosts tab and then select the node from the list.
  • Now you can click “Add” and choose Spark Thrift Server from the list to deploy a thrift server.


  • After installing, start the thrift server via the service menu.
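
Once the service reports as started, a quick TCP check from the host machine can confirm that the Thrift Server is reachable before any ODBC configuration is attempted. This is a minimal sketch using only the Python standard library; the function name port_is_open is hypothetical, and the port to test is an assumption you should confirm in Ambari (10015 is a commonly used Spark Thrift Server port on HDP, but your deployment may differ).

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hostnames.
        return False
```

For example, port_is_open("sandbox", 10015) should return True once the server is up, assuming that hostname and port match your setup. A False result only says the port is unreachable; it does not distinguish a stopped service from a networking or port-forwarding problem.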


Loading the Data

The code blocks below are each intended to be executed in their own Zeppelin notebook cells. Each cell begins with a '%' indicating the interpreter to be used.

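  • Load the CSV reader dependency (the spark-csv package, which provides the com.databricks.spark.csv format used in the next step), for example with Zeppelin's %dep interpreter before the Spark interpreter starts.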
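  • Read the CSV file and infer the schema (a %pyspark cell):
    from pyspark.sql import HiveContext
    sqlContext = HiveContext(sc)
    # header="true" takes column names from the first row; inferSchema="true" scans the data to pick column types
    data = sqlContext.read.load("/tmp/Crime_Data.csv", format="com.databricks.spark.csv", header="true", inferSchema="true")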
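  • Register the DataFrame as a temporary table named staging, then persist it to Hive as an ORC table:
    data.registerTempTable("staging")
    sqlContext.sql("CREATE TABLE crimes STORED AS ORC AS SELECT * FROM staging")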
  • Verify that the data is present and can be queried (a %sql cell):
    select Description, count(*) cnt from crimes
    group by Description order by cnt desc
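
To make the inferSchema option less opaque, the following plain-Python sketch shows the general idea of type inference over CSV text: try progressively wider types for each column and fall back to string. It is a simplified illustration only, not the algorithm Spark or the CSV reader actually uses; the function names and the sample rows are made up for the example.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest of int, float, string that fits every non-empty value."""
    present = [v for v in values if v]
    if not present:
        # Nothing to inspect, so fall back to the most general type.
        return "string"
    for name, caster in (("int", int), ("float", float)):
        try:
            for v in present:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    """Return {column: type name}, using the header row for column names."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = rows[0].keys() if rows else []
    return {col: infer_type([row[col] for row in rows]) for col in columns}


# Made-up sample rows, loosely shaped like a crime dataset.
SAMPLE = """ID,Description,Latitude
1,THEFT,41.88
2,BATTERY,41.79
"""

print(infer_schema(SAMPLE))
```

The practical consequence is the same as in the tutorial: because the column names and types are derived from the file, the CREATE TABLE ... AS SELECT statement needs no hand-written DDL.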

Connecting Tableau via ODBC

  • Connect using the Hortonworks Hadoop Hive connector:


  • Run the “Initial SQL” to cache the crimes table (for example, CACHE TABLE crimes):


  • Verify the table is cached in the Thrift Server UI:


  • Select the default schema and drag the crimes table into the tables area


  • Go to the worksheet and start exploring the data using the cached table!
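
The caching step can also be tried outside Tableau from any client that exposes a Python DB-API cursor. The sketch below is an illustration under assumptions rather than the tutorial's own method: the helper name cache_and_summarize is hypothetical, and obtaining the cursor (for instance through an ODBC connection that uses the Hive ODBC driver) is left to the reader. CACHE TABLE is a Spark SQL statement, and the summary query is the same one used earlier to verify the crimes table.

```python
def cache_and_summarize(cursor, table="crimes"):
    """Cache `table` in the Thrift Server, then return (Description, cnt) rows.

    `cursor` is any DB-API 2.0 cursor connected to the Spark Thrift Server.
    """
    # Only allow simple identifiers, since the table name is interpolated.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    # Counterpart of Tableau's "Initial SQL": ask Spark to cache the table.
    cursor.execute(f"CACHE TABLE {table}")
    # Same aggregation used earlier to verify the table.
    cursor.execute(
        f"SELECT Description, COUNT(*) AS cnt FROM {table} "
        "GROUP BY Description ORDER BY cnt DESC"
    )
    return cursor.fetchall()
```

Note that the cache lives in the Thrift Server's Spark application, so it benefits later queries sent to that same server and can be checked in the Thrift Server UI as described above.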


Last update: 08-17-2019 12:43 PM