
After completing this tutorial you will understand how to:

  • leverage Spark to infer a schema on a CSV dataset and persist it to Hive without explicitly writing DDL
  • deploy the Spark Thrift Server on the Hortonworks Sandbox
  • connect an ODBC tool (Tableau) to the Spark Thrift Server via the Hive ODBC driver, leveraging caching for ad-hoc visualization

Assumption 1: You have downloaded and deployed the Hortonworks Sandbox, installed the Hive ODBC driver on your host machine, and installed Tableau (or your preferred ODBC-based reporting tool).

Assumption 2: Please ensure that your host machine's /etc/hosts file has the appropriate entry mapping sandbox.hortonworks.com to the IP of your sandbox (e.g., 172.16.35.171 sandbox.hortonworks.com sandbox).

Deploying the Spark Thrift Server

  • Within Ambari, click on the Hosts tab and then select the sandbox.hortonworks.com node from the list.
  • Click “Add” and choose Spark Thrift Server from the list to deploy a Thrift Server on that host.

  • After installation completes, start the Thrift Server via the service menu. A quick connectivity check is sketched below.

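Before connecting any ODBC clients, it is worth verifying that the Thrift Server is accepting connections. A minimal check from the sandbox shell, assuming the default Spark 1.x Thrift Server port of 10015 (an assumption; confirm the actual port in the Spark service configs in Ambari):

    # Open a beeline session against the Spark Thrift Server.
    # Port 10015 and the "hive" user are assumptions; verify both in Ambari.
    beeline -u "jdbc:hive2://sandbox.hortonworks.com:10015/default" -n hive

If beeline connects and a quick "show tables;" returns without errors, the Thrift Server is ready.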

Loading the Data

The code blocks below are each intended to be executed in their own Zeppelin notebook cells. Each cell begins with a '%' directive indicating the interpreter to be used. Note that the %dep cell must be run before the Spark interpreter is first initialized; if Spark has already started in your notebook, restart the interpreter and run the %dep cell first.

  • Load the CSV reader dependency:
    %dep
    z.load("com.databricks:spark-csv_2.10:1.4.0")
     
  • Read the CSV file and infer the schema (the path assumes the CSV has already been uploaded to HDFS at /tmp/Crime_Data.csv):
    %pyspark
    from pyspark.sql import HiveContext
    sqlContext = HiveContext(sc)
    data = sqlContext.read.load("/tmp/Crime_Data.csv", format="com.databricks.spark.csv", header="true", inferSchema="true")
    data.printSchema()
     
  • Persist the data to Hive (a single-call alternative is sketched after this list):
    %pyspark
    data.registerTempTable("staging")
    sqlContext.sql("CREATE TABLE crimes STORED AS ORC AS SELECT * FROM staging")
     
  • Verify the data is present and can be queried:
    %sql
    select Description, count(*) cnt from crimes
    group by Description order by cnt desc
     
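As an aside, the temp-table-plus-CTAS step above can be collapsed into a single call using the DataFrameWriter API available in Spark 1.4+. A minimal sketch, assuming the same data DataFrame from the earlier cells; the table name crimes_alt is hypothetical, chosen to avoid colliding with the crimes table already created:

    %pyspark
    # Write the inferred DataFrame directly to a Hive table stored as ORC,
    # skipping the intermediate "staging" temp table.
    # "crimes_alt" is a hypothetical name used for illustration only.
    data.write.format("orc").saveAsTable("crimes_alt")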

Connecting Tableau via ODBC

  • Connect using the Hortonworks Hadoop Hive connector, pointing it at sandbox.hortonworks.com and the Spark Thrift Server port (typically 10015 on the Spark 1.x sandbox) rather than the standard Hive Server 2 port.

  • Run the “Initial SQL” to cache the crimes table (a sample statement is shown after this list).

  • Verify the table is cached in the Thrift Server UI: http://sandbox.hortonworks.com:4040/storage/

  • Select the default schema and drag the crimes table into the tables area.

  • Go to the worksheet and start exploring the data using the cached table!

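For reference, the “Initial SQL” mentioned above can be as simple as a single Spark SQL statement, which Tableau issues once when the connection is opened:

    -- Pin the crimes table in the Spark Thrift Server's in-memory cache
    -- so subsequent Tableau queries are served from memory.
    CACHE TABLE crimes;

When you are finished exploring, UNCACHE TABLE crimes releases the cached blocks.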

