Created on 04-23-2016 04:58 AM - edited 08-17-2019 12:43 PM
After completing this tutorial you will understand how to:
- Download a CSV file and load it into HDFS
- Use Spark to infer the file's schema and materialize it as a Hive table stored as ORC
- Query the table with SQL from a Zeppelin notebook
- Connect an ODBC-based reporting tool such as Tableau to the resulting table
Assumption 1: You have downloaded and deployed the Hortonworks sandbox, installed the Hive ODBC driver on your host machine, and installed Tableau (or your preferred ODBC-based reporting tool).
Assumption 2: Your host machine's /etc/hosts file has an entry mapping sandbox.hortonworks.com to the IP of your sandbox (e.g., 172.16.35.171 sandbox.hortonworks.com sandbox).
Each of the code blocks below is intended to be executed in its own Zeppelin notebook cell. Each cell begins with a '%' directive naming the interpreter to use.
%sh
wget https://dl.dropboxusercontent.com/u/3136860/Crime_Data.csv
hdfs dfs -put Crime_Data.csv /tmp
head Crime_Data.csv
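If you want to sanity-check the downloaded file outside the sandbox, the `head` step above can be reproduced in plain Python. This is a hedged sketch: the stand-in file and its column names (Description, Date) are hypothetical, since the real Crime_Data.csv layout is not shown here.

```python
from itertools import islice

# Hypothetical stand-in for the downloaded Crime_Data.csv so the
# sketch is self-contained; the real column names may differ.
path = "Crime_Data.csv"
with open(path, "w") as f:
    f.write("Description,Date\nTHEFT,2015-01-02\nASSAULT,2015-01-03\n")

# Equivalent of `head Crime_Data.csv`: print the first few lines,
# which lets you confirm the header row before loading into HDFS.
with open(path) as f:
    for line in islice(f, 10):
        print(line, end="")
```

The first line printed should be the header row; if it is data instead, set header="false" in the Spark load step.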
%pyspark
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
data = sqlContext.read.load("/tmp/Crime_Data.csv",
                            format="com.databricks.spark.csv",
                            header="true",
                            inferSchema="true")
data.printSchema()
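The inferSchema="true" option asks spark-csv to sample the column values and pick a type for each column instead of defaulting everything to string. The idea can be sketched in a few lines of plain Python; infer_type below is a hypothetical helper, not part of any Spark API, and it only distinguishes integer, double, and string.

```python
def infer_type(values):
    """Crude analogue of spark-csv's inferSchema: return the
    narrowest type name that parses every sample value, falling
    back to string when nothing numeric fits."""
    for cast, name in ((int, "integer"), (float, "double")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"

print(infer_type(["1", "2", "3"]))       # integer
print(infer_type(["1.5", "2"]))          # double
print(infer_type(["THEFT", "ASSAULT"]))  # string
```

This is why printSchema() is worth running right after the load: it shows you which types the inference actually chose, so surprises (e.g., an ID column read as integer) surface before the data lands in Hive.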
%pyspark
data.registerTempTable("staging")
sqlContext.sql("CREATE TABLE crimes STORED AS ORC AS SELECT * FROM staging")
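The pattern here is staging-then-materialize: registerTempTable exposes the DataFrame to SQL only for the current session, and the CREATE TABLE ... AS SELECT persists it as a permanent ORC-backed Hive table. The same pattern can be sketched with stdlib sqlite3 (sqlite has no ORC, so the STORED AS clause has no equivalent here; table names and rows are illustrative only).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# TEMP TABLE plays the role of registerTempTable("staging"):
# visible only to this connection/session.
conn.execute("CREATE TEMP TABLE staging (Description TEXT)")
conn.executemany("INSERT INTO staging VALUES (?)",
                 [("THEFT",), ("ASSAULT",), ("THEFT",)])

# CTAS materializes the staged rows into a permanent table,
# as the Hive cell does with 'CREATE TABLE crimes STORED AS ORC AS ...'.
conn.execute("CREATE TABLE crimes AS SELECT * FROM staging")

n = conn.execute("SELECT COUNT(*) FROM crimes").fetchone()[0]
print(n)  # 3
```

Once the CTAS finishes, the crimes table exists independently of the Zeppelin session, which is what allows Tableau to reach it later over ODBC.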
%sql
SELECT Description, COUNT(*) AS cnt
FROM crimes
GROUP BY Description
ORDER BY cnt DESC
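What this query computes is a simple group-and-count, ordered by frequency. For readers newer to SQL, the same aggregation can be sketched in plain Python with collections.Counter; the crime descriptions below are made up for illustration.

```python
from collections import Counter

# Hypothetical Description values; the %sql cell does the same
# group-by-count over the crimes table, most frequent first.
descriptions = ["THEFT", "ASSAULT", "THEFT", "BURGLARY", "THEFT"]
counts = Counter(descriptions).most_common()
print(counts)  # [('THEFT', 3), ('ASSAULT', 1), ('BURGLARY', 1)]
```

Running the %sql cell in Zeppelin also gives you the built-in bar and pie chart views of the result, which is a quick way to preview what the same query will look like once Tableau connects over ODBC.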