This article explains how to set up the Hive Warehouse Connector (HWC) in CDP Public Cloud (tested with CDP Public Cloud Datahub runtime 7.1).
Note: you must have admin privileges on your Datahub cluster to do this configuration.
Step 1: Whitelist the path to the HWC jar
- In your CDP Datahub, open Cloudera Manager (CM) via the link in the management console.
- In CM, go to your cluster > Livy > Configuration, and search for livy-conf/livy.conf.
- Add the following safety valve to the configuration file:
livy.file.local-dir-whitelist=/path_for_hwc/
In our example, we are using the /tmp/ folder (see the concrete entry after these steps).
- Restart the Livy service via CM to propagate the configuration.
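Concretely, with the /tmp/ folder used in this example, the safety valve entry reads:
livy.file.local-dir-whitelist=/tmp/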
Step 2: Copy the HWC jar to the whitelisted location
- Find the hostname of the node where Livy is installed (master3 here).
- Connect to the node using your workload user and password, e.g.:
ssh pvidal@viz-data-engineering-master3.viz-cdp.a465-9q4k.cloudera.site
- Find the HWC jar:
[pvidal@viz-data-engineering-master3 /]$ find / -name "*hive-warehouse-connector*" 2>/dev/null
/opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
/opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/jars/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
- Copy it and add the right permissions:
cp /opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/jars/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar /tmp
chmod a+rw /tmp/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
Step 3: Add jar path to Zeppelin Livy interpreter
- From your management console, open Zeppelin.
- Go to the top right, and open the interpreter settings.
- Edit the livy interpreter and add the following properties:
- HWC Jar Location
Name: livy.spark.jars
Value: file:///[LOCATION_OF_YOUR_HWC_JAR]
- Hive JDBC URL
Name: livy.spark.sql.hive.hiveserver2.jdbc.url
Value: [JDBC_URL_FROM_MANAGEMENT_CONSOLE];user=[your_user];password=[your_password]
You can find the JDBC URL in your Datahub management console.
- Read via LLAP
Name: livy.spark.datasource.hive.warehouse.read.via.llap
Value: false
- JDBC Mode
Name: livy.spark.datasource.hive.warehouse.read.jdbc.mode
Value: client
- Staging Dir
Name: livy.spark.datasource.hive.warehouse.load.staging.dir
Value: /tmp
- Metastore URI
Name: livy.spark.datasource.hive.warehouse.metastoreUri
Value: [VALUE_FROM_HIVE_SITE_XML]
You can download hive-site.xml from CM by going to Your Cluster > Hive > Download Client Configuration.
- Save your configuration, and restart your interpreter.
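Before writing any real code, you can sanity-check the configuration from a Zeppelin note. Here is a minimal sketch, assuming the note is bound to the Livy interpreter (the paragraph binding, %livy.spark below, may be named differently in your Zeppelin instance):
%livy.spark
import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session from the livy.spark.* properties configured above
val hive = HiveWarehouseSession.session(spark).build()

// If the jar path and JDBC URL are correct, this lists the databases your user can see
hive.showDatabases().show()
If this paragraph fails, re-check the whitelisted jar path from Steps 1 and 2 and the JDBC URL credentials.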
Step 4: Code away
Here is a simple example of Spark reading from a CSV and writing to a Hive table using HWC:
Read raw location data
val locationDf = spark.read
  .options(Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true"))
  .csv("s3a://viz-cdp-bucket/raw/locations.csv")
locationDf.printSchema()
Set up the HWC session
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
Create database and save dataset to table
hive.executeUpdate("CREATE DATABASE worldwidebank")
hive.setDatabase("worldwidebank")
locationDf.write
  .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
  .option("table", "locations")
  .save()
Query data
val ds = hive.sql("select * from locations limit 10")
ds.show()
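Once the table exists, the same session can be reused for further writes and reads. A minimal sketch: append is a standard Spark SaveMode, and the country column used in the filter is a hypothetical field for illustration, not part of the original dataset description:
// Append more rows to the existing table (assumes the same schema as above)
locationDf.write
  .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
  .mode("append")
  .option("table", "locations")
  .save()

// Query through HWC again; "country" is a hypothetical column used for illustration
val usLocations = hive.sql("SELECT * FROM locations WHERE country = 'US'")
usLocations.show()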