This article explains how to set up the Hive Warehouse Connector (HWC) in CDP Public Cloud (tested with CDP Public Cloud Datahub runtime 7.1).
Note: you must have admin privileges on your Datahub cluster to perform this configuration.
First, whitelist a local directory from which Livy is allowed to load the HWC jar, by adding the following property to your Livy configuration:
livy.file.local-dir-whitelist=/path_for_hwc/
In our example, we use the /tmp/ folder (that is, livy.file.local-dir-whitelist=/tmp/).
Next, SSH to a master node of your Datahub cluster to locate the HWC jar:
ssh pvidal@viz-data-engineering-master3.viz-cdp.a465-9q4k.cloudera.site
[pvidal@viz-data-engineering-master3 /]$ find / -name "*hive-warehouse-connector*" 2>/dev/null
/opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
/opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/jars/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
Copy the jar to the whitelisted folder and make it readable:
cp /opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/jars/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar /tmp
chmod a+rw /tmp/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
Then, add the following properties to your Livy interpreter settings (for example, in Zeppelin):
HWC jar
Name: livy.spark.jars
Value: file:///[LOCATION_OF_YOUR_HWC_JAR]
Hive JDBC URL
Name: livy.spark.sql.hive.hiveserver2.jdbc.url
Value: [JDBC_URL_FROM_MANAGEMENT_CONSOLE];user=[your_user];password=[your_password]
Read via LLAP
Name: livy.spark.datasource.hive.warehouse.read.via.llap
Value: false
JDBC mode
Name: livy.spark.datasource.hive.warehouse.read.jdbc.mode
Value: client
Staging Dir
Name: livy.spark.datasource.hive.warehouse.load.staging.dir
Value: /tmp
Metastore URI
Name: livy.spark.datasource.hive.warehouse.metastoreUri
Value: [VALUE_FROM_HIVE_SITE_XML]
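Once the Livy interpreter is restarted, each livy.spark.* property is passed to the Spark session as the corresponding spark.* setting. As a quick sanity check (a minimal sketch, assuming you run it from a notebook paragraph backed by this Livy interpreter), you can read the properties back:
// Minimal sketch: verify that the HWC settings reached the Spark session.
// These keys are the spark.* counterparts of the livy.spark.* properties above.
spark.conf.get("spark.sql.hive.hiveserver2.jdbc.url")
spark.conf.get("spark.datasource.hive.warehouse.metastoreUri")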
Here is a simple example of Spark reading from a CSV and writing to a Hive table using HWC:
// Read the raw location data
val locationDf = spark.read.options(Map("inferSchema"->"true","delimiter"->",","header"->"true")).csv("s3a://viz-cdp-bucket/raw/locations.csv")
locationDf.printSchema()
// Set up the HWC session
import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
// Create the database and save the dataset to a table
hive.executeUpdate("CREATE DATABASE worldwidebank")
hive.setDatabase("worldwidebank")
locationDf.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").option("table", "locations").save()
// Query the data
val ds = hive.sql("select * from locations limit 10")
ds.show()
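To double-check the write, you can list the tables and read the new table back through the same HWC session (a minimal sketch; showTables() and table() are part of the HWC session API):
// Minimal sketch: confirm the table exists and read it back as a DataFrame
hive.showTables().show()
val written = hive.table("locations")
println(s"locations row count: ${written.count()}")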