
This article explains how to set up Hive Warehouse Connector (HWC) in CDP Public Cloud (tested with CDP Public Cloud Datahub runtime 7.1).

Note: you must have admin privileges on your Datahub cluster to do this configuration.

Step 1: Whitelist the path to the HWC jar

  1. In your CDP Datahub, open Cloudera Manager (CM) via the link in the management console.
  2. In CM, go to your cluster > Livy > Configuration, and search for livy-conf/livy.conf.
  3. Add the following safety valve to the configuration file:
    livy.file.local-dir-whitelist=/path_for_hwc/
    In our example, we are using the /tmp/ folder.
  4. Restart the Livy service via CM to propagate the configuration.
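
Once Livy has restarted, it can be handy to confirm that the directory you plan to use is really covered by the whitelist. The following is a minimal sketch, not part of the original setup: the helper names livy_whitelist and is_whitelisted are hypothetical, the path to livy.conf is something you pass in yourself, and the script assumes the property holds a comma-separated list of directories on a single line.

```shell
#!/bin/sh
# Hypothetical helpers for checking a copy of livy.conf.

# Print the value of livy.file.local-dir-whitelist found in the file $1.
# If the key appears several times, the last occurrence is printed.
livy_whitelist() {
    sed -n 's/^[[:space:]]*livy\.file\.local-dir-whitelist[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
}

# Return 0 if the path $2 sits under one of the whitelisted directories in $1.
# Assumption: directories are separated by commas.
is_whitelisted() {
    dirs=$(livy_whitelist "$1")
    old_ifs=$IFS
    IFS=,
    for dir in $dirs; do
        case "$2" in
            "${dir%/}"/*) IFS=$old_ifs; return 0 ;;
        esac
    done
    IFS=$old_ifs
    return 1
}
```

For example, is_whitelisted /path/to/your/livy.conf /tmp/hwc.jar succeeds when the file contains livy.file.local-dir-whitelist=/tmp/.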

Step 2: Copy the HWC jar to the whitelisted location

  1. Find the hostname of the node where Livy is installed (master3 here).
  2. Connect to the node over SSH with your workload user name and password, e.g.:
    ssh pvidal@viz-data-engineering-master3.viz-cdp.a465-9q4k.cloudera.site
  3. Find the HWC jar (quote the pattern so that the shell does not expand it before find runs):
    [pvidal@viz-data-engineering-master3 /]$ find / -name "*hive-warehouse-connector*" 2>/dev/null
    /opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
    /opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/jars/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
  4. Copy it to the whitelisted directory and make it readable and writable by all users:
    cp /opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/jars/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar /tmp
    chmod a+rw /tmp/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
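
The find, cp, and chmod steps above can be combined into one small function, which helps when the parcel version in the path differs from the one shown here. This is only a sketch: stage_hwc_jar is a hypothetical name, the file name pattern is inferred from the jar listed above, and if several jars match, the function simply takes the first one in sorted order.

```shell
#!/bin/sh
# Hypothetical helper: locate the HWC assembly jar under $1 and copy it to $2.
# Prints the path of the staged copy on success.
stage_hwc_jar() {
    search_root=$1
    dest_dir=$2

    # Pattern assumed from the jar name shown in this article.
    jar=$(find "$search_root" -name 'hive-warehouse-connector-assembly-*.jar' 2>/dev/null | sort | head -n 1)
    if [ -z "$jar" ]; then
        echo "no HWC jar found under $search_root" >&2
        return 1
    fi

    cp "$jar" "$dest_dir"/ || return 1
    staged="$dest_dir/$(basename "$jar")"

    # Same permissions as the manual chmod step above.
    chmod a+rw "$staged" || return 1
    echo "$staged"
}
```

With the layout from this article, the call would be stage_hwc_jar /opt/cloudera/parcels /tmp.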

Step 3: Add jar path to Zeppelin Livy interpreter

  1. From your management console, open Zeppelin.
  2. Go to the top right, and configure your Interpreters.
  3. Edit the livy interpreter and add the following properties:
    HWC Jar Location
    Name: livy.spark.jars
    Value: file:///[LOCATION_OF_YOUR_HWC_JAR]
    Hive JDBC URL
    Name: livy.spark.sql.hive.hiveserver2.jdbc.url
    Value: [JDBC_URL_FROM_MANAGEMENT_CONSOLE];user=[your_user];password=[your_password]
    Read via LLAP
    Name: livy.spark.datasource.hive.warehouse.read.via.llap
    Value: false
    JDBC mode
    Name: livy.spark.datasource.hive.warehouse.read.jdbc.mode
    Value: client
    Staging Dir
    Name: livy.spark.datasource.hive.warehouse.load.staging.dir
    Value: /tmp
    Metastore URI
    Name: livy.spark.datasource.hive.warehouse.metastoreUri
    Value: [VALUE_FROM_HIVE_SITE_XML]
  4. You can find the JDBC URL in your Datahub management console.
  5. The metastore URI is the value of the hive.metastore.uris property in hive-site.xml, which you can download from CM by going to Your Cluster > Hive > Download Client Configuration.
  6. Save your configuration, and restart your interpreter.
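
If you would rather not search hive-site.xml by hand, the metastore URI can be pulled out with a short awk filter. This is a rough sketch under a few assumptions: metastore_uri is a hypothetical helper name, the file uses the usual Hadoop layout of name and value elements inside each property, and the value sits on a single line. A proper XML parser would be more robust.

```shell
#!/bin/sh
# Hypothetical helper: print the value of hive.metastore.uris from the
# hive-site.xml file given as $1.
metastore_uri() {
    awk '
        # Remember that the property we want has been seen.
        /<name>hive\.metastore\.uris<\/name>/ { found = 1 }
        # The first <value> element at or after that point holds the URI.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}
```

Running metastore_uri hive-site.xml prints the string to paste into livy.spark.datasource.hive.warehouse.metastoreUri.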

Step 4: Code away

Here is a simple example of Spark reading from a CSV and writing to a Hive table using HWC:

Read raw location data

val locationDf = spark.read.options(Map("inferSchema"->"true","delimiter"->",","header"->"true")).csv("s3a://viz-cdp-bucket/raw/locations.csv")
locationDf.printSchema()

Set up HWC session

import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()

Create database and save dataset to table

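// Create the database only if it is missing, so the paragraph can be re-run
hive.executeUpdate("CREATE DATABASE IF NOT EXISTS worldwidebank")
hive.setDatabase("worldwidebank")
// Write the DataFrame to the Hive table "locations" through HWC
locationDf.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").option("table", "locations").save()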

Query data

val ds = hive.sql("select * from locations limit 10")
ds.show()