This article explains how to set up the Hive Warehouse Connector (HWC) in CDP Public Cloud (tested with CDP Public Cloud Datahub Runtime 7.1).

Note: you must have admin privileges on your Datahub cluster to perform this configuration.

Step 1: Whitelist the path to the HWC jar

  1. In your CDP Datahub, open Cloudera Manager (CM) via the link in the management console.
  2. In CM, go to your cluster > Livy > Configuration, and search for livy-conf/livy.conf.
  3. Add the following safety valve to the configuration file:
    livy.file.local-dir-whitelist=/path_for_hwc/
    In our example, we are using the /tmp/ folder (the resulting entry is shown after this list).
  4. Restart the Livy service via CM to propagate the configuration.
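
With /tmp whitelisted, the safety valve entry is simply the following line (a concrete instance of the template above; substitute your own folder if you chose a different path):

    livy.file.local-dir-whitelist=/tmp/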

Step 2: Copy the HWC jar to the whitelisted location

  1. Find the hostname of the node where Livy is installed (master3 in this example).
  2. Connect to the node using your workload user name and password, e.g.:
    ssh pvidal@viz-data-engineering-master3.viz-cdp.a465-9q4k.cloudera.site
  3. Find the HWC jar:
    [pvidal@viz-data-engineering-master3 /]$ find / -name "*hive-warehouse-connector*" 2>/dev/null
    /opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
    /opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/jars/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
  4. Copy it and add the right permissions:
    cp /opt/cloudera/parcels/CDH-7.1.0-1.cdh7.1.0.p0.1922354/jars/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar /tmp
    chmod a+rw /tmp/hive-warehouse-connector-assembly-1.0.0.7.1.0.0-714.jar
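
Optionally, verify that the jar landed in the whitelisted folder and is world-readable (a plain shell check, matching the /tmp example above):

    ls -l /tmp/hive-warehouse-connector-assembly-*.jar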

Step 3: Add the jar path to the Zeppelin Livy interpreter

  1. From your management console, open Zeppelin.
  2. Go to the top right, and configure your interpreters.
  3. Edit the livy interpreter and add the following properties:
    HWC Jar Location
    Name: livy.spark.jars
    Value: file:///[LOCATION_OF_YOUR_HWC_JAR]
    Hive JDBC URL
    Name: livy.spark.sql.hive.hiveserver2.jdbc.url
    Value: [JDBC_URL_FROM_MANAGEMENT_CONSOLE];user=[your_user];password=[your_password]
    Read via LLAP
    Name: livy.spark.datasource.hive.warehouse.read.via.llap
    Value: false
    JDBC mode
    Name: livy.spark.datasource.hive.warehouse.read.jdbc.mode
    Value: client
    Staging Dir
    Name: livy.spark.datasource.hive.warehouse.load.staging.dir
    Value: /tmp
    Metastore URI
    Name: livy.spark.datasource.hive.warehouse.metastoreUri
    Value: [VALUE_FROM_HIVE_SITE_XML]
  4. You can find the JDBC URL in your Datahub management console, and the metastore URI in hive-site.xml, which you can download from CM by going to Your Cluster > Hive > Download Client Configuration.
  5. Save your configuration, and restart your interpreter.
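
Before writing any real code, you can smoke-test the interpreter from a Zeppelin note. This is a minimal sketch, assuming the livy Scala interpreter is bound with the standard %livy.spark prefix (adjust to your binding):

%livy.spark
// Build an HWC session; this only works if livy.spark.jars points at the HWC jar
import com.hortonworks.hwc.HiveWarehouseSession
val hive = HiveWarehouseSession.session(spark).build()
// If the JDBC URL and metastore URI are correct, this lists your Hive databases
hive.showDatabases().show()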

Step 4: Code away

Here is a simple example of Spark reading from a CSV and writing to a Hive table using HWC:

Read raw location data

val locationDf = spark.read
  .options(Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true"))
  .csv("s3a://viz-cdp-bucket/raw/locations.csv")
locationDf.printSchema()

Set up the HWC session

import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()

Create database and save dataset to table

hive.executeUpdate("CREATE DATABASE worldwidebank")
hive.setDatabase("worldwidebank")
// HIVE_WAREHOUSE_CONNECTOR is a String constant provided by the wildcard import above
locationDf.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table", "locations").save()

Query data

val ds = hive.sql("select * from locations limit 10")
ds.show()
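
To confirm the write, the same session also exposes catalog helpers (these calls assume the session and database set in the previous snippets are still active):

// List the tables in the worldwidebank database
hive.showTables().show()
// Describe the schema of the new locations table
hive.describeTable("locations").show(false)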