Community Articles

Find and share helpful community-sourced technical articles.
Announcements
Celebrating as our community reaches 100,000 members! Thank you!
avatar
Explorer

This article explains how to setup Hive Warehouse Connector (HWC), in CDP Public Cloud CML (tested with CDP Public Cloud runtime 7.1).

Step 1 - Start a Virtual Warehouse

  1. Navigate to Data Warehouses in CDP.
  2. If you haven't activated your environment, ensure you have activated it first.
  3. Once activated, provision a new Hive Virtual Warehouse:Screen Shot 2020-05-19 at 12.40.27 PM.png

Step 2 - Find your HiveServer2 JDBC URL

  1. Once your Virtual Warehouse has been created, find it in the list of Virtual Warehouses.
  2. Click the Options icon (three vertical dots) for your Virtual Warehouse.
  3. Select the Copy JDBC URL option and note what is copied to the clipboard:Screen Shot 2020-05-19 at 12.41.40 PM.png

Step 3 - Find your Hive Metastore URI

  1. Navigate back to your environment Overview page and select the Data Lake tab.
  2. Click on the CM-UI Service:Screen Shot 2020-05-19 at 12.46.49 PM.png
  3. Click on the Options icon (three vertical dots) for the Hive Metastore Service and click Configuration:
    Screen Shot 2020-05-19 at 12.48.19 PM.png
  4. Click the Actions dropdown menu and select the Download Client Configuration item:Screen Shot 2020-05-19 at 12.49.46 PM.png
  5. Extract the downloaded zip file, open the hive-site.xml file, and find the value for the hive.metastore.uris configuration. Make a note this value.

Step 4 - Find the Hive Warehouse Connector Jar

A recent update has made this step unnecessary.

  1. Navigate to your CML Workspace, select your Project, and launch a Python Workbench session.
  2. In either the terminal or Workbench, find the hive warehouse connector jar (HWC):

 

 

 

find / -name *hive-warehouse-connector*

 

 

 

*Note: CML no longer allows "access" to /dev/null - so redirecting errors to that location no longer works. The above command will contain a lot of "Permission Denied" output, but the jar you're looking for should be somewhere mixed in - likely in /usr/lib.

Step 5 - Configure your Spark Session

The Overview Page for the Hive Warehouse Connector provides details and current limitations. The Configuration Page details the two modes described below. Note: fine-grained Ranger access controls are bypassed in the High-Performance Read Mode (i.e. LLAP / Cluster Mode).

JDBC / Client Mode

 

 

from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession
spark = SparkSession\
    .builder\
    .appName("PythonSQL-Client")\
    .master("local[*]")\
    .config("spark.yarn.access.hadoopFileSystems","s3a:///[STORAGE_LOCATION]")\
    .config("spark.hadoop.yarn.resourcemanager.principal", "[Your_User]")\
    .config("spark.sql.hive.hiveserver2.jdbc.url", "[VIRTUAL_WAREHOUSE_HS2_JDBC_URL];user=[Your_User];password=[Your_Workload_Password]")\
    .config("spark.datasource.hive.warehouse.read.via.llap", "false")\
    .config("spark.datasource.hive.warehouse.read.jdbc.mode", "client")\
    .config("spark.datasource.hive.warehouse.metastoreUri", "[Hive_Metastore_Uris]")\
    .config("spark.datasource.hive.warehouse.load.staging.dir", "/tmp")\
//No longer necessary    .config("spark.jars", "[HWC_Jar_Location]")\
    .getOrCreate()

hive = HiveWarehouseSession.session(spark).build()

 

 

 

LLAP / Cluster Mode

Note: LLAP / Cluster Mode doesn't require the HiveWarehouseSession, though you are free to use it for consistency between the modes. 

 

 

from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession
spark = SparkSession\
    .builder\
    .appName("PythonSQL-Cluster")\
    .master("local[*]")\
    .config("spark.yarn.access.hadoopFileSystems","s3a:///[STORAGE_LOCATION]")\
    .config("spark.hadoop.yarn.resourcemanager.principal", "[Your_User]")\
    .config("spark.sql.hive.hiveserver2.jdbc.url", "[VIRTUAL_WAREHOUSE_HS2_JDBC_URL];user=[Your_User];password=[Your_Workload_Password]")\
    .config("spark.datasource.hive.warehouse.read.via.llap", "true")\
    .config("spark.datasource.hive.warehouse.read.jdbc.mode", "cluster")\
    .config("spark.datasource.hive.warehouse.metastoreUri", "[Hive_Metastore_Uris]")\
    .config("spark.datasource.hive.warehouse.load.staging.dir", "/tmp")\
//No longer necessary    .config("spark.jars", "[HWC_Jar_Location]")\
    .config("spark.sql.hive.hwc.execution.mode", "spark")\
    .config("spark.sql.extensions", "com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension")\
    .getOrCreate()

hive = HiveWarehouseSession.session(spark).build()

 

 

 

Step 6 - SQL All the Things

Add your Spark SQL...

JDBC / Client Mode

 

 

 

 

 

 

from pyspark.sql.types import *

#This table has column masking and row level filters in Ranger. The below query, using the HWC, has the policies applied.
hive.sql("select * from masking.customers").show()
#This query, using plain spark sql, will not have the column masking or row level filter policies applied.
spark.sql("select * from masking.customers").show()

 

 

 

 

 

 

LLAP / Cluster Mode

 

 

 

 

 

 

from pyspark.sql.types import *

#This table has column masking and row level filters in Ranger. Neither are applied in the below due to LLAP/Cluster Mode High Performance Reads
hive.sql("select * from masking.customers").show()
spark.sql("select * from masking.customers").show()

 

 

 

 

 

 

2,787 Views