This article explains how to set up the Hive Warehouse Connector (HWC) in CDP Public Cloud CML (tested with CDP Public Cloud Runtime 7.1).
Step 1 - Start a Virtual Warehouse
Navigate to Data Warehouses in CDP.
If you haven't already activated your environment, do so first.
Once it is activated, provision a new Hive Virtual Warehouse.
Step 2 - Find your HiveServer2 JDBC URL
Once your Virtual Warehouse has been created, find it in the list of Virtual Warehouses.
Click the Options icon (three vertical dots) for your Virtual Warehouse.
Select the Copy JDBC URL option; the URL is copied to your clipboard.
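For reference, the copied HiveServer2 JDBC URL for a CDW Hive Virtual Warehouse generally has a shape like the following (the hostname here is a made-up placeholder):

jdbc:hive2://hs2-my-warehouse.env-abc123.dw.example.site/default;transportMode=http;httpPath=cliservice;ssl=true;retries=3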
Step 3 - Find your Hive Metastore URI
Navigate back to your environment Overview page and select the Data Lake tab.
Click the CM-UI service link.
Click the Options icon (three vertical dots) for the Hive Metastore service and select Configuration.
Click the Actions dropdown menu and select the Download Client Configuration item.
Extract the downloaded zip file, open the hive-site.xml file, and find the value of the hive.metastore.uris property. Make a note of this value.
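If you'd rather grab the value programmatically, a short Python snippet along these lines works against the extracted file (the path is a placeholder):

import xml.etree.ElementTree as ET

# Placeholder path: point this at the hive-site.xml you just extracted.
root = ET.parse("hive-site.xml").getroot()
for prop in root.findall("property"):
    if prop.findtext("name") == "hive.metastore.uris":
        print(prop.findtext("value"))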
Step 4 - Find the Hive Warehouse Connector Jar
A recent update has made this step unnecessary.
Navigate to your CML Workspace, select your Project, and launch a Python Workbench session.
In either the terminal or the Workbench, find the Hive Warehouse Connector (HWC) jar:
find / -name "*hive-warehouse-connector*"
Note: CML no longer allows access to /dev/null, so redirecting errors there (e.g. 2>/dev/null) no longer works. The command above will therefore produce a lot of "Permission denied" output, but the jar you're looking for should be mixed in somewhere, most likely under /usr/lib.
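As an alternative that sidesteps the permission noise, you can search from within a Python session instead; starting at /usr/lib is an assumption based on the note above:

import glob

# glob silently skips directories it cannot read, avoiding the
# "Permission denied" noise from find.
print(glob.glob("/usr/lib/**/*hive-warehouse-connector*", recursive=True))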
Step 5 - Configure your Spark Session
The Hive Warehouse Connector Overview Page provides details and current limitations, and the Configuration Page describes the two modes shown below. Note: fine-grained Ranger access controls (column masking and row-level filters) are bypassed in High-Performance Read Mode (i.e. LLAP / Cluster Mode).
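A minimal session sketch follows, assuming the property names used by HWC releases of this era (verify them against the Configuration Page); the JDBC URL, metastore URI, and jar path are placeholders for the values you collected in Steps 2-4:

from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

# All three values are placeholders - substitute what you collected in
# Steps 2, 3, and 4 respectively.
jdbc_url = "jdbc:hive2://<virtual-warehouse-host>/default;transportMode=http;httpPath=cliservice;ssl=true"
metastore_uri = "thrift://<metastore-host>:9083"
hwc_jar = "/usr/lib/hwc/hive-warehouse-connector-assembly.jar"

spark = SparkSession.builder \
    .appName("hwc-example") \
    .config("spark.jars", hwc_jar) \
    .config("spark.sql.hive.hiveserver2.jdbc.url", jdbc_url) \
    .config("spark.datasource.hive.warehouse.metastoreUri", metastore_uri) \
    .getOrCreate()

# Build the HWC session; queries issued through `hive` go through the connector.
hive = HiveWarehouseSession.session(spark).build()

Depending on your HWC version, you may also need to ship the pyspark_llap zip via spark.submit.pyFiles and set the read-mode and authentication properties listed on the Configuration Page.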
JDBC Mode
from pyspark.sql.types import *
# This table has column masking and row-level filter policies in Ranger.
# The query below, issued through the HWC session, has those policies applied.
hive.sql("select * from masking.customers").show()
# The same query through plain Spark SQL has neither the column masking nor
# the row-level filter policies applied.
spark.sql("select * from masking.customers").show()
LLAP / Cluster Mode
from pyspark.sql.types import *
# This table has column masking and row-level filter policies in Ranger.
# Neither is applied below, because LLAP / Cluster Mode high-performance
# reads bypass fine-grained Ranger policies.
hive.sql("select * from masking.customers").show()
spark.sql("select * from masking.customers").show()