This article explains how to set up the Hive Warehouse Connector (HWC) in CDP Public Cloud CML (tested with CDP Public Cloud Runtime 7.1).
A recent update has made this step unnecessary, but if you still need to locate the HWC jar (for example, to set spark.jars manually), you can search for it:
find / -name "*hive-warehouse-connector*"
Note: CML no longer allows access to /dev/null, so redirecting errors to that location no longer works. The above command will produce a lot of "Permission denied" output, but the jar you're looking for should be mixed in somewhere, most likely under /usr/lib.
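Since redirecting to /dev/null is not an option, one workaround is to fold stderr into stdout and filter the permission errors out with grep (a minimal sketch of the same search; adjust the search root if you already know roughly where the jar lives):

# Same search, but dropping the permission errors instead of redirecting them to /dev/null
find / -name "*hive-warehouse-connector*" 2>&1 | grep -v "Permission denied"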
The Overview Page for the Hive Warehouse Connector provides details and current limitations. The Configuration Page details the two modes described below. Note: fine-grained Ranger access controls are bypassed in the High-Performance Read Mode (i.e. LLAP / Cluster Mode).
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession
spark = SparkSession\
.builder\
.appName("PythonSQL-Client")\
.master("local[*]")\
.config("spark.yarn.access.hadoopFileSystems","s3a:///[STORAGE_LOCATION]")\
.config("spark.hadoop.yarn.resourcemanager.principal", "[Your_User]")\
.config("spark.sql.hive.hiveserver2.jdbc.url", "[VIRTUAL_WAREHOUSE_HS2_JDBC_URL];user=[Your_User];password=[Your_Workload_Password]")\
.config("spark.datasource.hive.warehouse.read.via.llap", "false")\
.config("spark.datasource.hive.warehouse.read.jdbc.mode", "client")\
.config("spark.datasource.hive.warehouse.metastoreUri", "[Hive_Metastore_Uris]")\
.config("spark.datasource.hive.warehouse.load.staging.dir", "/tmp")\
.getOrCreate()
# No longer necessary: .config("spark.jars", "[HWC_Jar_Location]")
hive = HiveWarehouseSession.session(spark).build()
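As a quick sanity check that the client-mode session is wired up correctly, you can list what the connector can see (a minimal sketch; showDatabases() and setDatabase() are part of the HWC session API, and this assumes your user has access to at least one database):

# List databases visible through the Hive Warehouse Connector session
hive.showDatabases().show()
# Optionally point the session at a specific database before querying
hive.setDatabase("default")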
Note: LLAP / Cluster Mode doesn't require the HiveWarehouseSession, though you are free to use it for consistency between the modes.
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession
spark = SparkSession\
.builder\
.appName("PythonSQL-Cluster")\
.master("local[*]")\
.config("spark.yarn.access.hadoopFileSystems","s3a:///[STORAGE_LOCATION]")\
.config("spark.hadoop.yarn.resourcemanager.principal", "[Your_User]")\
.config("spark.sql.hive.hiveserver2.jdbc.url", "[VIRTUAL_WAREHOUSE_HS2_JDBC_URL];user=[Your_User];password=[Your_Workload_Password]")\
.config("spark.datasource.hive.warehouse.read.via.llap", "true")\
.config("spark.datasource.hive.warehouse.read.jdbc.mode", "cluster")\
.config("spark.datasource.hive.warehouse.metastoreUri", "[Hive_Metastore_Uris]")\
.config("spark.datasource.hive.warehouse.load.staging.dir", "/tmp")\
.config("spark.sql.hive.hwc.execution.mode", "spark")\
.config("spark.sql.extensions", "com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension")\
.getOrCreate()
# No longer necessary: .config("spark.jars", "[HWC_Jar_Location]")
hive = HiveWarehouseSession.session(spark).build()
Add your Spark SQL...
from pyspark.sql.types import *
# This table has column masking and row-level filter policies in Ranger. The query below, which goes through the HWC, has those policies applied.
hive.sql("select * from masking.customers").show()
# This query, using plain Spark SQL, will not have the column masking or row-level filter policies applied.
spark.sql("select * from masking.customers").show()
from pyspark.sql.types import *
# This table has column masking and row-level filter policies in Ranger. Neither is applied below: LLAP / Cluster Mode uses high-performance reads that bypass fine-grained access control.
hive.sql("select * from masking.customers").show()
# Plain Spark SQL bypasses these policies in this mode as well.
spark.sql("select * from masking.customers").show()