Cloudera Data Engineering offers integration with the Hive Warehouse Connector (HWC) to access and manage Hive-managed tables from Cloudera Data Engineering Spark. The HWC dependencies are built into the Cloudera Data Engineering product.
Through Cloudera Data Engineering Spark, HWC can be used both to read data from Hive and to write data to Hive.
HWC supports the following read and write modes. Each mode has its own advantages and limitations, so choose a mode carefully based on your use case.
Read modes:
-> DIRECT_READER_V2
-> JDBC_CLUSTER
-> SECURE_ACCESS
Write modes:
-> HIVE_WAREHOUSE_CONNECTOR
-> DATAFRAME_TO_STREAM
-> STREAM_TO_STREAM
Note: Cloudera recommends using the "SECURE_ACCESS" read mode to enforce Fine-Grained Access Control (FGAC) through Ranger authorization in production environments with large workloads.
HWC provides a wide range of functionality to Cloudera Data Engineering through these modes. A connection to HiveServer2 is required to enforce FGAC policies with the "SECURE_ACCESS" and "JDBC_CLUSTER" read modes; the "DIRECT_READER" mode does not require a connection to HiveServer2.
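For the modes that go through the HWC API, the examples later in this section assume a HiveWarehouseSession built from the active SparkSession. The following is a minimal sketch, assuming the pyspark_llap package that ships with HWC (the application name is a placeholder):
## Build a HiveWarehouseSession (placeholder application name) ##
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.appName("hwc-example").getOrCreate()

# The HWC session picks up the Spark configurations supplied with the job
# (JDBC URL, metastore URI, read mode, and so on).
hive = HiveWarehouseSession.session(spark).build()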
Read modes:
-> DIRECT_READER_V2: HWC requires a connection to the Hive Metastore (HMS); a connection to HiveServer2 is not required. Spark reads data directly from the managed table location using a transaction snapshot. Use this mode only if your ETL jobs do not require FGAC.
Limitations:
-> FGAC (Ranger column masking and row filtering) is not enforced, because Spark reads the table files directly; the Spark job therefore needs file-level access to the table data.
Configurations:
spark.datasource.hive.warehouse.metastoreUri=thrift://<HMS-Host>:9083
spark.hadoop.hive.zookeeper.quorum=<Zk-Host>:2181
spark.sql.hive.hiveserver2.jdbc.url=<HS2-Url>
spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@<REALM>
spark.security.credentials.hiveserver2.enabled=true
spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
spark.datasource.hive.warehouse.read.mode=DIRECT_READER_V2
spark.hadoop.secure.access.cache.disable=true
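With these configurations in place, a minimal read sketch for Direct Reader looks like the following (the database and table names are placeholders; in this mode plain Spark SQL reads the ACID table through the configured extensions):
## Read Using DIRECT_READER_V2 (placeholder table name) ##
# Spark reads the managed (ACID) table directly from its storage location
# using an HMS transaction snapshot; no HS2 connection is involved.
df = spark.sql("SELECT * FROM sample_db.sample_managed_table")
df.show()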
-> JDBC_CLUSTER: HWC connects to HiveServer2 (HS2) to get transaction information. This read mode is secured through Ranger authorization and supports FGAC, such as column masking.
Limitations:
-> Not recommended for large workloads; for large result sets and low-latency requirements, use the SECURE_ACCESS mode instead (see the note above).
Configurations:
spark.datasource.hive.warehouse.metastoreUri=thrift://<HMS-Host>:9083
spark.hadoop.hive.zookeeper.quorum=<Zk-Host>:2181
spark.sql.hive.hiveserver2.jdbc.url=<HS2-Url>
spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@<REALM>
spark.security.credentials.hiveserver2.enabled=true
spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
spark.datasource.hive.warehouse.read.mode=JDBC_CLUSTER
spark.hadoop.secure.access.cache.disable=true
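A minimal read sketch for JDBC_CLUSTER, assuming an HWC session built as shown earlier (the database and table names are placeholders):
## Read Using JDBC_CLUSTER (placeholder table name) ##
from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()
# The query goes through HS2, so Ranger FGAC policies such as column masking
# are applied to the rows returned to Spark.
df = hive.sql("SELECT * FROM sample_db.sample_managed_table")
df.show()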
-> SECURE_ACCESS: Allows FGAC column masking and row filtering to secure the data of managed (ACID) or external Hive tables read from Cloudera Data Engineering Spark. It is recommended for large workloads and low-latency requirements. When launched, HWC creates external tables in a configured staging location using CREATE TABLE AS SELECT (CTAS); Ranger policies are applied and FGAC is enforced during the CTAS operation. Users then read the external tables from the secure staging location.
Configurations:
spark.hadoop.hive.metastore.uris=thrift://<HMS-Host>:9083
spark.datasource.hive.warehouse.metastoreUri=thrift://<HMS-Host>:9083
spark.hadoop.hive.zookeeper.quorum=<Zk-Host>:2181
spark.sql.hive.hiveserver2.jdbc.url=<HS2-Url>
spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@<REALM>
spark.security.credentials.hiveserver2.enabled=true
spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
spark.datasource.hive.warehouse.read.mode=SECURE_ACCESS
spark.datasource.hive.warehouse.load.staging.dir=<staging-dir-path>
spark.hadoop.secure.access.cache.disable=true
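In practice these values are usually supplied as job-level Spark configurations in Cloudera Data Engineering. The sketch below sets an illustrative subset programmatically and then reads through the HWC session; all URL and path values are placeholders:
## Read Using SECURE_ACCESS (placeholder configuration values) ##
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = (SparkSession.builder.appName("hwc-secure-access")
         .config("spark.sql.extensions", "com.hortonworks.spark.sql.rule.Extensions")
         .config("spark.datasource.hive.warehouse.read.mode", "SECURE_ACCESS")
         .config("spark.datasource.hive.warehouse.load.staging.dir", "<staging-dir-path>")
         .config("spark.sql.hive.hiveserver2.jdbc.url", "<HS2-Url>")
         .getOrCreate())

hive = HiveWarehouseSession.session(spark).build()
# HWC stages the result as an external table under the staging directory via
# CTAS; Ranger column masking and row filtering are enforced during that step.
df = hive.sql("SELECT * FROM sample_db.sample_managed_table")
df.show()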
The following diagram shows the typical read authorization process:
Write modes:
-> HIVE_WAREHOUSE_CONNECTOR: HWC requires a connection to HiveServer2 (HS2) to perform batch writes from Cloudera Data Engineering Spark to Hive. HWC first writes to an intermediate staging location defined by the spark.datasource.hive.warehouse.load.staging.dir configuration.
Configurations:
spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
spark.datasource.hive.warehouse.load.staging.dir=<staging-dir-path>
spark.datasource.hive.warehouse.read.mode=secure_access
spark.security.credentials.hiveserver2.enabled=true
spark.sql.hive.hiveserver2.jdbc.url=<HS2-Url>
spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@<REALM>
spark.hadoop.secure.access.cache.disable=true
-> DATAFRAME_TO_STREAM: HWC does not rely on HiveServer2 for streaming. Instead, it interacts with HMS for transaction management and writes ORC bucket files directly to the table's location. Make sure that the target Hive table is transactional and pre-created before writing to it.
Configurations:
spark.datasource.hive.warehouse.metastoreUri=thrift://<HMS-Host>:9083
spark.hadoop.hive.zookeeper.quorum=<Zk-Host>:2181
spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
spark.sql.warehouse.dir=<Hive-Warehouse-Dir-S3-Path>
-> STREAM_TO_STREAM: HWC supports streaming data from Spark into Hive tables, enabling real-time data ingestion. However, the target Hive table must be transactional and pre-created before writing to it.
Configurations:
spark.datasource.hive.warehouse.metastoreUri=thrift://<HMS-Host>:9083
spark.hadoop.hive.zookeeper.quorum=<Zk-Host>:2181
spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
spark.sql.warehouse.dir=<Hive-Warehouse-Dir-Path>
spark.sql.streaming.checkpointLocation=<Stream-checkpoint-dir-Path>
Example:
## Write Using HIVE_WAREHOUSE_CONNECTOR ##
from pyspark_llap import HiveWarehouseSession

<DF>.write.format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).mode("append") \
    .option("database", db).option("table", tname).option("fileformat", "orc").save()

## Write Using DATAFRAME_TO_STREAM ##
<DF>.write.format(HiveWarehouseSession().DATAFRAME_TO_STREAM).mode("append") \
    .option("metastoreUri", "<MetaStoreUri>").option("metastoreKrbPrincipal", "<Principal>") \
    .option("database", "<db>").option("table", "<table>").option("fileformat", "orc").save()

## Write Using STREAM_TO_STREAM ##
query = <StreamDF>.writeStream.format(HiveWarehouseSession().STREAM_TO_STREAM).outputMode("append") \
    .option("metastoreUri", "<MetaStoreUri>").option("database", "<db>") \
    .option("table", "<pre-created-table>").trigger(processingTime="1 seconds").start()
query.awaitTermination()
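In the STREAM_TO_STREAM example, <StreamDF> can be any streaming DataFrame. A minimal sketch using Spark's built-in rate source, with columns renamed to match a hypothetical pre-created target table:
## Build a sample <StreamDF> with the rate source (hypothetical target schema) ##
from pyspark.sql.functions import col

# The rate source emits (timestamp, value) rows; rename them to match the
# columns of the pre-created transactional target table.
stream_df = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
             .select(col("value").alias("id"), col("timestamp").alias("event_time")))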
The following diagram shows the typical write authorization process.
Establishing connectivity to HiveServer2
Based on your infrastructure setup, you can establish the connection to HS2 that HWC requires in one of several ways, for example by connecting directly to an HS2 host or by using ZooKeeper-based service discovery.
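For illustration, the value of spark.sql.hive.hiveserver2.jdbc.url typically takes one of the following forms; all hosts and ports are placeholders and depend on your cluster.
Direct connection to a single HS2 host:
spark.sql.hive.hiveserver2.jdbc.url=jdbc:hive2://<HS2-Host>:10000/default
ZooKeeper-based service discovery:
spark.sql.hive.hiveserver2.jdbc.url=jdbc:hive2://<Zk-Host>:2181/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2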