This article lists issues that may arise while using dbt adapters and their resolution steps.
Please reach out to us at innovation-feedback@cloudera.com if you see any issues not listed here.
Recreating a dbt model with the dbt Impala adapter, when the model was originally created using the dbt Hive adapter, produces an error similar to:
MetaException: Cannot change stats state for a transactional table <table-name> without providing the transactional write state for verification (new write ID 1, valid write IDs null; current state {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}; new state {"BASIC_STATS":"true"}
This issue can be resolved by setting the following two properties to false in the Hive service configuration and restarting the service:
hive.stats.autogather=false
hive.txn.stats.enabled=false
For instructions on setting properties for the Hive service, follow the guide here:
Setting Hive Configuration Overrides
[Screenshot: the two properties set to false for the Hive service]
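After the restart, you can optionally confirm that the new values are in effect from a Beeline session (a quick sanity check, not part of the guide above):
set hive.stats.autogather;
set hive.txn.stats.enabled;
Running set with just a property name prints its current value.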
While accessing a Hive-managed (ACID) table from Spark, you may see the error:
org.apache.spark.sql.AnalysisException:
Spark has no access to table `XXX`.`XXXXX`. Clients can access this table only if
they have the following capabilities: CONNECTORREAD,HIVEMANAGEDINSERTREAD,HIVEMANAGEDINSERTWRITE,HIVEMANAGESTATS,HIVECACHEINVALIDATE,CONNECTORWRITE.
This table may be a Hive-managed ACID table, or require some other capability that Spark
currently does not implement;
at org.apache.spark.sql.catalyst.catalog.CatalogUtils$.throwIfNoAccess(ExternalCatalogUtils.scala:280)
at org.apache.spark.sql.hive.HiveTranslationLayerCheck$$anonfun$apply$1.applyOrElse(HiveTranslationLayerStrategies.scala:109)
at org.apache.spark.sql.hive.HiveTranslationLayerCheck$$anonfun$apply$1.applyOrElse(HiveTranslationLayerStrategies.scala:85)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:107)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:106)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsDown(AnalysisHelper.scala:106)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$apply$6.apply(AnalysisHelper.scala:113)
To solve this issue, the following Spark configurations are needed:
Property | Example Value | Description
spark.sql.extensions | com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension | Extension needed for auto-translate to work
spark.kryo.registrator | com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator | For using Kryo serialization
spark.sql.hive.hwc.execution.mode | spark |
spark.driver.extraClassPath | local:/opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.8.0-801.jar | To use HWC (Hive Warehouse Connector) jars
spark.executor.extraClassPath | local:/opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.8.0-801.jar |
The Hive Warehouse Connector (HWC) jar must be supplied along with these configurations to be able to access managed tables. For details, see:
Introduction to HWC | CDP Private Cloud
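For example, outside of dbt the same settings can be passed to a standalone Spark job with spark-submit (a sketch; your_spark_job.py is a placeholder, and the jar version must match the HWC assembly shipped with your CDP parcel):
spark-submit \
  --conf spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension \
  --conf spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator \
  --conf spark.sql.hive.hwc.execution.mode=spark \
  --conf spark.driver.extraClassPath=local:/opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.8.0-801.jar \
  --conf spark.executor.extraClassPath=local:/opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.8.0-801.jar \
  your_spark_job.py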
If a managed table is created in Hive and is accessed with the dbt-spark-livy adapter, it may throw access errors like the one shown below:
12:48:21 Completed with 1 error and 0 warnings:
12:48:21
12:48:21 Runtime Error in model test2 (models/staging/test2.sql)
12:48:21 Database Error
12:48:21 Error while executing query:
12:48:21 Spark has no access to table `certified_data`.`test_hive_streaming2`. Clients can access this table only if
12:48:21 they have the following capabilities: CONNECTORREAD,HIVEFULLACIDREAD,HIVEFULLACIDWRITE,HIVEMANAGESTATS,HIVECACHEINVALIDATE,CONNECTORWRITE.
12:48:21 This table may be a Hive-managed ACID table, or require some other capability that Spark
12:48:21 currently does not implement;
12:48:21
12:48:21 Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
This issue can be resolved by adding the following properties to the Spark configuration:
spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension
spark.sql.hive.hwc.execution.mode=spark
spark.driver.extraClassPath=local:/opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.8.0-801.jar
spark.executor.extraClassPath=local:/opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.8.0-801.jar
Introduction to HWC | CDP Private Cloud
While using the dbt-spark-cde or dbt-spark-livy adapter with a warehouse that uses a self-signed certificate, running dbt debug and other dbt commands may fail with an error like the one shown below:
certificate verify failed: self-signed certificate
To disable SSL certificate verification, set the following additional variable in the adapter's profile (profiles.yml):
verify_ssl_certificate: false
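For example, the variable sits alongside the other connection settings of the target in profiles.yml (a sketch; every field except verify_ssl_certificate is illustrative and should match your existing profile):
my_dbt_project:
  outputs:
    dev:
      type: spark_cde              # illustrative; keep the type of the adapter you are using
      # ... your existing connection settings ...
      verify_ssl_certificate: false
  target: dev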
While running a dbt command, a user may see the error shown below. This error occurs when the user does not have the required permissions (for example, CREATE TABLE) on the target database:
dbt run
11:57:43 Running with dbt=1.1.0
11:57:43 Found 1 model, 0 tests, 0 snapshots, 0 analyses, 208 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
11:57:43
11:59:02 Encountered an error:
Runtime Error
Runtime Error
Database Error
Error while executing query: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [airflow.bigdata] does not have [WRITE] privilege on [[hdfs://dlhdfs/warehouse/tablespace/external/hive/test_dbt.db, hdfs://dlhdfs/warehouse/tablespace/external/hive/test_dbt.db/]]);
Check the dbt logs to understand what permissions are needed for the different tables, and grant those permissions to the user.
What privileges does my database user need to use dbt? | dbt Developer Hub
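For the HDFS permission error above, one way to grant write access is with an HDFS ACL (a sketch using the user and path from the log above, assuming HDFS ACLs are enabled; on Ranger-managed clusters, grant the equivalent privileges through a Ranger policy instead):
hdfs dfs -setfacl -m user:airflow.bigdata:rwx /warehouse/tablespace/external/hive/test_dbt.db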
While running dbt commands using the dbt-spark-cde adapter, we have observed unpredictable latency in completing the execution of dbt commands.
We have done significant work on improving the performance of the CDE adapter and have added instrumentation to understand where the latencies are.
Several factors contribute to latency in CDE. Running any SQL query in CDE involves the steps listed in the breakdown tables below, and each step adds to the overall time.
Taking GitHub - cloudera/dbt-spark-cde-example as an example dbt project, here’s the breakdown of the CDE’s steps for the two most common dbt commands:
The first command executes a single SQL query. Breakdown as per CDE’s steps:
Step | Step name | Time taken (seconds) | Log
1 | Generate a job name | 0 | 07:00:46.778564 Job created with id:
2 | Create a resource | 1 | 07:00:46.779102 Create resources: files
3 | Upload the resource | 0 | 07:00:47.178924 Upload resource: SQL resource:
4 | Submit the job | 1 | 07:00:47.929512 Submit job
5 | Run the job (and wait for the job to change state from started to succeeded/failed) | 62 | 07:00:48.347828 Run job
6 | Fetch the job results from the logs (after waiting for the job logs to be aggregated) | 82 | 07:01:50.545694 Get job output
7 | Clean up the resources | 0 | 07:03:12.060899 Delete job
| Total | 146 |
NOTE: Typically, dbt run executes all the models inside the dbt project. For the sake of simplicity, we have only executed a single model (i.e., /models/staging/covid/stg_covid__cases.sql).
We will consider running dbt run on a single model by running the command dbt run --select /models/staging/covid. This command executes the following SQL statements, as dbt-core tries to build a DAG before it starts to build the model. The first 5 statements are the “fixed cost” for any dbt run. If there are multiple models, there will only be one invocation of the first 5 statements. The latency of the dbt run command is then determined by the number of models and the time taken for each model.
1. show databases
2. show table extended in spark_cde_demo_reference like '*'
3. show table extended in spark_cde_demo_mart_covid like '*'
4. show table extended in spark_cde_demo like '*'
5. show table extended in spark_cde_demo_staging_covid like '*'
6. create or replace view spark_cde_demo_staging_covid.stg_covid__cases
as
select
cast(cast(unix_timestamp(date_rep,"dd/MM/yyyy") as timestamp) AS DATE) as date_rep,
cast(cases as BIGINT) as cases,
cast(deaths as BIGINT) as deaths,
geo_id
from spark_cde_demo.raw_covid_cases
For each statement, the breakdown as per CDE’s steps can be obtained from the dbt logs.
Let’s consider the execution of the last query (the create or replace view statement) in CDE; the breakdown as per CDE’s steps:
Step | Step name | Time taken (seconds) | Log
1 | Generate a job name | 0 | 07:27:39.914554 Job created with id:
2 | Create a resource | 1 | 07:27:39.915082 Create resources: files
3 | Upload the resource | 0 | 07:27:40.213107 Upload resource:
4 | Submit the job | 1 | 07:27:40.913892 Submit job
5 | Run the job (and wait for the job to change state from started to succeeded/failed) | 62 | 07:27:41.223755 Run job
6 | Fetch the job results from the logs (after waiting for the job logs to be aggregated) | 82 | 07:28:43.849548 Get job output
7 | Clean up the resources | 0 | 07:30:05.096386 Delete job
| Total | 146 |
Since each query follows the aforementioned 7 steps, every query in the above example takes CDE a similar amount of time. This means that executing a single dbt model takes, on average:
6 queries * 146 seconds (average time to execute each query) = 876 seconds, or about 14.6 minutes
If the dbt model is complex or if multiple dbt models are involved in a single dbt command this latency increases accordingly.
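As a rough back-of-the-envelope sketch of this arithmetic (a hypothetical Python calculation using the numbers from the tables above):
# Rough dbt-on-CDE latency estimate based on the measurements above.
fixed_queries = 5        # catalog queries dbt issues once per run (show databases / show table extended)
queries_per_model = 1    # one "create or replace view" statement for the simple model above
avg_query_seconds = 146  # average end-to-end time of one CDE query (all 7 steps)
models = 1

total_seconds = (fixed_queries + models * queries_per_model) * avg_query_seconds
print(total_seconds, "seconds =", round(total_seconds / 60, 1), "minutes")  # 876 seconds = 14.6 minutes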
While running dbt commands on YARN, we have seen errors when the commands are not issued from the directory that contains the yarn.env file, as sketched below.
Assuming your dbt project directory path is ~/my-project-dir/sample-dbt-project:
Your working directory for issuing yarn commands should be ~/my-project-dir/
Create your yarn.env file in the location ~/my-project-dir
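A sketch of the expected layout (using the example names above):
~/my-project-dir/
├── yarn.env                # must be in the working directory from which commands are issued
└── sample-dbt-project/     # dbt project directory (contains dbt_project.yml)
Issue the yarn/dbt commands from ~/my-project-dir/ so that yarn.env is picked up.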
dbt usually fails to connect to the warehouse when profiles.yml contains wrong values.
A sample profiles file (profiles.yml) to connect to a Hive warehouse is:
dbt_hive_demo:
  outputs:
    dev:
      auth_type: kerberos
      host: tsenapati-3.tsenapati.root.hwx.site
      port: 10000
      schema: dbt_hive_demo
      threads: 4
      type: hive
      use_http_transport: false
      use_ssl: false
      kerberos_service_name: hive
  target: dev
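Once the profile is in place, connectivity can be checked from the dbt project directory with:
dbt debug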
If the connection fails, try troubleshooting without yarn/dbt involved. Follow the steps below:
1. Install impyla:
pip install impyla
2. Save the following script as connectiontest.py, filling in the host and port of your HiveServer2 instance:
from impala.dbapi import connect

# Replace host and port with your HiveServer2 endpoint.
conn = connect(
    host="host.name",
    port=10000,
    auth_mechanism="GSSAPI",
    use_ssl=False,
    use_http_transport=False,
    kerberos_service_name="hive")
cursor = conn.cursor()
cursor.execute("SHOW DATABASES")
res = cursor.fetchall()
print(res)
cursor.close()
conn.close()
3. Run the script:
python3 connectiontest.py
If the above script succeeds and lists your databases, then your connection parameters are correct. Populate the same values in profiles.yml accordingly.