Created on 11-17-2022 12:00 AM - edited 06-26-2023 04:51 AM
Cloudera has implemented dbt adapters for Hive, Spark, and Impala. In addition to providing the adapters, Cloudera is also offering dbt capabilities to our on-prem customers (CDP PvC-Base). Our solution satisfies our customers' stringent security and privacy requirements while providing an easy-to-use turnkey solution for practitioners.
In this document, we will show how users can run dbt on YARN without having to worry about the underlying setup. We have packaged everything as a single .tar.gz file.
If you are new to dbt, refer to our introductory article to learn more about it.
We have created a dbt deployment package to simplify the setup process for dbt on YARN. This package only needs to be deployed on the cluster gateway machine; there is no need to install anything locally on any of the worker nodes, because the deployment takes care of that as well.
SSH to your gateway machine by running the following command:
ssh <gateway machine>
mkdir </path/to/working/dir>
cd </path/to/working/dir>
wget https://github.com/cloudera/cloudera-dbt-deployment/releases/download/Prerelease/cloudera-dbt-deployment-1.2.0.tar.gz
python3 -m venv <Virtual Environment Name>
source <Virtual Environment Name>/bin/activate
python3 -m pip install cloudera-dbt-deployment-1.2.0.tar.gz
We have packaged dbt core and all the dbt adapters we support into a single deployable package. This package is updated on a regular basis, whenever the underlying Python packages are updated.
Download this package and upload it to HDFS by running the following commands:
wget https://github.com/cloudera/cloudera-dbt-deployment/releases/download/Prerelease/Centos7_x86_x64_dbt...
hdfs dfs -copyFromLocal Centos7_x86_x64_dbt_dependencies.tar.gz hdfs://path/to/dependencies
At this point, we are ready to work with specific dbt projects.
We will use a sample dbt project from this repo. If you would like to start a new dbt project from scratch, refer to Getting started with dbt Core | dbt Developer Hub. Clone the sample project by running the following command:
git clone https://github.com/cloudera/dbt-hive-example.git
Create environment variables for the dbt project in your working directory by running the following command. This file will be used by the yarn_dbt command. We need to create this file in the folder where the dbt project lives.
cd </path/to/dbt-project>
vi yarn.env
Make sure to set the following variables in the file:
DEPENDENCIES_PACKAGE_PATH_HDFS=/tmp
DEPENDENCIES_PACKAGE_NAME=Centos7_x86_x64_dbt_dependencies.tar.gz
YARN_JAR=/opt/cloudera/parcels/CDH/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell.jar
DBT_SERVICE_USER=hive
DBT_PROJECT_NAME=dbt_hive_demo
YARN_RM_URI=http://hajmera-1.vpc.cloudera.com:8088
DBT_HEADLESS_KEYTAB=/cdep/keytabs/systest.keytab
DBT_HEADLESS_PRINCIPAL=systest@VPC.CLOUDERA.COM
CURRENT_DBT_USER=systest
DBT_DOCS_PORT=7777
YARN_CONTAINER_MEMORY=2048
YARN_TIMEOUT=1200000
APPLICATION_TAGS="cia-user-ha-testing.dbt-debug"
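Before launching anything on the cluster, it can save a round trip to sanity-check that yarn.env defines every variable the deployment expects. The following sketch is illustrative and not part of the Cloudera tooling; the parsing helper and the set of required keys simply mirror the file shown above.

```python
# Minimal sketch: load yarn.env into a dict and report any keys from the
# example above that are missing. Illustrative only, not part of yarn_dbt.
from pathlib import Path

REQUIRED_KEYS = {
    "DEPENDENCIES_PACKAGE_PATH_HDFS", "DEPENDENCIES_PACKAGE_NAME",
    "YARN_JAR", "DBT_SERVICE_USER", "DBT_PROJECT_NAME", "YARN_RM_URI",
    "DBT_HEADLESS_KEYTAB", "DBT_HEADLESS_PRINCIPAL", "CURRENT_DBT_USER",
    "DBT_DOCS_PORT", "YARN_CONTAINER_MEMORY", "YARN_TIMEOUT",
}

def load_env(path):
    """Parse KEY=VALUE lines, ignoring blanks and # comments."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env

def missing_keys(env):
    """Return the required keys that the env file does not define."""
    return sorted(REQUIRED_KEYS - env.keys())
```

Running `missing_keys(load_env("yarn.env"))` on a complete file should return an empty list.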
Refer to the table below to understand environment variables:
| Key | Sample values | Notes |
| --- | --- | --- |
| DEPENDENCIES_PACKAGE_PATH_HDFS | /tmp | Path in HDFS containing the tarball with all Python packages needed for an offline install of dbt. Downloaded in the previous section. |
| DEPENDENCIES_PACKAGE_NAME | Centos7_x86_x64_dbt_dependencies.tar.gz | Name of the package in HDFS for the offline install |
| YARN_JAR | /opt/cloudera/parcels/CDH/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell.jar | Distributed Shell JAR path for YARN job execution. This needs to be changed based on the CDP Base version. |
| DBT_SERVICE_USER | hive | Service user with access to YARN resources. This user's keytab is distributed through Cloudera SCM and can be found under /var/run/cloudera-scm-agent/process/ |
| DBT_PROJECT_NAME | dbt_hive_demo | Project name |
| YARN_RM_URI | http://hajmera-1.vpc.cloudera.com:8088 | URI of the YARN ResourceManager |
| DBT_HEADLESS_KEYTAB | /cdep/keytabs/systest.keytab | A headless keytab corresponding to a POSIX user that can start services without prompting for a password (e.g. hdfs, hbase) |
| DBT_HEADLESS_PRINCIPAL | systest@VPC.CLOUDERA.COM | Kerberos principal for the headless keytab above |
| CURRENT_DBT_USER | systest | Logged-in user in the session, with a valid keytab |
| DBT_DOCS_PORT | 7777 | Port where dbt docs are hosted |
| YARN_CONTAINER_MEMORY | 2048 | Memory allocation for the YARN container, in MB |
| YARN_TIMEOUT | 1200000 | Timeout for the YARN container, in milliseconds |
| APPLICATION_TAGS | cia-user-ha-testing.dbt-debug | Prefix/identifier for the YARN application; visible in the YARN UI |
We will be using the Kerberos method to connect to the query engines, so the profiles.yml file should reflect that. Edit profiles.yml to match your warehouse configuration by running the following commands:
cd /path/to/dbt-models
vi profiles.yml
dbt_hive_demo:
  outputs:
    dbt_hive_demo:
      auth_type: kerberos
      host: hajmera-1.vpc.cloudera.com
      port: 10000
      schema: dbt_hive_demo
      threads: 2
      type: hive
      use_http_transport: false
      use_ssl: false
      kerberos_service_name: hive
  target: dbt_hive_demo
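A common mistake when hand-editing a profile is a `target` that does not match any entry under `outputs`. The check below mirrors the sample profile above as a plain dict to illustrate that relationship; the helper is ours, not part of dbt.

```python
# Illustrative check that a dbt profile's "target" names one of its
# "outputs". The dict mirrors the sample profiles.yml above.
profile = {
    "dbt_hive_demo": {
        "outputs": {
            "dbt_hive_demo": {
                "auth_type": "kerberos",
                "host": "hajmera-1.vpc.cloudera.com",
                "port": 10000,
                "schema": "dbt_hive_demo",
                "threads": 2,
                "type": "hive",
                "use_http_transport": False,
                "use_ssl": False,
                "kerberos_service_name": "hive",
            }
        },
        "target": "dbt_hive_demo",
    }
}

def target_output(profile, name):
    """Return the output config the named profile's target refers to.

    Raises KeyError if the target does not name a defined output.
    """
    block = profile[name]
    return block["outputs"][block["target"]]
```

If the lookup raises a KeyError, the `target` key in profiles.yml is pointing at an output that does not exist.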
Provide an authentication token to execute dbt by running the following command:
kinit -kt </path/to/keytab/file> <username>
In the example repo, we have a sample dbt project called ‘dbt_hive_demo’.
Inside this demo project, we can issue dbt commands to run parts of the project. The demo project contains examples of the operations and commands supported with YARN deployment.
More commands and a detailed description of each command can be found here.
Note: yarn_dbt commands need to be run in the same folder where yarn.env is located.
To ensure we’ve configured our profile correctly, test the connection by running the following command:
yarn_dbt debug
Load the reference dataset to the warehouse by running the following command:
yarn_dbt seed
Our seeds are configured with a couple of tests. You can read more about them here.
We also have a custom test created in dbt_hive_demo/tests/generic/test_length.sql. This test is used to check the character length of a column. Our reference data includes ISO Alpha2 and Alpha3 country codes, and we know these columns should always be 2 or 3 characters long, respectively. To ensure that our reference data is of high quality, we can use dbt to test these assumptions and report the results. We expect that the Alpha2 and Alpha3 columns are the correct lengths and that no fields are null. Test the dataset by running the following command:
yarn_dbt test
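The logic of that custom length test can be paraphrased in a few lines of Python. This is only an illustration of what the SQL generic test asserts (non-null values of a fixed character length), not the dbt test itself.

```python
# Paraphrase of the custom length test in dbt_hive_demo/tests/generic/
# test_length.sql: a row fails if the value is null or its character
# length differs from the expected length (2 for ISO Alpha2 codes,
# 3 for Alpha3). Illustrative only, not the actual dbt SQL test.
def failing_rows(values, expected_len):
    """Return the values that would fail the length/not-null checks."""
    return [v for v in values if v is None or len(v) != expected_len]
```

A clean reference dataset yields an empty list for both the Alpha2 (expected_len=2) and Alpha3 (expected_len=3) columns.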
We have 3 sets of models in this demo project.
Execute all the transformations by running the following command:
yarn_dbt run
Generate and host the project documentation by running the following command:
yarn_dbt docs
To find the host serving the docs, check the YARN application logs for the docs port:
yarn logs -applicationId <application-id> | grep 7777
Sample output: hajmera-3.vpc.cloudera.com:7777
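The grep step above can also be automated with a small script that pulls the docs host and port out of captured log text. The log line format here is an assumption for illustration; only the host:port pattern matters.

```python
import re

# Extract "host:port" for the dbt docs server from captured YARN log
# text. The sample log content is invented for illustration; the regex
# just looks for any hostname followed by the configured docs port.
def find_docs_url(log_text, port=7777):
    """Return the first 'host:port' match in the logs, or None."""
    match = re.search(rf"([\w.-]+):{port}\b", log_text)
    return match.group(0) if match else None
```

For example, `find_docs_url("... hajmera-3.vpc.cloudera.com:7777 ...")` returns `"hajmera-3.vpc.cloudera.com:7777"`.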
To debug issues further, you can view the logs in the YARN UI.
dbt models can be scheduled using any workflow management platform like Apache Airflow or Apache Oozie.
In this document, we have shown the different requirements that need to be met to support the full software development life cycle of dbt models. The table below shows how those requirements have been met.
| Requirement | Will this option satisfy the requirement? If yes, how? |
| --- | --- |
| Have a dev setup where different users can work on the dbt project in an isolated way | Yes, each user can log into the gateway machine, having checked out their own branch of the given dbt project codebase. |
| Have a CI/CD pipeline to push committed changes in the git repo to stage/prod environments | Yes, via a simple git-push, delegating to an external CI/CD system |
| See logs in stage/prod of the dbt runs | Yes |
| See dbt docs in stage/prod | Yes |
| Support isolation across different users using it in dev | Yes, each session is isolated and uses the user's own login credentials |
| Be able to upgrade the adapters or core seamlessly | Yes, Cloudera will publish new GitHub assets under GitHub releases. |
| Ability to run dbt regularly to update the models in the warehouse | Yes, via any scheduler like Apache Airflow |
Facing issues while working with dbt? We have a troubleshooting guide to help you resolve them.
You can also reach out to innovation-feedback@cloudera.com if you have any questions.