Cloudera has implemented dbt adapters for Hive, Spark, and Impala. In addition to providing the adapters, Cloudera is also offering dbt capabilities to our on-prem customers (CDP PvC-Base). Our solution satisfies our customers' stringent security and privacy requirements while providing an easy-to-use turnkey solution for practitioners.
In this document, we show how users can run dbt on YARN without installing anything on the worker nodes. We have packaged everything as a .tar.gz file.
If you are new to dbt, refer to the article to learn more about dbt.
We have created a dbt deployment package to simplify the setup process for dbt on YARN. This package only needs to be deployed on the cluster gateway machine. There is no need to deploy packages locally on any of the worker nodes; our deployment takes care of that as well.
SSH to your gateway machine by running the following command:
ssh <gateway machine>
mkdir </path/to/working/dir>
python3 -m venv <Virtual Environment Name>
source <Virtual Environment Name>/bin/activate
python3 -m pip install cloudera-dbt-deployment-1.2.0.tar.gz
We have packaged dbt Core and all the dbt adapters we support into a single deployable package. This package is updated whenever the underlying Python packages are updated.
Download this package and upload it to HDFS:
hdfs dfs -copyFromLocal Centos7_x86_x64_dbt_dependencies.tar.gz hdfs://path/to/dependencies
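The upload can be wrapped in a small helper that also confirms the tarball landed in HDFS. This is only a sketch: the function name `upload_dependencies` is hypothetical, and it assumes the standard Hadoop FS shell options `-mkdir -p`, `-copyFromLocal -f` and `-test -e` are available on the gateway machine.

```shell
# Hypothetical helper (not part of the Cloudera package): upload a local
# tarball to an HDFS directory and confirm that it is there afterwards.
upload_dependencies() {
    dep_tarball="$1"    # e.g. Centos7_x86_x64_dbt_dependencies.tar.gz
    dep_hdfs_dir="$2"   # e.g. hdfs://path/to/dependencies

    if [ ! -f "$dep_tarball" ]; then
        echo "missing local file: $dep_tarball" >&2
        return 1
    fi

    # Create the target directory if needed, then copy, overwriting an older copy.
    hdfs dfs -mkdir -p "$dep_hdfs_dir" || return 1
    hdfs dfs -copyFromLocal -f "$dep_tarball" "$dep_hdfs_dir" || return 1

    # -test -e exits with status 0 only if the path exists in HDFS.
    hdfs dfs -test -e "$dep_hdfs_dir/$(basename "$dep_tarball")"
}
```

For example: upload_dependencies Centos7_x86_x64_dbt_dependencies.tar.gz hdfs://path/to/dependencies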
At this point, we are ready to work with specific dbt projects.
We will use a sample dbt project from this repo. If you would like to start a new dbt project from scratch, refer to Getting started with dbt Core | dbt Developer Hub. Clone the sample project by running the following command:
Create a file with the environment variables for the dbt project in your working directory by running the following command. This file (yarn.env) is used by the yarn_dbt command and must be created in the folder where the dbt project lives.
Make sure to set the following variables in the file:
Refer to the table below to understand environment variables:
Path in HDFS containing the tarball with all Python packages needed by dbt for an offline install (downloaded in the previous section).
Name of the package in HDFS for the offline install.
Distributed Shell JAR path for YARN job execution. This needs to be changed based on the CDP Base version.
Service user with access to YARN resources. This user’s keytab is distributed through Cloudera SCM and can be found under /var/run/cloudera-scm-agent/process/
E.g., we use the keytab for the systest user, found at /cdep/keytabs/systest.keytab
A headless keytab corresponding to a POSIX user that can start services without prompting for a password, e.g. hdfs, hbase, ...
Kerberos principal for the dbt headless keytab above
Logged-in user in the session with a valid keytab
Port where dbt docs are hosted
Memory allocation for the YARN container, in MB
Timeout for the YARN container, in milliseconds
Prefix/identifier for the YARN application, visible in the YARN UI
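Before launching a job, it can help to confirm that every expected variable is actually present in the environment file. The sketch below is an assumption on our part, not part of the deployment package: `check_env_file` is a hypothetical helper, it assumes the file holds plain KEY=value lines (optionally prefixed with `export`), and the variable names are supplied by the caller.

```shell
# Hypothetical helper: report which of the named variables are missing from,
# or have nothing after the "=" in, an env file of KEY=value lines.
check_env_file() {
    env_file="$1"
    shift
    env_missing=0
    for env_name in "$@"; do
        # Match NAME=<at least one character>, allowing an optional "export " prefix.
        if ! grep -Eq "^(export[[:space:]]+)?${env_name}=.+" "$env_file"; then
            echo "not set: $env_name" >&2
            env_missing=1
        fi
    done
    return "$env_missing"
}
```

For example: check_env_file yarn.env <VARIABLE_NAME_1> <VARIABLE_NAME_2>, where the placeholders stand for the variable names required by your deployment. The function returns a non-zero status if any of them is not set.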
We will use Kerberos to connect to the query engines, so the profiles.yml file must be configured accordingly. Edit the profiles.yml file as per your warehouse configuration by running the following command:
Obtain a Kerberos ticket so that dbt can authenticate by running the following command:
kinit -kt </path/to/keytab/file> <username>
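In scripts, it can be convenient to request a new ticket only when the current one is missing or expired. A minimal sketch, assuming MIT Kerberos client tools where `klist -s` runs silently and exits with status 0 only when the credential cache holds valid tickets; the function name `ensure_kerberos_ticket` is hypothetical.

```shell
# Hypothetical helper: reuse a valid ticket if one exists, otherwise obtain
# one from the keytab.
ensure_kerberos_ticket() {
    krb_keytab="$1"      # </path/to/keytab/file>
    krb_principal="$2"   # <username>

    # klist -s prints nothing; its exit status tells us whether the
    # credential cache can be read and has not expired.
    if klist -s; then
        return 0
    fi
    kinit -kt "$krb_keytab" "$krb_principal"
}
```

Note that this sketch does not check which principal an existing ticket belongs to; if several users share a session, run kinit explicitly as shown above.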
In the example repo, we have a sample dbt project called ‘dbt_hive_demo’.
Inside this demo project, we can issue dbt commands to run parts of the project. The demo project contains examples of the following operations and commands supported with YARN deployment:
More commands and a detailed description of each command can be found here.
Note: yarn_dbt commands need to be run in the same folder where yarn.env is located.
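One way to guard against running from the wrong folder is a small wrapper that checks for yarn.env first. This is a sketch under our own assumptions: `run_in_project` is a hypothetical helper, and it simply runs whatever command is passed to it from inside the project folder.

```shell
# Hypothetical helper: run a command from the folder that contains yarn.env.
run_in_project() {
    project_dir="$1"
    shift
    if [ ! -f "$project_dir/yarn.env" ]; then
        echo "yarn.env not found in $project_dir" >&2
        return 1
    fi
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$project_dir" && "$@" )
}
```

For example: run_in_project </path/to/dbt/project> yarn_dbt docs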
To ensure we’ve configured our profile correctly, test the connection by running the following command:
Load the reference dataset to the warehouse by running the following command:
Our seeds are configured with a couple of tests. Users can read more about them here.
We also have a custom test created in dbt_hive_demo/tests/generic/test_length.sql. This test checks the character length of a column. Our reference data includes ISO Alpha2 and Alpha3 country codes, and we know the values in these columns should always be 2 or 3 characters long, respectively. To ensure that our reference data is high quality, we can use dbt to test these assumptions and report the results. We expect the Alpha2 and Alpha3 columns to have the correct lengths and no fields to be null. Test the dataset by running the following command:
We have 3 sets of models in this demo project.
Execute all the transformations by running the following command:
yarn_dbt docs
yarn logs -applicationId <application-id> | grep 7777
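The log search can be parameterized on the port. The sketch below assumes that 7777 in the command above is the configured dbt docs port; `find_docs_lines` is a hypothetical helper name.

```shell
# Hypothetical helper: print the YARN log lines that mention the docs port.
find_docs_lines() {
    application_id="$1"
    docs_port="${2:-7777}"   # defaults to the value used in the command above
    # -w matches the port as a whole word, so 17777 or 77770 do not match 7777.
    yarn logs -applicationId "$application_id" | grep -w -- "$docs_port"
}
```

For example: find_docs_lines <application-id>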
To debug issues further, you can view the logs in the YARN UI.
dbt models can be scheduled using any workflow management platform like Apache Airflow or Apache Oozie.
In this document, we have shown the different requirements that need to be met to support the full software development life cycle of dbt models. The table below shows how those requirements have been met.
Will this option satisfy the requirement? If yes, how?
Have a dev setup where different users can do the following (in an isolated way):
Yes, each user can log into the gateway machine after checking out their own branch of the given dbt project codebase.
Have a CI/CD pipeline to push committed changes in the git repo to stage/prod environments
Yes, through a simple git push, delegating to an external CI/CD system.
See logs in stage/prod of the dbt runs
See dbt docs in stage/prod
Support isolation across different users using it in dev
Yes, each session is isolated and uses its own login credentials.
Be able to upgrade the adapters or core seamlessly
Cloudera will publish new GitHub assets under GitHub releases.
Ability to run dbt run regularly to update the models in the warehouse
Yes, via any scheduler like Apache Airflow
Facing issues while working with dbt? We have a troubleshooting guide to help you resolve them.
You can also reach out to email@example.com if you have any questions.