Cloudera has implemented dbt adapters for Hive, Spark (CDE and Livy), and Impala. In addition to the adapters, Cloudera offers a turn-key solution for managing the end-to-end software development life cycle (SDLC) of dbt models. This solution is available in all clouds as well as in on-prem deployments of CDP. It is useful for customers who prefer not to use dbt Cloud, whether for security reasons or because the Cloudera adapters are not available in dbt Cloud.
We have identified the following requirements for any solution that supports the end-to-end SDLC of data transformation pipelines using dbt.
Any deployment of dbt should also satisfy the following:
Cloudera Data Platform includes a service, Cloudera Machine Learning (CML), which offers users the ability to build and manage all of their machine learning workloads. The same capabilities of CML can also be used to satisfy the requirements for the end-to-end SDLC of dbt models.
In this document, we will show how an admin can set up the different capabilities in CML like workspaces, projects, sessions, and runtime catalogs so that an analyst can work with their dbt models without having to worry about anything else.
First, we will show how an admin can set up
Next, we will show how an analyst can build, test, and merge changes to dbt models by using
Finally, we will show how, by using CML, all of the requirements listed above can be satisfied.
Note: The document details a simple setup within CML where we will
A more robust setup, where dev, stage, and prod are isolated from each other at the hardware level, can also be achieved by creating a separate workspace for each of the dev, stage, and prod environments. We have tested this offering only on AWS at this point, but we are working to test it on other public cloud providers.
| Field Name | Recommended Value | Description |
| --- | --- | --- |
| Workspace Name | dbt-workspace | This workspace will be used for all of dev/stage/prod. Alternatively, create one workspace each for dev, stage, and prod. |
| Environment | wh-env | Important: pick the same CDP environment that has the datahub cluster with Hive/Impala/Spark, or the CDW (Hive/Impala) or CDE (Spark) services. Otherwise, you will have to set up the right network settings to allow access to the Hive/Impala/Spark clusters. |
| CPU Settings | | |
| Instance Type | m5.2xlarge, 8 CPU, 32 GiB | |
| Autoscale Range | 0-2 | |
| Root Volume Size | 512 | |
| GPU Instances | Disable | dbt does not need GPU instances, so they can be turned off. |
| Network Settings | | |
| Subnets for Worker Nodes | subnet1, subnet2 | Pick the subnets that have the datahub cluster or the CDW/CDE services. See the documentation below on how to get the subnet values. |
| Subnets for Load Balancer | subnet1, subnet2 | Same as worker nodes. |
| Load Balancer Source Ranges | Skip | |
| Enable Fully Private Cluster | Disable | |
| Enable Public IP Address for Load Balancer | Enable | |
| Restrict access to Kubernetes API server to authorized IP ranges | Disable | |
| Production Machine Learning | | |
| Enable Governance | Disable | |
| Enable Model Metrics | Disable | |
| Other Settings | | |
| Enable TLS | Enable | |
| Enable Monitoring | Enable | |
| Skip Validation | Enable | |
| Tags | Skip | |
| CML Static Subdomain | Skip | |
The admin will need to choose two public subnets from the drop-down; two subnets are a CML requirement. We recommend using the same subnets that run the warehouse instances. If the admin is not sure which two subnets to select, they can obtain them by following these steps:
In the workspace screen, click on “Runtime Catalog” to create a custom runtime with dbt.
| Field | Value | Description |
| --- | --- | --- |
| Description | dbt-cml | |
| Repository:Tag | public.ecr.aws/d7w2o6p0/dbt-cml:1.1.15 | Find the most updated docker image here. |
| Default | Enable | Make this runtime the default for all new sessions in this workspace. |
Admins create projects for stage and prod (and other automated) environments. Analysts can create their own projects.
Creating a new project for stage/prod requires the following steps:
| Field | Value | Notes |
| --- | --- | --- |
| Project Name | prod-marketing | Name of the dbt project running in stage/prod. |
| Project Description | | |
| Project Visibility | Private | Private is recommended for prod and stage. |
| Initial Setup | Blank | We will set up git repos separately via CML Jobs later in prod/stage. |
| Field | Value | Notes |
| --- | --- | --- |
| Editor | JupyterLab | |
| Kernel | Python 3.9 | |
| Edition | dbt custom runtime | This is the custom runtime that was added earlier. |
| Version | 1.2.0 | This version is automatically picked up from the custom runtime. |
JupyterLab creates checkpoint folders in each directory, which interferes with the dbt project file structure and can cause errors. To avoid this, redirect the checkpoints to a dedicated folder:

mkdir checkpoints
cp /build/jupyter_notebook_config.py .jupyter/

Restart the session and checkpoints will be redirected to the specified directory.
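For reference, here is a minimal sketch of what the copied jupyter_notebook_config.py needs to contain. It assumes the standard Jupyter FileCheckpoints API; the actual file shipped in /build may differ.

# jupyter_notebook_config.py -- minimal sketch; the file shipped in /build may differ.
# Point Jupyter's checkpoint manager at a single folder outside the dbt project
# tree, so .ipynb_checkpoints directories no longer appear next to dbt files.
import os

c = get_config()  # noqa: F821 -- injected by Jupyter when it loads config files
c.FileCheckpoints.checkpoint_dir = os.path.join(
    os.environ.get("HOME", "/home/cdsw"), "checkpoints"
)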
To avoid checking profile parameters (user credentials) into git, a user SSH key can be configured for access to the git repo (see How to work with Github repositories in CML/CDSW - Cloudera Community - 303205).
| Key | Value | Notes |
| --- | --- | --- |
| DBT_GIT_REPO | | Repository that has the dbt models and profiles.yml. |
| DBT_IMPALA_HOST, DBT_IMPALA_HTTP_PATH, DBT_IMPALA_USER, DBT_IMPALA_PASSWORD, DBT_IMPALA_DBNAME, DBT_IMPALA_SCHEMA, DBT_SPARK_CDE_HOST, DBT_SPARK_CDE_AUTH_ENDPOINT, DBT_SPARK_CDE_PASSWORD, DBT_SPARK_CDE_USER, DBT_SPARK_CDE_SCHEMA, DBT_SPARK_LIVY_HOST, DBT_SPARK_LIVY_USER, DBT_SPARK_LIVY_PASSWORD, DBT_SPARK_LIVY_DBNAME, DBT_SPARK_LIVY_SCHEMA, DBT_HIVE_HOST, DBT_HIVE_HTTP_PATH, DBT_HIVE_USER, DBT_HIVE_PASSWORD, DBT_HIVE_SCHEMA, DBT_HOME | | Adapter-specific configs passed as environment variables. |
Note: Different environment variables may need to be set depending on the specific engine and access method (e.g., Kerberos or LDAP). Refer to the engine-specific adapter documentation for the full list of credential parameters.
Note: Environment variables are flexible; you can use them for any field in profiles.yml. For example, the dbt_impala_demo profile shown later in this document reads the user and password with "{{ env_var('DBT_USER') }}" and "{{ env_var('DBT_PASSWORD') }}".
In Step 4.3 (Set up the dbt debug job), you will be able to test that the credentials provided for the warehouse are accurate.
CML Jobs will be created for the following tasks, which run in order as a pipeline on a regular schedule or whenever a change is pushed to the dbt models repository.
All the scripts for the jobs are available in the custom runtime that is provided. These scripts rely on the project environment variables that have been created in the previous section.
The scripts are present under the /scripts folder as part of the dbt custom runtime. However, the CML Jobs file interface only lists files under the home directory (/home/cdsw).
Create a session with the custom runtime:
and from the terminal command line, copy the scripts to the home folder.
cp -r /scripts /home/cdsw/
Create a new job for git clone and select the job script from the scripts folder copied in Step 4.1.
Update the arguments and environment variables, and create the job.
| Field Name | Value | Comment |
| --- | --- | --- |
| Name | job-git-clone | |
| Script | scripts/job-git-clone.py | This is the script that will be executed in this job step. |
| Arguments | /home/cdsw/dbt-impala-example/dbt_impala_demo | Path of the dbt project directory, which is part of the repo. |
| Editor | JupyterLab | |
| Kernel | Python 3.9 | |
| Edition | dbt custom runtime | |
| Version | 1.1 | |
| Runtime Image | public.ecr.aws/d7w2o6p0/dbt-cml:1.1.15 | Find the most updated docker image here. |
| Schedule | Recurring; Every hour | This can be configured as Manual, Recurring, or Dependent. |
| Use a cron expression | Check; 0 * * * * | Default value (runs at the start of every hour). |
| Resource profile | 1 vCPU / 2 GiB | |
| Timeout In Minutes | - | Optional timeout for the job. |
| Environment Variables | | These can be used to override settings passed at the project level (Section 3.2). |
| Job Report Recipients | | Recipients to be notified of job status. |
| Attachments | | Attachments, if any. |
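For reference, here is a minimal sketch of what job-git-clone.py can look like. It assumes the script clones (or updates) the repository named in DBT_GIT_REPO into the home directory; the script packaged in the runtime may differ in its details.

# job-git-clone.py -- minimal sketch; the packaged script may differ.
# Clones the dbt models repository on the first run, then fast-forwards it on
# every subsequent run, so downstream jobs always see the latest models.
import os
import subprocess
import sys

repo = os.environ["DBT_GIT_REPO"]  # e.g. git@github.com:cloudera/dbt-impala-example.git
home = os.environ.get("HOME", "/home/cdsw")
# Derive the checkout directory from the repository name, e.g. dbt-impala-example.
target = os.path.join(home, os.path.basename(repo).removesuffix(".git"))

if os.path.isdir(os.path.join(target, ".git")):
    result = subprocess.run(["git", "-C", target, "pull", "--ff-only"])
else:
    result = subprocess.run(["git", "clone", repo, target])

# A non-zero exit code marks the CML job as failed, so dependent jobs do not run.
sys.exit(result.returncode)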
| Field Name | Value | Comment |
| --- | --- | --- |
| Name | job-dbt-debug | |
| Script | scripts/job-dbt-debug.py | This is the script that will be executed in this job step. |
| Arguments | /home/cdsw/dbt-impala-example/dbt_impala_demo | Path of the dbt project directory, which is part of the repo. |
| Editor | JupyterLab | |
| Kernel | Python 3.9 | |
| Edition | dbt custom runtime | |
| Version | 1.1 | |
| Runtime Image | public.ecr.aws/d7w2o6p0/dbt-cml:1.1.15 | Find the most updated docker image here. |
| Schedulable | Dependent | Make sure that this job runs only after cloning/updating the git repo. |
| Dependent on | job-git-clone | job-dbt-debug depends on job-git-clone and will run only after it completes. |
| Resource profile | 1 vCPU / 2 GiB | |
| Timeout In Minutes | - | Optional timeout for the job. |
| Environment Variables | | These can be used to override settings passed at the project level (Section 3.2). |
| Job Report Recipients | | Recipients to be notified of job status. |
| Attachments | | Attachments, if any. |
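For reference, here is a minimal sketch of the debug job. It assumes the script simply runs dbt debug in the project directory passed as the job argument; job-dbt-run and dbt-docs-generate below follow the same pattern with dbt run and dbt docs generate. The packaged scripts may differ.

# job-dbt-debug.py -- minimal sketch; the packaged script may differ.
# Runs `dbt debug` in the project directory passed as the job argument
# (e.g. /home/cdsw/dbt-impala-example/dbt_impala_demo). dbt resolves the
# env_var() references in profiles.yml from the project environment variables.
import subprocess
import sys

project_dir = sys.argv[1]
result = subprocess.run(["dbt", "debug"], cwd=project_dir)

# Propagate dbt's exit code so job-dbt-run starts only if the debug check passed.
sys.exit(result.returncode)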
| Field Name | Value | Comment |
| --- | --- | --- |
| Name | job-dbt-run | |
| Script | scripts/job-dbt-run.py | This is the script that will be executed in this job step. |
| Arguments | /home/cdsw/dbt-impala-example/dbt_impala_demo | Path of the dbt project directory, which is part of the repo. |
| Editor | JupyterLab | |
| Kernel | Python 3.9 | |
| Edition | dbt custom runtime | |
| Version | 1.1 | |
| Runtime Image | public.ecr.aws/d7w2o6p0/dbt-cml:1.1.15 | Find the most updated docker image here. |
| Schedulable | Dependent | Make sure that this job depends on the dbt debug job. |
| Dependent on | job-dbt-debug | job-dbt-run depends on job-dbt-debug and will run only after it completes. |
| Resource profile | 1 vCPU / 2 GiB | |
| Timeout In Minutes | - | Optional timeout for the job. |
| Environment Variables | | These can be used to override settings passed at the project level (Section 3.2). |
| Job Report Recipients | | Recipients to be notified of job status. |
| Attachments | | Attachments, if any. |
| Field Name | Value | Comment |
| --- | --- | --- |
| Name | job-doc-generate | |
| Script | scripts/dbt-docs-generate.py | This is the script that will be executed in this job step. |
| Arguments | /home/cdsw/dbt-impala-example/dbt_impala_demo | Path of the dbt project directory, which is part of the repo. |
| Editor | JupyterLab | |
| Kernel | Python 3.9 | |
| Edition | dbt custom runtime | |
| Version | 1.1 | |
| Runtime Image | public.ecr.aws/d7w2o6p0/dbt-cml:1.1.15 | Find the most updated docker image here. |
| Schedulable | Dependent | Generate docs only after the models have been updated. |
| Dependent on | job-dbt-run | job-doc-generate depends on job-dbt-run and will run only after it completes. |
| Resource profile | 1 vCPU / 2 GiB | |
| Timeout In Minutes | - | Optional timeout for the job. |
| Environment Variables | | These can be used to override settings passed at the project level (Section 3.2). |
| Job Report Recipients | | Recipients to be notified of job status. |
| Attachments | | Attachments, if any. |
After following the four steps above, you will have a pipeline of four jobs that run one after the other, each starting only when the previous job succeeds.
The dbt docs generate job generates static HTML documentation for all the dbt models. In this step, you will create an app to serve the documentation. The script for the app will be available in the custom runtime that is provided.
| Field | Value | Comment |
| --- | --- | --- |
| Name | dbt-prod-docs-serve | |
| Domain | dbtproddocs | |
| Script | scripts/dbt-docs-serve.py | Python script to serve the static HTML docs generated by dbt docs generate. This is part of the CML runtime image. |
| Runtime | dbt custom runtime | The dbt custom runtime that was added to the runtime catalog. |
| Environment Variables | TARGET_PATH | Target folder path for the dbt docs, e.g. /home/cdsw/jaffle_shop/target/. Make sure the path is exact, including the trailing '/'. |
Note: To update any of the above parameters, go to Applications -> Application details -> Settings, update the application, and click Restart to restart it.
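For reference, here is a minimal sketch of what a docs-serving script can look like. It assumes the app serves the static files in TARGET_PATH on the port CML assigns to applications via CDSW_APP_PORT; the script packaged in the runtime may differ.

# dbt-docs-serve.py -- minimal sketch; the packaged script may differ.
# Serves the static HTML generated by `dbt docs generate` from TARGET_PATH.
import os
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

target_path = os.environ["TARGET_PATH"]  # e.g. /home/cdsw/jaffle_shop/target/
port = int(os.environ.get("CDSW_APP_PORT", "8080"))  # port CML assigns to the app

# CML applications are expected to listen on 127.0.0.1:$CDSW_APP_PORT.
handler = partial(SimpleHTTPRequestHandler, directory=target_path)
HTTPServer(("127.0.0.1", port), handler).serve_forever()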
Logs are available in the workspace, in the project folder.
The job run details and job logs can be found as follows:
Logs for a running application can be found under Applications -> Logs.
| Field | Value | Notes |
| --- | --- | --- |
| Project Name | username-marketing | If not using a shared project, we suggest prefixing the project name with the user name so that it is easily identified. |
| Project Description | | |
| Project Visibility | Private | Private is recommended for prod. |
| Initial Setup | Blank | |
| Field | Value | Notes |
| --- | --- | --- |
| Editor | JupyterLab | |
| Kernel | Python 3.9 | |
| Edition | dbt custom runtime | This is the custom runtime that was added by the admin in earlier steps. |
| Version | 1.1 | This version is automatically picked up from the custom runtime. |
To avoid checking profile parameters (user credentials) into git, we leverage environment variables set at the project level.
| Key | Value | Notes |
| --- | --- | --- |
| DBT_USER | analyst-user-name | Username used by the analyst. See prerequisites. |
| DBT_PASSWORD | workload-password | Set the workload password by following Setting the workload password. |
| DBT_HOST | Instance host name | |
| DBT_DBNAME | Database name to be worked on | |
| DBT_SCHEMA | Schema used | |
Note: Environment variables are flexible; you can use them for any field in profiles.yml. The example profiles.yml in the next section reads the user and password with "{{ env_var('DBT_USER') }}" and "{{ env_var('DBT_PASSWORD') }}".
| Field | Value | Notes |
| --- | --- | --- |
| Session name | dev-user-session | This private session will be used by the analyst for their work. |
| Runtime | | |
| Editor | JupyterLab | |
| Kernel | Python 3.9 | |
| Edition | dbt custom runtime | |
| Version | 1.1 | Automatically picked up from the runtime. |
| Enable Spark | Disabled | |
| Runtime image | | Automatically picked up. |
| Resource Profile | 1 vCPU / 2 GiB Memory | |
JupyterLab creates checkpoint folders in each directory, which interferes with the dbt project file structure and can cause errors. As in the admin setup above, redirect the checkpoints to a dedicated folder:

mkdir checkpoints
cp /build/jupyter_notebook_config.py .jupyter/

Restart the session and checkpoints will be redirected to the specified directory.
Clone the repository from within the terminal. Note that the ssh key for git access is a prerequisite.
Sample command:
git clone git@github.com:cloudera/dbt-impala-example.git
Once you clone the repo, you can browse the files in the repo and edit them in the built-in editor.
If the repository does not already have a profiles.yml, create your own profiles.yml within the terminal and run dbt debug to verify that the connection works.
$ mkdir $HOME/.dbt
$ cat > $HOME/.dbt/profiles.yml
dbt_impala_demo:
  outputs:
    dev:
      type: impala
      host: demodh-manager0.cdpsaasd.eu55-rsdu.cloudera.site
      port: 443
      dbname: dbt_test
      schema: dbt_test
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      auth_type: ldap
      use_http_transport: true
      use_ssl: true
      http_path: demodh/cdp-proxy-api/impala
  target: dev

$ cd dbt-impala-example/dbt_impala_demo/
$ dbt debug
The environment variables shown above keep user credentials out of the git repo even if profiles.yml is checked in. Alternatively, the environment variables can be exported on the command line before executing the dbt commands:
export DBT_USER=srv_cia_test_user
export DBT_PASSWORD=srv_cia_test_user_password
Now you are all set!
You can start making changes to your models in the code editor and testing them.
In this document, we have shown the different requirements that need to be met to support the full software development life cycle of dbt models. The table below summarizes how those requirements are met.
| Requirement | Will this option satisfy the requirement? If yes, how? |
| --- | --- |
| Have multiple environments | Yes, as explained above. |
| Have a dev setup where different users can do the following (in an isolated way): | Yes, per user in their Session in the workspace, having checked out their own branch of the given dbt project codebase. |
| | Yes |
| | Yes |
| | Yes, by running the dbt docs server as a CML Application. |
| Have a CI/CD pipeline to push committed changes in the git repo to stage/prod environments | Yes, either: |
| See logs in stage/prod of the dbt runs | Yes |
| See dbt docs in stage/prod | Yes |
| Convenient for analysts: no terminal/shells/installing software on a laptop; should be able to use a browser | Yes, the user gets a shell via CML. |
| Support isolation across different users using it in dev | Yes, each Session workspace is isolated. |
| Support isolation between different environments (dev/stage/prod) | Yes |
| Secure login (SAML, etc.) | Yes, controlled by the customer via CML. |
| Be able to upgrade the adapters or core seamlessly | Cloudera will publish new runtimes and new versions of the Python packages to PyPI. |
| Vulnerability scans and fixing CVEs | Cloudera will scan the adapters and publish new versions with fixes. |
| Ability to run dbt run regularly to update the models in the warehouse | Yes, via CML Jobs. |
You can reach out to innovation-feedback@cloudera.com if you have any questions.