
Overview

Cloudera has implemented dbt adapters for Hive, Spark (CDE and Livy), and Impala. In addition to the adapters, Cloudera offers a turnkey solution for managing the end-to-end software development life cycle (SDLC) of dbt models. This solution is available in all clouds as well as in on-prem deployments of CDP, and is useful for customers who prefer not to use dbt Cloud, whether for security reasons or because these adapters are not available there.

We have identified the following requirements for any solution that supports the end-to-end SDLC of data transformation pipelines using dbt. 

  1. Have multiple environments
    1. Dev
    2. Stage/Test
    3. Prod
  2. Have a dev setup where different users can do the following in an isolated way:
    1. Make changes to models
    2. Test changes
    3. See logs of tests
    4. Update docs in the models and see docs
  3. Have a CI/CD pipeline to push committed changes in the git repo to stage/prod environments
  4. See logs in stage/prod of the dbt runs
  5. See dbt docs in stage/prod
  6. Orchestration: the ability to run dbt run regularly to update the models in the warehouse, or to trigger runs based on events (e.g., from Kafka)
  7. Everything should be part of one application (tool), like CDP or CML
  8. Alerting and monitoring, so that the IT team knows when a run fails

 

Any deployment for dbt should also satisfy the following:

  1. Convenient for analysts - no terminal/shells/installing software on a laptop. Should be able to use a browser.
  2. Support isolation across different users using it in dev
  3. Support isolation between different environments (dev/stage/prod)
  4. Secure login - SAML
  5. Be able to upgrade the adapters or core seamlessly
  6. Vulnerability scans and fixing CVEs
  7. Ability for admins to add and remove dbt users

 

Cloudera Data Platform includes a service, CML, that lets users build and manage all of their machine learning workloads. The same capabilities of CML can be used to satisfy the requirements for the end-to-end SDLC of dbt models.

 

In this document, we will show how an admin can set up the different capabilities in CML (workspaces, projects, sessions, and runtime catalogs) so that analysts can work with their dbt models without having to worry about anything else.

 

First, we will show how an admin can set up

  1. CML workspaces for different environments - development/stage/production
  2. CML runtime catalog within a workspace, using the Cloudera-provided container image with dbt-core and all adapters supported by Cloudera
  3. CML project for stage/prod (i.e., automated/non-development) environments. Analysts create their own projects for their development work.
  4. CML jobs to run the following commands in an automated way on a regular basis
    1. git clone
    2. dbt debug
    3. dbt run
    4. dbt docs generate
  5. CML apps to serve model documentation in stage/prod

 

Next, we will show how an analyst can build, test, and merge changes to dbt models by using

  1. CML project to work in isolation without being affected by other users
  2. CML user Jupyter sessions - an interactive IDE for dbt models
  3. git to get the changes reviewed and pushed to production

Finally, we will show how by using CML all of the requirements listed above can be satisfied.

Administrator steps

Prerequisites

  1. The administrator should have access to the CDP Control Plane and admin permissions on CML
  2. A CDP environment should be available for use. If there isn’t one, create one by following the documentation: Register an AWS environment from CDP UI
  3. Access to a git repository with basic dbt scaffolding (using proxies if needed). If such a repository does not exist, follow the steps in Getting started with dbt Core
  4. Access to the custom runtime catalog (using proxies if needed)
  5. Machine user credentials - username/password or Kerberos - for the stage and production environments. See CDP machine user for creating machine users for Hive/Impala/Spark.

Note: 

The document details a simple setup within CML where we will

  1. use one workspace for dbt across all of the dev/stage/prod environments.
  2. use one project each for stage/prod, and one per user, to provide access isolation
  3. use one Jupyter session per user/analyst for development, testing, and pushing PRs

A more robust setup where dev, stage, and prod are isolated from each other at the hardware level can also be done by creating separate workspaces for each of dev, stage, and prod environments.


We have tested this offering only on AWS at this point, but we are working to test it on other public cloud providers.

 

Step 1. Create a workspace for dbt

  1. Create a new workspace for dbt by clicking Provision Workspace in the CML Workspaces tab.
  2. This opens a form with several fields. Enable the “Advanced Options” toggle. See the table below for the recommended values for each field.

    | Field Name | Recommended Value | Description |
    | --- | --- | --- |
    | Workspace Name | dbt-workspace | This workspace will be used for all of dev/stage/prod. Alternatively, create one workspace each for dev/stage/prod. |
    | Environment | wh-env | Important: pick the same CDP environment that has the DataHub cluster with Hive/Impala/Spark, or the CDW (Hive/Impala) or CDE (Spark) services. Otherwise, you will have to set up the right network settings to allow access to the Hive/Impala/Spark clusters. |
    | CPU Settings | | |
    | Instance Type | m5.2xlarge, 8 CPU, 32 GiB | |
    | Autoscale Range | 0-2 | |
    | Root Volume Size | 512 | |
    | GPU Instances | Disable | dbt does not need GPU instances, so they can be turned off. |
    | Network Settings | | |
    | Subnets for Worker Nodes | subnet1, subnet2 | Pick the subnets that have the DataHub cluster or the CDW/CDE services. See "Find the subnets needed for CML Workspace" below for how to get the subnet values. |
    | Subnets for Load Balancer | subnet1, subnet2 | Same as worker nodes |
    | Load Balancer Source Ranges | Skip | |
    | Enable Fully Private Cluster | Disable | |
    | Enable Public IP Address for Load Balancer | Enable | |
    | Restrict access to Kubernetes API server to authorized IP ranges | Disable | |
    | Production Machine Learning | | |
    | Enable Governance | Disable | |
    | Enable Model Metrics | Disable | |
    | Other Settings | | |
    | Enable TLS | Enable | |
    | Enable Monitoring | Enable | |
    | Skip Validation | Enable | |
    | Tags | Skip | |
    | CML Static Subdomain | Skip | |
  3. Click Provision Workspace after filling out the form.
  4. Workspace creation takes several minutes. You can watch the workspace being created, along with its logs, on the workspace page.
  5. Once created, the workspace shows up in the workspace list.
  6. Click on the workspace to start working within it.

Find the subnets needed for CML Workspace

The admin needs to choose two public subnets from the drop-down; two subnets are a CML requirement. We recommend using the same subnets that run the warehouse instances. If the admin is not sure which two subnets to select, they can find them by following these steps:

  1. From the CDP console, click on the Management Console tile, then click on the name of the environment used to create the CML workspace.
  2. On the Environments page, click on the Summary tab.
  3. Scroll down to the Network section, where all the subnets are listed.

Step 2. Create and enable a custom runtime in CML with dbt

In the workspace screen, click on “Runtime Catalog” to create a custom runtime with dbt.

Step 2.1. Create a new runtime environment

  1. Select the Runtime Catalog from the side menu and click the Add Runtime button.
  2. Paste the most recent dbt-cml Docker image URI and click Validate.
  3. When validation succeeds, click “Add to Catalog”.
  4. The new runtime shows up in the list of runtimes.

     

Step 2.2. Set runtime as default for all new sessions

  1. In the workspace’s side menu, select Site Administration and scroll down to the Engine Images section.
  2. Add a new engine image with the following values:

    | Field | Value | Description |
    | --- | --- | --- |
    | Description | dbt-cml | |
    | Repository:Tag | public.ecr.aws/d7w2o6p0/dbt-cml:1.1.15 | Use the most recent available Docker image |
    | Default | Enable | Makes this runtime the default for all new sessions in this workspace |

Step 3. Set up projects for stage and prod (automated) environments

Admins create projects for stage and prod (and other automated) environments. Analysts can create their own projects.

Creating a new project for stage/prod requires the following steps:

  1. Create a CML project 
  2. Set up environment variables for credentials and scripts

Step 3.1. Create a CML project

  1. From the workspace screen, click on Add Project.
  2. Fill out the basic information for the project:

    | Field | Value | Notes |
    | --- | --- | --- |
    | Project Name | prod-marketing | Name of the dbt project running in stage/prod |
    | Project Description | | |
    | Project Visibility | Private | Private is recommended for prod and stage |
    | Initial Setup | Blank | Git repos are set up separately via CML jobs later in prod/stage |

  3. Runtime setup: click on the “Advanced” tab, add a new runtime with the following values, and click Add Runtime.

    | Field | Value | Notes |
    | --- | --- | --- |
    | Editor | JupyterLab | |
    | Kernel | Python 3.9 | |
    | Edition | dbt custom runtime | The custom runtime that was added earlier |
    | Version | 1.2.0 | Picked up automatically from the custom runtime |

  4. Click the Create Project button at the bottom right corner of the screen.
    Note: JupyterLab creates checkpoint files that can interfere with the dbt project. Follow the steps below to avoid this.

Redirecting JupyterLab’s checkpoints

JupyterLab creates checkpoints in each directory, which interferes with the dbt project file structure and can cause errors. The checkpoints can instead be redirected to a dedicated folder (a sketch of the relevant config setting follows the steps):

  1. Open the terminal by clicking the Terminal tile.
  2. Create a new directory in the /home/cdsw folder by running the following command:
    mkdir checkpoints
  3. Copy the config script to the .jupyter/ directory:
    cp /build/jupyter_notebook_config.py .jupyter/
  4. Restart the session; checkpoints will now be written to the new directory.
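
For reference, the redirect is presumably implemented with the standard Jupyter checkpoints option. Here is a minimal sketch of what the relevant setting in the copied config might look like; this is an assumption, and the actual /build/jupyter_notebook_config.py shipped in the runtime is authoritative.

# Hypothetical sketch of the checkpoint redirect; the shipped config may differ.
c = get_config()  # provided by Jupyter when it loads a config file

# Keep all notebook checkpoints in one folder instead of scattering
# .ipynb_checkpoints directories through the dbt project tree.
c.FileCheckpoints.checkpoint_dir = "/home/cdsw/checkpoints"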

Step 3.2. Set environment variables to be used in automation

To avoid checking profile parameters (user credentials) into git, a user SSH key can be configured for access to the git repo (see How to work with Github repositories in CML/CDSW - Cloudera Community - 303205).

  1. Click Project Settings from the side menu on the project home page and click on the Advanced tab.
  2. Enter the environment variables, clicking the add button to add rows as needed.

    | Key | Value | Notes |
    | --- | --- | --- |
    | DBT_GIT_REPO | https://github.com/cloudera/dbt-impala-example.git | Repository that has the dbt models and profiles.yml |
    | DBT_IMPALA_HOST, DBT_IMPALA_HTTP_PATH, DBT_IMPALA_USER, DBT_IMPALA_PASSWORD, DBT_IMPALA_DBNAME, DBT_IMPALA_SCHEMA | | Impala adapter configs, passed as environment variables |
    | DBT_SPARK_CDE_HOST, DBT_SPARK_CDE_AUTH_ENDPOINT, DBT_SPARK_CDE_USER, DBT_SPARK_CDE_PASSWORD, DBT_SPARK_CDE_SCHEMA | | Spark (CDE) adapter configs |
    | DBT_SPARK_LIVY_HOST, DBT_SPARK_LIVY_USER, DBT_SPARK_LIVY_PASSWORD, DBT_SPARK_LIVY_DBNAME, DBT_SPARK_LIVY_SCHEMA | | Spark (Livy) adapter configs |
    | DBT_HIVE_HOST, DBT_HIVE_HTTP_PATH, DBT_HIVE_USER, DBT_HIVE_PASSWORD, DBT_HIVE_SCHEMA | | Hive adapter configs |
    | DBT_HOME | | |

    Note:

    There could be different environment variables that need to be set depending on the specific engine and access methods like Kerberos or LDAP. Refer to the engine-specific adapter documentation to get the full list of parameters in the credentials.

  3. Click the Submit button on the right side of the section.

Note:
You will have to reference these credential environment variables in the profiles.yml file of the dbt project that is checked into DBT_GIT_REPO. For example, profiles.yml would look like the following:

dbt_impala_demo:
  outputs:
    dev_cia_cdp:
      type: impala
      host: "{{ env_var('DBT_IMPALA_HOST') }}"
      http_path: "{{ env_var('DBT_IMPALA_HTTP_PATH') }}"
      port: 443
      auth_type: ldap
      use_http_transport: true
      use_ssl: true
      username: "{{ env_var('DBT_IMPALA_USER') }}"
      password: "{{ env_var('DBT_IMPALA_PASSWORD') }}"
      dbname: "{{ env_var('DBT_IMPALA_DBNAME') }}"
      schema: "{{ env_var('DBT_IMPALA_SCHEMA') }}"
  target: dev_cia_cdp

 

Note:
Environment variables are very flexible; you can use them for any field in profiles.yml:

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: "{{ env_var('DBT_ENGINE_TYPE') }}"
      host: "{{ env_var('DBT_ENGINE_HOST') }}"
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: "{{ env_var('DBT_ENGINE_PORT') }}"
      dbname: "{{ env_var('DBT_DBNAME') }}"
      schema: "{{ env_var('DBT_SCHEMA') }}"
      threads: "{{ env_var('DBT_THREADS') }}"

 

In Step 4.3 (Set up the dbt debug job), you will be able to verify that the warehouse credentials provided here are correct; a quick environment-variable check is also sketched below.
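
Before wiring up the jobs, you can sanity-check the project-level variables from any session in the project. A minimal sketch (the variable names follow the Impala example above):

# Quick check that the project-level credentials are visible in the session.
import os

required = [
    "DBT_GIT_REPO",
    "DBT_IMPALA_HOST",
    "DBT_IMPALA_USER",
    "DBT_IMPALA_PASSWORD",
]
missing = [name for name in required if not os.environ.get(name)]
print("missing variables:", ", ".join(missing) if missing else "none")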

Step 4. Create jobs and pipeline for stage/prod

Create CML jobs for the following steps and chain them into a pipeline that runs on a regular basis, or whenever a change is pushed to the dbt models repository:

  1. Get the scripts for the different jobs
  2. git clone/pull
  3. dbt debug
  4. dbt run
  5. dbt docs generate

All the scripts for the jobs are available in the provided custom runtime. These scripts rely on the project environment variables created in the previous section. For a sense of what one of them does, see the sketch below.
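
For illustration only, here is a minimal sketch of what a clone/pull step along the lines of job-git-clone.py could do. The scripts shipped under /scripts in the runtime are authoritative; the destination path below is an assumption.

# Hypothetical sketch of a git clone/pull job step; the real job-git-clone.py
# in the runtime image may differ.
import os
import subprocess

repo = os.environ["DBT_GIT_REPO"]  # set at the project level in Step 3.2
dest = os.path.join("/home/cdsw", os.path.basename(repo).removesuffix(".git"))

if os.path.isdir(os.path.join(dest, ".git")):
    # Already cloned by a previous run: fast-forward to the latest commit.
    subprocess.run(["git", "-C", dest, "pull", "--ff-only"], check=True)
else:
    subprocess.run(["git", "clone", repo, dest], check=True)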

Step 4.1. Set up the scripts location

The scripts live under the /scripts folder of the dbt custom runtime. However, the CML jobs file interface only lists files under the home directory (/home/cdsw), so they must be copied there first.

Create a session with the custom runtime, and from the terminal command line copy the scripts to the home folder:

cp -r /scripts /home/cdsw/

Step 4.2. Set up the git clone job

Create a new job for git clone, selecting the job script from the scripts folder copied in Step 4.1. Update the arguments and environment variables, and create the job with the following values:

| Field Name | Value | Comment |
| --- | --- | --- |
| Name | job-git-clone | |
| Script | scripts/job-git-clone.py | The script executed in this job step |
| Arguments | /home/cdsw/dbt-impala-example/dbt_impala_demo | Path of the dbt project, which is part of the repo |
| Editor | JupyterLab | |
| Kernel | Python 3.9 | |
| Edition | dbt custom runtime | |
| Version | 1.1 | |
| Runtime Image | public.ecr.aws/d7w2o6p0/dbt-cml:1.1.15 | Use the most recent available Docker image |
| Schedule | Recurring; every hour | Can be configured as Manual, Recurring, or Dependent |
| Use a cron expression | Check; 0 * * * * | Default value (top of every hour) |
| Resource profile | 1 vCPU / 2 GiB | |
| Timeout In Minutes | - | Optional timeout for the job |
| Environment Variables | | Can be used to override settings passed at the project level (Step 3.2) |
| Job Report Recipients | | Recipients to be notified of job status |
| Attachments | | Attachments, if any |

Step 4.3. Set up the dbt debug job

Create the dbt debug job with the following values:

| Field Name | Value | Comment |
| --- | --- | --- |
| Name | job-dbt-debug | |
| Script | scripts/job-dbt-debug.py | The script executed in this job step |
| Arguments | /home/cdsw/dbt-impala-example/dbt_impala_demo | Path of the dbt project, which is part of the repo |
| Editor | JupyterLab | |
| Kernel | Python 3.9 | |
| Edition | dbt custom runtime | |
| Version | 1.1 | |
| Runtime Image | public.ecr.aws/d7w2o6p0/dbt-cml:1.1.15 | Use the most recent available Docker image |
| Schedule | Dependent; job-git-clone | job-dbt-debug runs only after job-git-clone completes, so the git repo is cloned/updated first |
| Resource profile | 1 vCPU / 2 GiB | |
| Timeout In Minutes | - | Optional timeout for the job |
| Environment Variables | | Can be used to override settings passed at the project level (Step 3.2) |
| Job Report Recipients | | Recipients to be notified of job status |
| Attachments | | Attachments, if any |

Step 4.4. Set up the dbt run job

Create the dbt run job with the following values:

| Field Name | Value | Comment |
| --- | --- | --- |
| Name | job-dbt-run | |
| Script | scripts/job-dbt-run.py | The script executed in this job step |
| Arguments | /home/cdsw/dbt-impala-example/dbt_impala_demo | Path of the dbt project, which is part of the repo |
| Editor | JupyterLab | |
| Kernel | Python 3.9 | |
| Edition | dbt custom runtime | |
| Version | 1.1 | |
| Runtime Image | public.ecr.aws/d7w2o6p0/dbt-cml:1.1.15 | Use the most recent available Docker image |
| Schedule | Dependent; job-dbt-debug | job-dbt-run runs only after job-dbt-debug completes |
| Resource profile | 1 vCPU / 2 GiB | |
| Timeout In Minutes | - | Optional timeout for the job |
| Environment Variables | | Can be used to override settings passed at the project level (Step 3.2) |
| Job Report Recipients | | Recipients to be notified of job status |
| Attachments | | Attachments, if any |

Step 4.5. Set up the dbt docs generate job

Create the dbt docs generate job with the following values:

| Field Name | Value | Comment |
| --- | --- | --- |
| Name | job-doc-generate | |
| Script | scripts/dbt-docs-generate.py | The script executed in this job step |
| Arguments | /home/cdsw/dbt-impala-example/dbt_impala_demo | Path of the dbt project, which is part of the repo |
| Editor | JupyterLab | |
| Kernel | Python 3.9 | |
| Edition | dbt custom runtime | |
| Version | 1.1 | |
| Runtime Image | public.ecr.aws/d7w2o6p0/dbt-cml:1.1.15 | Use the most recent available Docker image |
| Schedule | Dependent; job-dbt-run | Docs are generated only after the models have been updated: job-doc-generate runs only after job-dbt-run completes |
| Resource profile | 1 vCPU / 2 GiB | |
| Timeout In Minutes | - | Optional timeout for the job |
| Environment Variables | | Can be used to override settings passed at the project level (Step 3.2) |
| Job Report Recipients | | Recipients to be notified of job status |
| Attachments | | Attachments, if any |

 

After completing the four steps above, you will have a pipeline of four jobs that run one after the other, each running only when the previous job succeeds.

Step 5. Create an app to serve documentation

The dbt docs generate job produces static HTML documentation for all the dbt models. In this step, you will create a CML application to serve that documentation. The script for the app is available in the provided custom runtime; a rough sketch of what such a script might look like follows.
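
For illustration only, here is a minimal sketch of a static docs server along the lines of scripts/dbt-docs-serve.py. The script shipped in the runtime is authoritative; in particular, reading the port from CDSW_APP_PORT is an assumption based on how CML applications usually receive it.

# Hypothetical sketch of a dbt docs server; the real dbt-docs-serve.py may differ.
import os
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

# TARGET_PATH is set as an application environment variable,
# e.g. /home/cdsw/jaffle_shop/target/ (the output of dbt docs generate).
target = os.environ["TARGET_PATH"]
port = int(os.environ.get("CDSW_APP_PORT", "8080"))

handler = partial(SimpleHTTPRequestHandler, directory=target)
HTTPServer(("127.0.0.1", port), handler).serve_forever()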

  1. Within the project page, click on Applications.
  2. Create a new application.
  3. Click Set Environment Variable and add the environment variable TARGET_PATH, pointing at the target folder that dbt docs generate created inside the dbt project. Use the following values:

    | Field | Value | Comment |
    | --- | --- | --- |
    | Name | dbt-prod-docs-serve | |
    | Domain | dbtproddocs | |
    | Script | scripts/dbt-docs-serve.py | Python script that serves the static HTML docs generated by dbt docs generate; part of the CML runtime image |
    | Runtime | dbt custom runtime | The dbt custom runtime that was added to the runtime catalog |
    | Environment Variables | TARGET_PATH | Target folder path for the dbt docs, e.g. /home/cdsw/jaffle_shop/target/. Make sure the path is exact, especially the '/' characters. |

Note: To update any of the above parameters, go to Applications -> Application details -> Settings, update the application, and click Restart to restart it.

Description of Production/Stage Deployments

Details and logs for jobs

Logs are available in the workspace, in the project folder.

 

The job run details and job logs can be found as follows:

 

  1. Individual job history can be seen in the Jobs section in the side menu.
  2. Job run details can be seen by clicking on any run.

Details and logs for the docs-serving app

Logs for a running application can be found under Applications -> Logs.

Analyst steps

Prerequisites

  1. Each analyst should have their own credentials to the underlying warehouse. They need to set a workload password by following Setting the workload password.
  2. Each analyst has their own schema/database for development and testing.
  3. Each analyst has access to the git repo with the dbt models and can create PRs in that repo with their changes. Admins may have to set up proxies to enable this. If you are creating your own repo as an analyst, refer to Getting started with dbt Core.
  4. A user SSH key can be configured for access to the git repo (see How to work with Github repositories in CML/CDSW - Cloudera Community - 303205).
  5. Each analyst has access to the custom runtime provided by Cloudera. Admins may have to set up proxies to enable this.
  6. Each analyst has permission to create their own project. We suggest that each analyst create their own dev project to work in isolation from other analysts; otherwise, admins will have to create the projects using the steps below and give analysts access.

Step 1. Set up a dev project

Step 1.1. Create a CML project

  1. From the workspace screen, click on Add Project.
  2. Fill out the basic information for the project:

    | Field | Value | Notes |
    | --- | --- | --- |
    | Project Name | username-marketing | If not using a shared project, prefix the project name with the user name so it is easily identified |
    | Project Description | | |
    | Project Visibility | Private | Private is recommended |
    | Initial Setup | Blank | |


  3. Runtime setup: click on the “Advanced” tab, add a new runtime with the following values, and click Add Runtime.

    | Field | Value | Notes |
    | --- | --- | --- |
    | Editor | JupyterLab | |
    | Kernel | Python 3.9 | |
    | Edition | dbt custom runtime | The custom runtime added by the admin in earlier steps |
    | Version | 1.1 | Picked up automatically from the custom runtime |

  4. Click the Create Project button on the bottom right corner of the screen

Step 1.2. Set environment variables 

To avoid checking profile parameters (user credentials) into git, we use environment variables set at the project level.

  1. Click Project Settings from the side menu on the project home page and click on the Advanced tab.
  2. Enter the environment variables, clicking the add button to add rows as needed.

    | Key | Value | Notes |
    | --- | --- | --- |
    | DBT_USER | analyst-user-name | Username used by the analyst. See prerequisites. |
    | DBT_PASSWORD | workload-password | Set the workload password by following Setting the workload password |
    | DBT_HOST | instance host name | |
    | DBT_DBNAME | database to be worked on | |
    | DBT_SCHEMA | schema to be used | |

    Note:
    There could be different environment variables that need to be set depending on the specific engine and access methods like Kerberos or LDAP. Refer to the engine-specific adapter documentation to get the full list of parameters in the credentials.

  3. Click the Submit button on the right side of the section.

    Note:
    You will have to reference these credential environment variables in the profiles.yml file of the dbt project that is checked into DBT_GIT_REPO. For example, profiles.yml would look like the following:

    jaffle_shop:
      target: dev
      outputs:
        dev:
          type: impala
          host: coordinator-dbt-impala.dw-ciadev.cna2-sx9y.cloudera.site
          user: "{{ env_var('DBT_USER') }}"
          password: "{{ env_var('DBT_PASSWORD') }}"
          port: 5432
          dbname: jaffle_shop
          schema: dbt_alice
          threads: 4

    Note:
    Environment variables are very flexible; you can use them for any field in profiles.yml:

    jaffle_shop:
      target: dev
      outputs:
        dev:
          type: "{{ env_var('DBT_ENGINE_TYPE') }}"
          host: "{{ env_var('DBT_ENGINE_HOST') }}"
          user: "{{ env_var('DBT_USER') }}"
          password: "{{ env_var('DBT_PASSWORD') }}"
          port: "{{ env_var('DBT_ENGINE_PORT') }}"
          dbname: "{{ env_var('DBT_DBNAME') }}"
          schema: "{{ env_var('DBT_SCHEMA') }}"
          threads: "{{ env_var('DBT_THREADS') }}"

 

Step 2. Set up a Jupyter session for the development flow

Step 2.1. Create a new Jupyter session

  1. On the project page, click on New Session.
  2. Fill in the form for the session:

    | Field | Value | Notes |
    | --- | --- | --- |
    | Session name | dev-user-session | This private session will be used by the analyst for their work |
    | Editor | JupyterLab | |
    | Kernel | Python 3.9 | |
    | Edition | dbt custom runtime | |
    | Version | 1.1 | Picked up automatically from the runtime |
    | Enable Spark | Disabled | |
    | Runtime image | | Picked up automatically |
    | Resource Profile | 1 vCPU / 2 GB Memory | |
  3. Click on “Start Session”. If any pop-up screens show up, they can be dismissed.
  4. Click on Terminal to open a shell.
Redirecting JupyterLab’s checkpoints

JupyterLab creates checkpoints in each directory, which interferes with the dbt project file structure and can cause errors. As in the admin setup, redirect the checkpoints to a dedicated folder:

  1. Open the terminal by clicking the Terminal tile.
  2. Create a new directory in the /home/cdsw folder by running the following command:
    mkdir checkpoints
  3. Copy the config script to the .jupyter/ directory:
    cp /build/jupyter_notebook_config.py .jupyter/
  4. Restart the session; checkpoints will now be written to the new directory.

Step 2.2. Clone dbt repository to start working on it

Clone the repository from within the terminal (the SSH key for git access is a prerequisite). Sample command:

git clone git@github.com:cloudera/dbt-impala-example.git
Once you have cloned the repo, you can browse its files and edit them in the built-in editor.


If the repository does not already have a profiles.yml, create your own from the terminal and run dbt debug to verify that the connection works:
$ mkdir -p $HOME/.dbt
$ cat > $HOME/.dbt/profiles.yml <<'EOF'
dbt_impala_demo:
  outputs:
    dev:
      type: impala
      host: demodh-manager0.cdpsaasd.eu55-rsdu.cloudera.site
      port: 443
      dbname: dbt_test
      schema: dbt_test
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      auth_type: ldap
      use_http_transport: true
      use_ssl: true
      http_path: demodh/cdp-proxy-api/impala
  target: dev
EOF
$ cd dbt-impala-example/dbt_impala_demo/
$ dbt debug
The environment variables shown above keep user credentials out of the git repo even when profiles.yml is checked in. Alternatively, the variables can be exported on the command line before running the dbt commands:

export DBT_USER=srv_cia_test_user
export DBT_PASSWORD=srv_cia_test_user_password

Now you are all set!

 

You can start making changes to your models in the code editor and testing them.

Conclusion

In this document, we have shown the requirements that need to be met to support the full software development life cycle of dbt models. The table below summarizes how those requirements are met.

 

| Requirement | Will this option satisfy the requirement? If yes, how? |
| --- | --- |
| Have multiple environments (dev, stage, prod) | Yes, as explained above. |
| Dev setup where different users can make changes to models in an isolated way | Yes: per user, in their own session in the workspace, having checked out their own branch of the dbt project codebase. |
| Dev setup: test changes | Yes |
| Dev setup: see logs of tests | Yes |
| Dev setup: update docs in the models and see docs | Yes, by running the dbt docs server as a CML application. |
| Have a CI/CD pipeline to push committed changes in the git repo to stage/prod environments | Yes: either a simple git push delegating to an external CI/CD system, or a CML job. |
| See logs in stage/prod of the dbt runs | Yes |
| See dbt docs in stage/prod | Yes |
| Convenient for analysts: no terminal/shells/installing software on a laptop; usable from a browser | Yes: the user gets a browser-based shell via CML. |
| Support isolation across different users using it in dev | Yes: each session workspace is isolated. |
| Support isolation between different environments (dev/stage/prod) | Yes |
| Secure login (SAML, etc.) | Yes, controlled by the customer via CML. |
| Be able to upgrade the adapters or core seamlessly | Cloudera will publish new runtimes and new versions of the Python packages to PyPI. |
| Vulnerability scans and fixing CVEs | Cloudera will scan the adapters and publish new versions with fixes. |
| Ability to run dbt run regularly to update the models in the warehouse | Yes, via CML jobs. |

 

You can reach out to innovation-feedback@cloudera.com if you have any questions.
