Innovation Blog

[VIDEO] How to get started with dbt within CDP

Cloudera Employee

Introduction

Cloudera recently announced open-source dbt adapters for all the engines in Cloudera Data Platform (CDP): Apache Hive, Apache Impala, and Apache Spark, with added support for Apache Livy and Cloudera Data Engineering.

In addition to providing the adapters, Cloudera is offering a turnkey solution for managing the end-to-end software development life cycle (SDLC) of dbt models. This solution is available on all clouds, as well as in on-prem deployments of CDP.

In this article, we show how our customers' data teams can streamline their data transformation pipelines in the Cloudera Data Platform and deliver high-quality data that their business can trust. Our solution satisfies the stringent security and privacy requirements of our customers while remaining easy for practitioners to use.

dbt’s end-to-end software development life cycle

A key advantage of using dbt is that it provides a framework for analysts to easily follow software engineering best practices for their SQL transformation pipelines. Instead of the typical ad hoc scripting resulting in brittle pipelines, analysts can leverage engineering best practices to build robust, tested, and documented pipelines that produce high-quality data sets that can be trusted by the business. 

Figure 1: Software development life cycle of dbt models

As shown in Figure 1, a dbt user’s workflow typically consists of the following phases: 

  1. Develop - Multiple analysts can independently clone the project to modify or write new SQL models and push changes back to the main branch.
  2. Test - While making changes to models, analysts can include tests to validate data quality. These tests can also be run in production. 
  3. Deploy - dbt encourages a GitOps flow for deployment: a change must be committed to Git before it can be deployed. This requirement acts as a forcing function that simplifies setting up a CI/CD pipeline that automatically deploys committed changes.
  4. Operate - Once deployed, operating dbt in production includes orchestration and viewing documentation.
  5. Monitor/debug - A key aspect of operating dbt in production is monitoring for job failures and data quality test failures, and debugging any issues. The logs generated by dbt contain much of this information and are consumable even by analysts without a deep technical background.
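The first four phases above map fairly naturally onto dbt Core commands. The following is a minimal sketch of that mapping, not part of Cloudera's solution: the phase keys and the `dbt_command` helper are hypothetical names chosen for illustration, while the `run`, `test`, `build`, and `docs generate` subcommands and the `--target` flag come from dbt Core itself.

```python
from typing import Dict, List

# Hypothetical mapping from a workflow phase to a dbt Core subcommand.
# Only the subcommands themselves are real dbt CLI commands; the phase
# names are labels borrowed from the list above.
PHASE_COMMANDS: Dict[str, List[str]] = {
    "develop": ["dbt", "run"],                # build models while iterating
    "test": ["dbt", "test"],                  # run data quality tests
    "deploy": ["dbt", "build"],               # run and test models together
    "operate": ["dbt", "docs", "generate"],   # refresh project documentation
}


def dbt_command(phase: str, target: str) -> List[str]:
    """Return the argument list for a phase, run against a named target."""
    if phase not in PHASE_COMMANDS:
        raise ValueError(f"unknown phase: {phase!r}")
    return PHASE_COMMANDS[phase] + ["--target", target]


if __name__ == "__main__":
    # Print the command line each phase would use against a 'dev' target.
    for phase in PHASE_COMMANDS:
        print(" ".join(dbt_command(phase, "dev")))
```

The argument lists could be passed to `subprocess.run` by a scheduler or CI job; which command belongs to which phase is a judgment call and will vary between teams.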

What it takes to provide an end-to-end solution for dbt

For a customer to use dbt Core and the adapters to build transformation pipelines, a lot of scaffolding needs to be available. Cloudera has identified the requirements for this scaffolding, which enables secure and simple workflows for analysts, and has provided guides for bringing it up natively within the Cloudera Data Platform.

  • Multiple Environments

    dbt makes it easy to maintain separate production and development environments through the use of targets within a profile. Any deployment of dbt therefore needs to support multiple environments, isolated from each other, that correspond to the different steps in the software development life cycle. For example:
    • Dev environment to be used by analysts to edit and test their models and documentation
    • Stage environment to be used for automated testing of committed model and documentation changes
    • Prod environment to be used to build and run models to generate production data sets. Consumers of the models will typically access these production data sets and corresponding documentation
  • Isolated development workspaces

    Different analysts should be able to make changes and test models without affecting the work being performed by other analysts. Different analysts may also have access to different models, so each analyst should only be able to change the models they have access to.
  • CI/CD pipeline

    Any deployment of dbt should allow for changes made to models and documentation to be automatically tested and promoted to production environments. This automation requires the availability of a system to manage workflows.
  • Orchestration

    Typically, models need to be refreshed or updated whenever the underlying source data is updated. Any deployment of dbt should therefore have a mechanism to run a dbt model refresh or update on a schedule, or in response to events such as those delivered through Kafka.
  • Easy access to documentation

    dbt provides a way to generate documentation for a dbt project and render it as a website. Documentation of dbt models helps downstream consumers discover and understand the datasets which are curated for them. Though this documentation can be accessed locally, it can also be hosted remotely and can be accessible to others on the team. Any deployment of dbt should have a way to access this documentation.
  • Web-based UI to develop and deploy in one place

    For analysts to build self-service data pipelines, development, testing, and deployment of SQL models via a CI/CD pipeline must be possible from a single interface, without dependencies on different teams or tools. Any deployment of dbt should offer a single application experience to the analysts.
    • Easy access to logs
      dbt generates logs that are helpful in debugging issues in models as well as investigating performance problems. Any deployment of dbt should allow for these logs to be readily accessible to analysts without requiring any complicated setup.
    • Monitoring & alerting
      Any dbt deployment should have monitoring and alerting capabilities for dbt jobs. This lets the IT team know if there are any issues or job failures.
  • Managed software artifacts

    dbt Core is an open-source project. It is updated from time to time with new features and performance improvements. In addition, Common Vulnerabilities and Exposures (CVEs) need to be fixed. Any dbt deployment should offer a seamless way to upgrade dbt Core and the adapters without having to manage software artifact versions.
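To make the multiple-environments requirement more concrete, here is a rough sketch of the shape of a dbt profile with one target per environment. It assumes the standard `profiles.yml` layout (a profile name, a default `target`, and a set of `outputs`). The profile name, hostnames, and schema names are placeholders invented for illustration, and the real connection fields depend on the adapter in use, so treat this as an outline rather than a working configuration.

```python
import json

ENVIRONMENTS = ["dev", "stage", "prod"]


def build_profile(name: str, default_target: str = "dev") -> dict:
    """Build a dict mirroring the profiles.yml layout, one target per environment."""
    if default_target not in ENVIRONMENTS:
        raise ValueError(f"unknown target: {default_target!r}")
    outputs = {
        env: {
            "type": "impala",                    # adapter type; assumed for this sketch
            "host": f"{env}.example.internal",   # placeholder hostname
            "schema": f"analytics_{env}",        # hypothetical per-environment schema
            "threads": 2,
        }
        for env in ENVIRONMENTS
    }
    return {name: {"target": default_target, "outputs": outputs}}


if __name__ == "__main__":
    # profiles.yml is normally written in YAML; JSON is used here only
    # to display the nested structure.
    print(json.dumps(build_profile("cdp_demo"), indent=2))
```

With a profile of this shape, the same project can be run against a different environment by choosing a target, for example `dbt run --target stage`, which keeps development work from touching production data sets.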

Cloudera solution:

Cloudera has provided a managed software package of dbt Core and all adapters for CDP engines that is maintained and supported by Cloudera. Watch dbt working in CDP Public Cloud.

Demo video:

dbt on Cloudera Data Platform 

The dbt integration with CDP is brought to you by Cloudera’s Innovation Accelerator, a cross-functional team that identifies new industry trends and creates new products and partnerships that dramatically improve the lives of our Cloudera customers’ data practitioners. Learn more with Cloudera’s simple guides to deploy and run dbt in all form factors supported by Cloudera for a truly hybrid solution.

To learn more, contact us at innovation-feedback@cloudera.com.