Created on 04-28-2022 10:03 AM - last edited on 11-22-2022 05:14 PM by hajmera
In this post we will discuss using dbt with the Cloudera Data Platform, and show you how to get started by connecting dbt to your Impala Data Warehouse. You’ll also find links to a dbt example project that you can use to bootstrap your dbt journey.
The adapter has been tested on the following version:
Cloudera Data Engineering release (1.15-h1)
Cloudera Data Warehouse
Cloudera Data Warehouse (CDW) is a CDP Public Cloud service for self-service creation of independent data warehouses and data marts that autoscale up and down to meet your varying workload demands. The Data Warehouse service provides isolated compute instances for each data warehouse or mart, optimizes them automatically, and helps you save costs while meeting SLAs. Both Apache Impala and Apache Hive are available through Cloudera Data Warehouse.
What is dbt?
dbt is quickly gaining popularity as a key component of the modern data stack: a tool that enables the creation of data pipelines and analytics projects using only SQL.
In the words of dbt Labs:
“dbt™ is a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation. Now anyone who knows SQL can build production-grade data pipelines.”
Why dbt & CDW?
dbt leverages your existing warehouse to run your workflows, meaning you avoid the complexities of additional hardware/tools/clusters for extracting, transforming and then loading back into the warehouse.
How to: use dbt with Impala on Cloudera Data Warehouse
To use dbt with Impala, you need the following Python packages: dbt-core, dbt-impala and impyla.
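Once the packages are installed, it can be handy to confirm that your Python environment actually sees them. The snippet below is a minimal sketch using only the standard library; the helper names `installed_versions` and `missing_packages` are our own for illustration and are not part of dbt.

```python
from importlib import metadata

# Package names as listed in the text above.
REQUIRED_PACKAGES = ("dbt-core", "dbt-impala", "impyla")


def installed_versions(packages=REQUIRED_PACKAGES):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def missing_packages(packages=REQUIRED_PACKAGES):
    """Return the names of packages that are not installed in this environment."""
    found = installed_versions(packages)
    return [name for name, version in found.items() if version is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If anything is reported as missing, install it into the same environment you plan to run dbt from.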
With the user created and the workload password set, make a note of the workload username and password. In the screenshot below, for a Machine User called ‘cia_test_user’, the workload username is ‘srv_cia_test_user’.
Keep the workload user & password details handy for later.
Cloudera Data Warehouse Impala
We will be using Impala through Cloudera Data Warehouse - a cloud-native, auto-scaling deployment of Impala.
dbt requires that we configure a profile that defines how to connect to our data warehouse. For this, we need the workload credentials & Impala connection details we collected earlier.
The profile lives in a `.dbt` directory in your home directory and is called `profiles.yml`. On Linux, this would look like `~/.dbt/profiles.yml`. If you haven't used dbt before, create the directory with `mkdir ~/.dbt` and create the `profiles.yml` file with your favourite text editor.
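If you would rather script this step, here is a rough sketch that renders a profile and writes it to `~/.dbt/profiles.yml`. Please treat the YAML keys in the template (`auth_type`, `use_http_transport`, `http_path` and so on) as assumptions based on our understanding of the dbt-impala adapter, and check them against the adapter's own documentation before relying on them. The helpers `render_profile` and `write_profile` are hypothetical names, and the host in the usage note is a placeholder.

```python
from pathlib import Path

# Assumed profile layout for the dbt-impala adapter; verify the keys
# against the adapter documentation for your version.
PROFILE_TEMPLATE = """\
{profile_name}:
  target: dev
  outputs:
    dev:
      type: impala
      host: {host}
      port: 443
      http_path: cliservice
      auth_type: ldap
      use_http_transport: true
      use_ssl: true
      username: {username}
      password: {password}
      dbname: {schema}
      schema: {schema}
"""


def render_profile(profile_name, host, username, password, schema):
    """Fill the template with the workload credentials and connection details."""
    return PROFILE_TEMPLATE.format(
        profile_name=profile_name,
        host=host,
        username=username,
        password=password,
        schema=schema,
    )


def write_profile(text, dbt_dir=None):
    """Write profiles.yml into dbt_dir (default: ~/.dbt), creating the directory if needed."""
    target_dir = Path(dbt_dir) if dbt_dir is not None else Path.home() / ".dbt"
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / "profiles.yml"
    path.write_text(text)
    return path
```

For example, `write_profile(render_profile("dbt_impala_demo", "your-impala-host.example.com", "srv_cia_test_user", "your-workload-password", "dbt_demo"))` would create the file in your home directory. Note that the sketch does no YAML quoting, so a password containing special characters would need to be quoted by hand.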
This confirms a successful connection to the Impala warehouse.
Running the demo project
In the example repo we cloned at the start, we have a demo dbt project called ‘dbt_impala_demo’.
Inside this demo project, we can issue dbt commands to run parts of the project. The demo project contains examples for: generating fake data, tests, seeds, sources, view models & incremental table models.
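As an illustration, the standard dbt commands `dbt seed`, `dbt run` and `dbt test` can be chained from a small Python wrapper. This is only a sketch: the step order is our assumption, the demo project may need extra steps (for example to generate the fake data), and `build_commands` and `run_steps` are hypothetical helper names.

```python
import subprocess

# Assumed order: load seeds, build models, then run tests.
DEFAULT_STEPS = ("seed", "run", "test")


def build_commands(steps=DEFAULT_STEPS):
    """Turn step names into dbt command lines, e.g. "seed" -> ["dbt", "seed"]."""
    return [["dbt", step] for step in steps]


def run_steps(project_dir, steps=DEFAULT_STEPS, runner=subprocess.run):
    """Run each dbt step in order inside project_dir.

    With the default runner, check=True raises CalledProcessError on the
    first failing step, so later steps are skipped.
    """
    for command in build_commands(steps):
        runner(command, cwd=project_dir, check=True)
```

Calling `run_steps("dbt_impala_demo")` from the root of the cloned repo would then run the three steps in sequence. The `runner` parameter exists so the wrapper can be exercised without dbt installed.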
We have covered a quick intro to dbt, and worked through setting up our environment to get dbt connected to Cloudera Data Warehouse. We’ve also introduced the example repo to help bootstrap your journey to CDP.
In a later post we’ll cover the example repo in more detail and demonstrate some real use cases for dbt.
If you have any questions or feedback related to dbt on the Cloudera Data Platform, please reach out to us via this community, or drop us an email at firstname.lastname@example.org