Community Articles

Find and share helpful community-sourced technical articles.
avatar
Contributor

dbt-impala.png

Overview

In this post we will discuss using dbt with the Cloudera Data Platform, and show you how to get started by connecting dbt to your Impala Data Warehouse. You’ll also find links to a dbt example project that you can use to bootstrap your dbt journey.

Version Notes:

The adapter has been tested on the following version:

  • python: 3.9
  • Cloudera Data Engineering release (1.15-h1)
  • dbt-core: 1.3.0

Cloudera Data Warehouse

Cloudera Data Warehouse (CDW) is a CDP Public Cloud service for self-service creation of independent data warehouses and data marts that autoscale up and down to meet your varying workload demands. The Data Warehouse service provides isolated compute instances for each data warehouse/mart, automatic optimization, and enables you to save costs while meeting SLAs. Both Apache Impala and Apache Hive are available through Cloudera Data Warehouse.

What is dbt?

dbt is quickly gaining popularity as a key component of the modern data stack; a tool that enables the creation of data pipelines & analytics projects using only SQL.

 

In the words of dbtLabs:

“dbt™ is a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation. Now anyone who knows SQL can build production-grade data pipelines.”

Why dbt & CDW?

dbt leverages your existing warehouse to run your workflows, meaning you avoid the complexities of additional hardware/tools/clusters for extracting, transforming and then loading back into the warehouse. 

How to: use dbt with Impala on Cloudera Data Warehouse 

Installing dbt

To use dbt with Impala, you need the following python packages: dbt-core, dbt-impala and impyla.

 

Start by cloning the demo repository at https://github.com/cloudera/dbt-impala-example with:

git clone https://github.com/cloudera/dbt-impala-example.git

 

Inside this repo is a file called requirements.txt, which you can use to install the python dependencies. Install the requirements using pip:

pip install -r requirements.txt

Setup CDP

To work with Impala in CDP, we need two things - an Impala Virtual Warehouse (part of Cloudera Data Warehouse) and a user to run queries.

 

User

For this demo, we will use a Machine User created inside CDP as the user running queries. Inside CDP > User Management, add a new Machine User and set the workload password. 

 

The steps for this are documented here:

  1. Creating a Machine User
    1. https://docs.cloudera.com/management-console/cloud/user-management/topics/mc-create-machine-user.htm... 
  2. Setting the workload password
    1. https://docs.cloudera.com/management-console/cloud/user-management/topics/mc-setting-the-ipa-passwor... 

With the user created & the workload password set, take a note of the Workload username & password. Notice in the below screenshot, for a Machine User called ‘cia_test_user’ the workload username is ‘srv_cia_test_user’. 

alex_wood_0-1651165263767.png

 

Keep the workload user & password details handy for later.

 

Cloudera Data Warehouse Impala

We will be using Impala through Cloudera Data Warehouse - a cloud-native, auto-scaling deployment of Impala.

 

Start by activating your CDW Environment as documented here: https://docs.cloudera.com/data-warehouse/cloud/aws-environments/topics/dw-activating-environments-4-... 

 

This will create a default database catalog, which we will use in this demo. You are able to create non-default database catalogs, as documented here: https://docs.cloudera.com/data-warehouse/cloud/managing-warehouses/topics/dw-adding-new-database-cat... 

 

Next, create an Impala Virtual Warehouse connected to the default database catalog, as documented here: https://docs.cloudera.com/data-warehouse/cloud/managing-warehouses/topics/dw-adding-new-virtual-ware... 

 

The following settings we will be used for Impala Virtual Warehouse in this demo:

alex_wood_1-1651165263803.png

 

Once created, you should see that the Virtual Warehouse enters the running state

alex_wood_2-1651165264073.png

 

From here, select the 3 dots, and then Copy JDBC URL. In my case, this looks like: 

jdbc:impala://coordinator-cia-dbt-impala.dw-ciadev.a465-9q4k.cloudera.site:443/default;AuthMech=3;transportMode=http;httpPath=cliservice;ssl=1;UID=abrown;PWD=PASSWORD

 

alex_wood_3-1651165263778.png

 

Keep this URL handy for later.

Connecting dbt to CDW

 

dbt requires that we configure a profile that defines how to connect to our data warehouse. For this, we need the workload credentials & Impala connection details we collected earlier.

 

The profile lives in a `.dbt` directory in your home directory and is called `profiles.yml`. On Linux, this would look like `~/.dbt/profiles.yml`. If you haven't used dbt before, create the directory with `mkdir ~/.dbt` and create the `profiles.yml` file with your favourite text editor.

 

You can learn more about the dbt profile from the dbt docs here https://docs.getdbt.com/dbt-cli/configure-your-profile 

 

Use the following template for the contents of the file:

dbt_impala_demo:
  outputs:
    dev:
     type: impala
     host: <Impala Hostname>
     port: 443
     dbname: dbt_impala_demo
     schema: dbt_impala_demo
     user: <Workload Username>
     password: <Workload Password>
     auth_type: ldap
     use_http_transport: true
     use_ssl: true
     http_path: cliservice
  target: dev

 

First, add your Workload user/pass to the “user” and “password” fields.

 

Next, we need to extract the Impala hostname from the JDBC URL we copied earlier. We do not want to use the entire JDBC URL.

jdbc:impala://coordinator-cia-dbt-impala.dw-ciadev.a465-9q4k.cloudera.site:443/default;AuthMech=3;transportMode=http;httpPath=cliservice;ssl=1;UID=abrown;PWD=PASSWORD

 

Given the above JDBC URL - we want to extract the hostname between the protocol (“jdbc:impala://”) and the port (“:443”). The result is:

coordinator-cia-dbt-impala.dw-ciadev.a465-9q4k.cloudera.site

 

Use the extracted hostname for the “host” field in the template.

 

My completed profile looks like this:

dbt_impala_demo:
  outputs:
    dev:
     type: impala
     host: coordinator-cia-dbt-impala.dw-ciadev.a465-9q4k.cloudera.site
     port: 443
     dbname: dbt_impala_demo
     schema: dbt_impala_demo
     user: srv_cia_test_user
     password: Password123!
     auth_type: ldap
     use_http_transport: true
     use_ssl: true
     http_path: cliservice
  target: dev

 

To ensure we’ve configured our profile correctly, let’s run a connection test. For this we use the command:

dbt debug

 

In the output of this command, you should see the following:

Connection:
  host: coordinator-cia-dbt-impala.dw-ciadev.a465-9q4k.cloudera.site
  port: 443
  database: dbt_impala_demo
  schema: dbt_impala_demo
  username: srv_cia_test_user
  Connection test: [OK connection ok]

 

This confirms a successful connection to the Impala warehouse.

Running the demo project

 

In the example repo we cloned at the start, we have a demo dbt project called ‘dbt_impala_demo’.

 

Inside this demo project, we can issue dbt commands to run parts of the project. The demo project contains examples for: generating fake data, tests, seeds, sources, view models & incremental table models.

 

To run the seeds, use:

dbt seed

 

To run the tests, use:

dbt test

 

To run the models, use:

dbt run

 

You can also generate dbt documentation using:

dbt docs generate

 

The README for the example repo going into further detail about using the demo project: https://github.com/cloudera/dbt-impala-example/blob/main/README.md and we’ll cover it in further detail in a later Community post.

Conclusion

 

We have covered a quick intro to dbt, and worked through setting up our environment to get dbt connected to Cloudera Data Warehouse. We’ve also introduced the example repo to help bootstrap your journey to CDP.

 

In a later post we’ll cover the example repo in more detail and demonstrate some real use cases for dbt. 

 

If you have any questions or feedback related to dbt on the Cloudera Data Platform, please reach out to us via this community, or drop us an email at innovation-feedback@cloudera.com 

6,262 Views
0 Kudos
Comments
avatar
Explorer

Cool stuff! It does already look very mature to me.

Does it matter if Impala runs in Unified Analytics mode? If yes, does it support UA or do you have plans to support  it?

 

 

avatar
Contributor

Hi @DanielR! Right now, we aren't recommending to use the adapter with a UA enabled VW as we have not thoroughly tested with UA, but it's definitely something that we'd like to support in the near future. We're very open to feedback, so if you'd like to test it with UA, feel free to give it a go - let us know the good, the bad & the ugly!

avatar
Contributor

Hi @sdairs1

 

we try to setup from pip, and get below error :

ERROR: Could not find a version that satisfies the requirement dbt-core>=1.1.0

 

any help appreciate

avatar
Explorer

Hi @wbivp 
Which OS are you trying this on?
Also make sure you have python 3.7 or higher installed. 
Could you try:
pip install dbt-impala==1.1.4 

avatar
Contributor

Hi @tovganesh 

 

Using RHEL 7 and CDP

with python v3.6 

 

ok i will try first

avatar
Explorer

Hi @wbivp I think you may have to update your python environment as dot-core 1.1.x needs at least 3.7

avatar
Contributor

Hi @tovganesh 

 

our private cluster using kerberos and tls, how to setup dbt profile for this requirement ?

 

 

Thanks

avatar
Contributor

thank for the info, i will try

avatar
Explorer
 
I have tested dbt with kerberos configuration, but when I debug the project I got this following error
 
1 check failed:
dbt was unable to connect to the specified database.
The database returned the following error:
 
>Runtime Error
(('Unspecified GSS failure. Minor code may provide more information', 851968), ('Server not found in Kerberos database', -1765328377))
 
Check your database credentials and try again. For more information, visit:
 
And this is my profile config
impala_testing:
outputs:
dev:
type: impala
host: cdpgateway
port: 21000
dbname: testdb
schema: testdb
user: admin
password: admin
auth_type: GSSAPI
kerberos_service_name: None
use_http_transport: true # default value: true
use_ssl: true # TLS should always be used with LDAP to ensure secure transmission of credentials, default value: true
retries: 3
target: dev