- Subscribe to RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Printer Friendly Page
- Report Inappropriate Content
Created on 08-25-2022 11:23 PM - edited 06-26-2023 05:25 AM
Overview
This handbook focuses on Cloudera’s new adapter dbt-hive that enables users to use dbt on top of the Hive data warehouse system.
Links to a dbt example project that can be used to bootstrap a user’s dbt journey can be found below.
Version Notes:
The adapter has been tested on the following version:
- python: 3.9.13
- Cloudera Manager CM VERSION 7.6.2 RUNTIME VERSION 7.2.15
- Cloudera Data Warehouse on CDP v 1.4.2-b118
- dbt-core: 1.2.0
Installing dbt and requirements
Prerequisites:
Cloning the repo:
- Run the following command in your terminal to clone the example project repository:
git clone https://github.com/cloudera/dbt-hive-example/
Installing the packages:
- Run the following commands in your terminal to install the required packages/dependencies using pip/pip3:
python3 -m pip install dbt-core
python3 -m pip install sqlparams==3.0.0
python3 -m pip install impyla==0.18
python3 -m pip install dbt-hive
Cloudera Data Warehouse
Cloudera Data Warehouse (CDW) is a CDP Public Cloud service for the self-service creation of independent data warehouses and data marts that autoscale up and down to meet varying workload demands. The Data Warehouse service provides isolated compute instances for each data warehouse/mart, automatic optimization and enables you to save costs while meeting SLAs. Both Apache Impala and Apache Hive are available through Cloudera Data Warehouse.
Setup CDP
To work with Hive in CDP, we need two things - a Hive Virtual Warehouse (part of Cloudera Data Warehouse) and a user to run queries.
User
For this demo, we will use a Machine User created inside CDP as the user running queries. Inside CDP > User Management, add a new Machine User and set the workload password.
The steps for this are documented here:
With the user created & the workload password set, take a note of the Workload username & password. Notice in the below screenshot, that for a Machine User called ‘cia_test_user’ the workload username is ‘srv_cia_test_user’.
Keep the workload user & password details handy for later.
Setting up Cloudera Data Warehouse (CDW) Cluster
Cloudera Data Warehouse (CDW) Public Cloud service, you can create independent data warehouses and data marts for teams of business analysts without the overhead of bare metal deployments. CDW includes Database Catalogs and Virtual Warehouses that you use to access your data.
We will be using Hive through Cloudera Data Warehouse
Start by activating your CDW Environment as documented here
This will create a default database catalog, which we will use in this demo. You are able to create non-default database catalogs, as documented here
Next, create a Hive Virtual Warehouse connected to the default database catalog, as documented here
Choose the virtual warehouse type as a hive.
The following settings will be used for Hive Virtual Warehouse in this demo:
Once created, you should see that the Virtual Warehouse enters the running state
From here, select the 3 dots, and then Copy JDBC URL. In my case, this looks like this:
jdbc:hive2://hs2-dbt-hive.dw-ciadev.cna2-sx9y.cloudera.site/default;transportMode=http;httpPath=cliservice;socketTimeout=60;ssl=true;retries=3;
Keep this URL handy for later.
Connecting dbt to Hive warehouse
dbt needs a profile to connect to your data warehouse. The profile lives in a .dbt directory in your home directory and is called profiles.yml.
On Unix-based systems, this would look like ~/.dbt/profiles.yml. If you haven't used dbt before, create the directory with the command:
mkdir ~/.dbt
and create the profiles.yml file with your favourite text editor.
You can learn more about the dbt profile from the dbt docs here.
Use the following template for the contents of the file:
dbt_hive_demo:
outputs:
dev:
type: hive
schema: <schema_name>
dbname: <db_name>
auth_type: ldap
host: hs2-dbt-hive.dw-ciadev.cna2-sx9y.cloudera.site
threads: <no_of_threads>
user: srv_cia_test_user
password: Password123!
target: dev
First, add your Workload User Name and Workload Password to the “user” and “password” fields.
Next, add the “host” field in the template we noted earlier from CDP UI
The sample completed profile looks like this:
dbt_hive_demo:
outputs:
dev:
type: hive
schema: dbt_hive_demo
dbname: dbt_hive_demo
auth_type: ldap
host: hs2-dbt-hive.dw-ciadev.cna2-sx9y.cloudera.site
threads: 2
user: srv_cia_test_user
password: Password123!
target: dev
Note: Remember to use proper syntax for dbname and schema name. Valid names should just include letters, numbers, and underscores {a-zA-z_0-9). No dashes or hyphens are allowed.
To ensure we’ve configured our profile correctly, run a connection test. In the project directory, for us dbt_hive_demo run the following command:
dbt debug
In the output of this command, you should see the following:
All checks passed similar to this.
Running the demo project
In the example repo, we have a sample dbt project called ‘dbt_hive_demo’.
Inside this demo project, we can issue dbt commands to run parts of the project. The demo project contains examples of the following operations and commands:
- Generating fake data
- dbt seed
- dbt test
- dbt run
- Sources
- View models and incremental table models
More commands and a detailed description of each command can be found here.
Generate fake data
Move to the util/data_gen folder inside dbt-hive-example and run the following command:
python generate_data.py --days 2 --start-date 2022-01-01
This generates 2 days of fake data for the dates 01/01/2022 and 02/01/2022. You can modify the parameters like days and start-date to generate more fake data.
Generated two files raw_covid__cases.csv and raw_covid__vaccines.csv can be found in the util/data_gen/data/ folder. Copy these two files to the seeds folder inside the dbt_hive_demo dbt project directory.
Loading fake data
To run the seeds, move to our dbt project dbt_hive_demo inside our cloned project dbt-hive-demo and run the following command:
dbt seed
The terminal should look like this:
This will load the reference data into the warehouse. User can visualize this through hue UI
Running tests:
Our Seeds are configured with a couple of tests. Users can read more about it here.
We also have a custom test created in dbt_hive_demo/tests/generic/test_length.sql
Description: This test is used to check the character length of a column. Our reference data has columns that include ISO Alpha2 and Alpha3 country codes - we know that these columns should always be 2 or 3 columns respectively. To ensure that our reference data is high quality, we can use dbt to test these assumptions and report the results.
We expect that Alpha2 and Alpha3 columns are the correct lengths and that no fields should be null.
Run the following command to tests:
dbt test
The terminal should look something like this. Don’t be concerned about the errors.
Running models:
We have 3 sets of models in this demo project.
Firstly, we have raw. Our raw models make use of Sources. This is data that already exists in our database that dbt needs to refer to. This is the fake data we loaded earlier.
Our raw models are defined in models/raw/covid
Next, we have staging. These are Models. Our staging models use the source() method to refer to the Sources we defined in our raw models. The staging models are intermediate views created over our raw data to handle some basic type conversion. These are materialized as views, and we don't expect our end users to query the staging models.
Our staging models are defined in models/staging/covid
Lastly, we have mart. These are Models. Our mart models use the ref() method to refer to the staging models and reference seeds we created using dbt. We use the staging views to handle most of the logic for type casting, but we do some renaming here to make our models easier for users to understand. These models are materialized as tables, as this gives greater performance for user queries. We can use incremental models to make the building of the model more performant.
Our mart models are defined in models/mart/covid
- Run models with the following commands:
dbt run
The terminal should look like this:
Reviewing the data and models
Reviewing the database and table in Hue UI
User should now have the following databases & tables:
- reference (database)
- ref__population (table)
- ref__country_codes (table)
- raw_covid (database)
- raw_covid__vaccines (table)
- raw_covid__cases (table)
- staging_covid (database)
- stg_covid__cases (view)
- mart_covid (database)
- covid_cases (table)
In the raw, staging, and mart tables, you should see 2 days' worth of data for the dates 01/01/2022 and 02/01/2022. You can generate more fake data and play around.
Generating and viewing dbt documentation:
- Run the following command to generate the documentation:
dbt docs generate
The terminal should look like this:
Users can read more about dbt docs here
- Run the following command to generate the documentation:
dbt docs serve --port 8001
The dbt docs serve command is only intended for local/development hosting of the documentation site. You can view it locally through http://localhost:8001/ or the port number you used to create the dbt docs serve. The terminal should look like this:
Notes: dbt's documentation website was built in a way that makes it easy to host on the web. The site itself is "static", meaning that you don't need any type of "dynamic" server to serve the docs. Some common methods for hosting the docs are:
- Host on S3
- Publish on Netlify
- Spin up a web server like Apache/Nginx
Sample dbt docs site can be viewed here
Created on 08-26-2023 10:00 AM
- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Report Inappropriate Content
When I executed dbt run, an error was reported : Unable to establish connection to Hive server: Error while compiling statement: FAILED: ParseException line 1:0 cannot recognize input near '/' '*' '{'