Community Articles

hajmera · ‎07-11-2022

Overview

This article focuses on Cloudera’s new adapter dbt-spark-livy that enables users to use dbt on top of Apache Spark by connecting to it using Apache Livy. Livy is a REST web service for managing long-running Spark Contexts and submitting Spark jobs. Instead of running the Spark Contexts in the Server itself, Livy manages Contexts running on the cluster managed by a Resource Manager like YARN.

Cloudera Data Platform provides Spark with Livy in the Datahub form factor. Links to a dbt example project that can be used to bootstrap a user’s dbt journey can be found below.

Version Notes:

The adapter has been tested on the following version:

python: 3.9.13
Cloudera Manager CM VERSION 7.6.2 RUNTIME VERSION 7.2.15
CDP Datahub with the template - 7.2.14 - Data Engineering: Apache Spark2/Spark3
dbt-core: 1.3.0

Installing dbt and requirements

Prerequisites:

Make sure python3, git, and pip are available on your system.

Cloning the repo:

Run the following command in your terminal to clone the example project repository:
```
git clone https://github.com/cloudera/dbt-spark-livy-example/
```

Installing the packages:

Run the following commands in your terminal to install the required packages/dependencies using pip/pip3:
```
pip install dbt-core
pip install dbt-spark-livy
```

CDP and Spark

For our development and testing, we used Spark included in Cloudera Runtime version 7. x.x with a data hub cluster. The cluster includes Spark, Livy, Hive, Zeppelin, and Oozie, along with supporting services (HDFS, YARN, and Zookeeper). More info on data hub clusters and templates at Cloudera data hub. You need a data cluster similar to this to be able to test the adapter. Also, note that because of using the datahub cluster there is a Hive Metastore Service (HMS) running in the cluster. Spark interacts with the Hive metastore which is already configured during the creation of the datahub. All datahub clusters are attached to a data lake that runs within an environment (in our case AWS) and provides security and governance for the environment's clusters. A Data Lake storage is created when a Data Lake is first instantiated for an environment. We have AWS S3 as our storage that hive metastore interacts with and uses.

Setting up Cloudera Data Platform (CDP) Spark Datahub Cluster

To work with Datahub Spark in CDP we need a machine user to run the queries.

Creating a machine user

For this demo, we will use a Machine User, created inside CDP as the user running queries. Inside CDP > User Management, add a new Machine User and set the workload password.

The steps for this are documented here:

With the machine user created and the workload password set, take a note of the Workload User Name & Workload Password. Notice in the below screenshot, that for a Machine User called ‘cia_test_user’ the workload username is ‘srv_cia_test_user’.

Note: Keep the Workload User Name & Workload Password details handy for later.

Create Cloudera Datahub Spark cluster

We start by creating a datahub as documented here
Note: If you already have a datahub that has spark, livy, and hive services already created you can skip the creation procedure and note down the endpoints mentioned in step 4 below.
For our use case, we choose a cluster that provides Spark. We went with 7.2.14 - Data Engineering: Apache Spark3. You can choose one of the default available clusters that includes spark, livy, and hive services to create this data hub or customize your own cluster. Once up and running you should have a datahub cluster similar to the below.
Note: The steps are similar if you create a data hub providing Apache Spark(spark v2) as the service instead of spark3 (spark v3).
With the data hub cluster in running state, you can view the details of the Data Hub by clicking on it. The livy server UI and other relevant addresses are accessible from the same page.
Go to Endpoints and note down the Livy Server 3/Livy server endpoint depending on your version of the spark that you used to create the datahub cluster. This is the endpoint dbt will be connecting to

Connecting dbt to Spark-Livy

dbt needs a profile to connect to your data warehouse. The profile lives in a .dbt directory in your home directory and is called profiles.yml.

On Linux, this would look like ~/.dbt/profiles.yml. If you haven't used dbt before, create the directory with the command:

mkdir ~/.dbt

and create the profiles.yml file with your favorite text editor.

You can learn more about the dbt profile from the dbt docs here

Use the following template for the contents of the file:

dbt_spark_livy_demo:
  outputs:
    dev:
     type: spark_livy
     method: livy
     schema: <schemaname>
     dbname: <dbname>
     host: <hostname>
     threads: 2 
     user: <username>
     password: <password>
  target: dev

First, add your Workload User Name and Workload Password to the “user” and “password” fields.

Next, add the “host” field in the template we noted earlier from CDP UI (step 5 of Create Cloudera Data HUB Spark account)

The sample completed profile looks like this:

dbt_spark_livy_demo:
  outputs:
    dev:
      type: spark_livy
      method: livy
      schema: dbt_spark_livy_demo
      dbname: dbt_spark_livy_demo
      host: https://dbt-spark-2-gateway.ciadev.cna2-sx9y.cloudera.site/dbt-spark-2/cdp-proxy-api/livy/
      threads: 2
      user: srv_cia_test_user
      password: Password123!
  target: dev

Note: Because spark uses hive metastore in the background remember to use proper syntax for dbname and schema name. Valid names should only include letters, numbers, and underscores {a-zA-z_0-9). No dashes or hyphens are allowed.

To ensure we’ve configured our profile correctly, run a connection test. In the project directory, for us dbt_spark_livy_demo run the following command:

dbt debug

In the output of this command, you should see the following: All checks passed similar to this.

Running the demo project

In the example repo, we have a sample dbt project called ‘dbt_spark_demo’.

Inside this demo project, we can issue dbt commands to run parts of the project. The demo project contains examples of the following operations and commands:

Generating fake data
dbt seed
dbt test
dbt run
Sources
View models & incremental table models

More commands and detailed descriptions of each command can be found here

Generate fake data

Move to the util/data_gen folder inside dbt-spark-example and run the following command:

python generate_data.py --days 2 --start-date 2022-01-01

This generates 2 days of fake data for the dates 01/01/2022 and 02/01/2022. You can modify the parameters like days and start-date to generate more fake data.

Generated two files raw_covid__cases.csv and raw_covid__vaccines.csv can be found in the util/data_gen/data/ folder. Copy these two files to the seeds folder inside the dbt_spark_livy_demo dbt project directory.

Loading fake data

To run the seeds, move to our dbt project dbt_spark_livy_demo inside our cloned project dbt-spark-livy-example and run the following command:
```
dbt seed
```
The terminal should look like this:
This will load the reference data into the warehouse. Users can visualize this through hue UI

Running tests:

Our Seeds are configured with a couple of tests. Users can read more about it here

We also have a custom test created in dbt_spark_livy_demo/tests/generic/test_length.sql

Description: This test is used to check the character length of a column. Our reference data has columns that include ISO Alpha2 and Alpha3 country codes - we know that these columns should always be 2 or 3 columns respectively. To ensure that our reference data is high quality, we can use dbt to test these assumptions and report the results.

We expect that Alpha2 and Alpha3 columns are the correct lengths and that no fields should be null.

Run the following command to tests:
```
dbt test
```
The terminal should look something like this. Don’t be concerned about the errors::

Running models:

We have 3 sets of models in this demo project.

Firstly, we have raw. Our raw models make use of Sources. This is data that already exists in our database that dbt needs to refer to. This is the fake data we loaded earlier.

Our raw models are defined in models/raw/covid

Next, we have staging. These are Models. Our staging models use the source() method to refer to the Sources we defined in our raw models. The staging models are intermediate views created over our raw data to handle some basic type conversion. These are materialized as views, and we don't expect our end users to query the staging models.

Our staging models are defined in models/staging/covid

Lastly, we have mart. These are Models. Our mart models use the ref() method to refer to the staging models and reference seeds we created using dbt. We use the staging views to handle most of the logic for type casting, but we do some renaming here to make our models easier for users to understand. These models are materialized as tables, giving a greater performance for user queries. We can use incremental models to make building the model more performant.

Our mart models are defined in models/mart/covid

Run models with the following commands:
```
dbt run
```
The terminal should look like this:

Reviewing the data and models

Reviewing the database and table in Hue UI

User should now have the following databases & tables:

reference (database)

ref__population (table)
ref__country_codes (table)

raw_covid (database)

raw_covid__vaccines (table)
raw_covid__cases (table)

staging_covid (database)

stg_covid__cases (view)

mart_covid (database)

covid_cases (table)

In the raw, staging, and mart tables, you should see 2 days' worth of data for the dates 01/01/2022 and 02/01/2022. You can generate more fake data and play around.

Generating and viewing dbt documentation:

Run the following command to generate the documentation:
```
dbt docs generate
```
The terminal should look like this:
Users can read more about dbt docs here
Run the following command to open the documentation on the browser:
```
dbt docs serve –port 8001
```
The dbt docs serve command is only intended for local/development hosting of the documentation site. You can view it locally through http://localhost:8001/ or the port number you used to create the dbt docs serve.
The terminal should look like this:

Notes: dbt's documentation website was built in a way that makes it easy to host on the web. The site itself is "static", meaning that you don't need any type of "dynamic" server to serve the docs. Sample dbt docs site can be viewed here. Some common methods for hosting the docs are:
1. Host on S3
2. Publish on Netlify
3. Spin up a web server like Apache/Nginx

Cloudera Community

Community Articles

Getting started with dbt-spark-livy adapter

Cloudera Innovation Accelerator

DBT Integration

Overview

Version Notes:

Installing dbt and requirements

Prerequisites:

Cloning the repo:

Installing the packages:

CDP and Spark

Setting up Cloudera Data Platform (CDP) Spark Datahub Cluster

Creating a machine user

Create Cloudera Datahub Spark cluster

Connecting dbt to Spark-Livy

Running the demo project

Generate fake data

Loading fake data

Running tests:

Running models:

Reviewing the data and models

Reviewing the database and table in Hue UI

Generating and viewing dbt documentation:

Getting started with dbt-spark-cde adapter

[ANNOUNCE] Updates to the dbt adapters for Apache ...

Getting Started with Spark GraphFrames in Cloudera...

Troubleshooting Guide: dbt adapters

Getting started with dbt-impala

Getting Started with MiNiFi

Getting started with dbt-hive

[VIDEO] How to get started with dbt within CDP

Getting started with SAS and Hadoop

Getting started with Pig-Eclipse