Community Articles

Find and share helpful community-sourced technical articles.
avatar
Contributor

Introduction

How do we quickly gain insight and start working with data in a secure, governed, and scalable environment in the cloud?

 

This article explains how to achieve this using the Cloudera Data Warehouse platform connected with Apache Superset.

 

Cloudera Data Warehouse in CDP (Cloudera Data Platform) is an enterprise solution for modern analytics. It's an auto-scaling, highly concurrent, and cost-effective hybrid, a multi-cloud analytics solution that ingests data anywhere, at massive scale, from structured, unstructured, and edge sources.

 

Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application.

Pre-Requisites

This exercise is performed on a Mac OS. The versions below were tested at the moment of writing this article and may change in the future:

 

Python

 

 

 

 

Python 3.7.5 
pip 20.0.2 

 

 

 

 


After Python/pip installation, install the following packages/versions in Python (we recommend using 
venv before this step):

 

 

 

 

impyla==0.16.1
thrift==0.13.0
thrift_sasl==0.2.1

 

 

 

 

Apache Superset Configuration

Apache Superset can be installed on your machine or executed in a Docker environment. In this example, we will use the steps provided in Python Virtualenv and the version is:

 

 

 

 

apache-superset==0.999.0.dev0

 

 

 

 

After setting up the environment, you can access Superset UI with in the following address:

 

 

 

 

 http://127.0.0.1:8088/

 

 

 

 

carrossoni_0-1589752069302.png

Figure 1: Welcome to Apache Superset

The default username/password is admin/admin.

Cloudera Data Warehouse

If you don't have an Impala Virtual Warehouse (used in this example), you need to create one that will connect to the Database Catalog. This is a very simple step and can be done in minutes. Once you have created a virtual warehouse, if your Database Catalog already has the Tables, Security, and Metadata Definitions to be used, you or the user/application (in our case Apache Superset) can start using the platform. More information can be obtained in this link.

carrossoni_1-1589752069559.png

Figure 2: Cloudera Data Warehouse

 

Here, we will be using the "default-impala" Virtual Warehouse. Since the environment is not running and nobody is using it, it is not consuming any resources. After the Virtual Warehouse creation, you will need to collect the URL to connect to your environment like the following example:

carrossoni_2-1589752069557.png

Figure 3: Getting Access URL in Cloudera Data Warehouse

Once you save the access URL, you can configure the Dashboard in Apache Superset.

Configure Cloudera Data Warehouse as Source Database

After the prerequisites, we'll configure the connection in Apache Superset. To start creating the dashboard in Cloudera Data Warehouse, perform the following

  1. Click Source > Database in the top left menu:

    carrossoni_3-1589752069491.png

                                                    Figure 4: Configuring Source Database
  2. On the top right corner click in the "Add new record" button:

    carrossoni_4-1589752069248.png

                                                      Figure 5: Add new database button
  3. Now, we need to put the configuration in the following screen:

    carrossoni_5-1589752069484.png

                                                      Figure 6: Configuring Database

 

jdbc:impala://example-default-impala.env-pkXXXX.dwx.example.site:443/default;AuthMech=3;transportMode=http;httpPath=cliservice;ssl=1;UID=luizcarrossoni;PWD=PASSWORDTo:impala://example-default-impala.env-pkXXXX.dwx.example.site:443/default?auth_mechanism=PLAIN&http_path=cliservice&use_http_transport=True&use_ssl=TrueExpose in SQL Lab: CheckedAllow Multi Schema Metadata Fetch: CheckedExtra: Here, we'll pass our Cloudera Data Platform access credentials, there are other ways to do this that are more secure in Apache Superset:{ "metadata_params": {}, "engine_params": { "connect_args": { "user" : "<cdpuser>", "password" : "<password>" } } }

Database Name: Choose a name for example "ClouderaDataPlatform"

SQLAlchemy URI:  We'll use the Access URL that we got in Cloudera Console, we need to customize the URI in order to use impyla and the URL supported by SQLAlchemy:

From:

 

After providing the config information, click the Test button in the SQLAlchemy URI Field, to see if everything is working properly. If the Virtual Warehouse is in Stopped state, it'll first start the Warehouse and then you'll see that the test was successful:

carrossoni_6-1589752069299.png

                                                     Figure 7: Starting Virtual Warehouse

carrossoni_7-1589752069318.png

                                                Figure 8: Connection Successful

Now you can save the connection and start creating your dashboards.

Query Data through SQL Lab

You can query the data in the Virtual Warehouse using SQL Lab in Superset:

carrossoni_8-1589752069303.png

Figure 9: Query Data in SQL Lab

Note: Since the table is querying the data that supposedly has PII information (ccnumber), the data comes as hashes. This is because we have the following policy in place for the user:

carrossoni_9-1589752069271.png

 

 Figure 10: Masking Policy

Create your Dashboard

To create the Dashboard using Apache Superset in Cloudera Data Platform, do the following:

  1. Add the table as a source in the following menu:

    carrossoni_10-1589752069270.png

                                                             Figure 11: Add Table Source
  2. Add the ww_customers_data table to start creating the dashboard:

    carrossoni_11-1589752069526.png

                                                         Figure 12: Create Source Table
  3. Create Charts using the source table that is created and use the charts in a Dashboard:

carrossoni_12-1589752069526.png

 

4,259 Views