Created on 05-17-2020 02:58 PM - edited on 06-10-2020 10:54 PM by VidyaSargur
How do we quickly gain insight and start working with data in a secure, governed, and scalable environment in the cloud?
This article explains how to achieve this using the Cloudera Data Warehouse platform connected with Apache Superset.
Cloudera Data Warehouse in CDP (Cloudera Data Platform) is an enterprise solution for modern analytics. It is an auto-scaling, highly concurrent, and cost-effective hybrid and multi-cloud analytics solution that ingests data from anywhere, at massive scale, from structured, unstructured, and edge sources.
Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application.
This exercise was performed on macOS. The versions below were tested at the time of writing and may change in the future:
Python
Python 3.7.5
pip 20.0.2
After Python/pip installation, install the following packages/versions in Python (we recommend using venv before this step):
impyla==0.16.1
thrift==0.13.0
thrift_sasl==0.2.1
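To confirm that the environment matches the pinned versions above, a small check like the following can help. This is a minimal sketch and not part of the original setup: the helper names `parse_pins` and `check_installed` are hypothetical, and the code assumes `importlib.metadata`, which is available from Python 3.8 onward (on Python 3.7 the `importlib_metadata` backport provides the same API).

```python
from importlib import metadata

# Pinned versions listed in this article.
PINS = """
impyla==0.16.1
thrift==0.13.0
thrift_sasl==0.2.1
"""


def parse_pins(text):
    """Parse 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def check_installed(pins):
    """Return {name: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_installed(parse_pins(PINS)).items():
        print(f"{name}: expected {expected}, found {installed}")
```

Running the script inside the activated venv prints one line per package whose installed version differs from the pin, and prints nothing when everything matches.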
Apache Superset can be installed on your machine or run in a Docker environment. In this example, we follow the Python Virtualenv installation steps, using this version:
apache-superset==0.999.0.dev0
After setting up the environment, you can access the Superset UI at the following address:
http://127.0.0.1:8088/
Figure 1: Welcome to Apache Superset
The default username/password is admin/admin.
If you don't have an Impala Virtual Warehouse (used in this example), you need to create one that will connect to the Database Catalog. This is a very simple step and can be done in minutes. Once you have created a virtual warehouse, if your Database Catalog already has the Tables, Security, and Metadata Definitions to be used, you or the user/application (in our case Apache Superset) can start using the platform. More information can be obtained in this link.
Figure 2: Cloudera Data Warehouse
Here, we will be using the "default-impala" Virtual Warehouse. Since the environment is not running and nobody is using it, it is not consuming any resources. After the Virtual Warehouse creation, you will need to collect the URL to connect to your environment like the following example:
Figure 3: Getting Access URL in Cloudera Data Warehouse
Once you save the access URL, you can configure the Dashboard in Apache Superset.
With the prerequisites in place, we'll configure the connection in Apache Superset. To start creating the dashboard on Cloudera Data Warehouse, configure the database with the following settings:
Figure 6: Configuring Database
Database Name: Choose a name, for example "ClouderaDataPlatform"
SQLAlchemy URI: We'll use the Access URL that we got in the Cloudera Console. We need to customize the URI in order to use impyla and the URL format supported by SQLAlchemy:
From:
jdbc:impala://example-default-impala.env-pkXXXX.dwx.example.site:443/default;AuthMech=3;transportMode=http;httpPath=cliservice;ssl=1;UID=luizcarrossoni;PWD=PASSWORD
To:
impala://example-default-impala.env-pkXXXX.dwx.example.site:443/default?auth_mechanism=PLAIN&http_path=cliservice&use_http_transport=True&use_ssl=True
Expose in SQL Lab: Checked
Allow Multi Schema Metadata Fetch: Checked
Extra: Here, we'll pass our Cloudera Data Platform access credentials; there are other ways to do this that are more secure in Apache Superset:
{
  "metadata_params": {},
  "engine_params": {
    "connect_args": {
      "user": "<cdpuser>",
      "password": "<password>"
    }
  }
}
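The From/To rewrite of the access URL described above can be sketched in a few lines of Python. This is an illustrative helper, not something Superset or impyla ships: the function name `jdbc_to_sqlalchemy_uri` is hypothetical, and the property mapping covers only the JDBC properties that appear in this article's example. The credentials (`UID`, `PWD`) are deliberately dropped, since they are passed through the Extra field instead.

```python
from urllib.parse import urlencode

# JDBC (property, value) pairs mapped to the query parameters used in the
# SQLAlchemy URI of this article. Only the values shown here are handled.
_JDBC_TO_IMPYLA = {
    ("authmech", "3"): ("auth_mechanism", "PLAIN"),
    ("transportmode", "http"): ("use_http_transport", "True"),
    ("ssl", "1"): ("use_ssl", "True"),
}


def jdbc_to_sqlalchemy_uri(jdbc_url):
    """Convert a jdbc:impala:// access URL into an impala:// SQLAlchemy URI."""
    if not jdbc_url.startswith("jdbc:impala://"):
        raise ValueError("expected a jdbc:impala:// URL")
    # Drop the 'jdbc:' prefix, then split 'impala://host:port/db' from the
    # ';'-separated properties.
    base, *props = jdbc_url[len("jdbc:"):].split(";")
    params = {}
    for prop in props:
        if not prop:
            continue
        key, _, value = prop.partition("=")
        lkey = key.lower()
        if lkey in ("uid", "pwd"):
            continue  # credentials are supplied via the Extra field
        if lkey == "httppath":
            params["http_path"] = value
        elif (lkey, value.lower()) in _JDBC_TO_IMPYLA:
            name, mapped = _JDBC_TO_IMPYLA[(lkey, value.lower())]
            params[name] = mapped
        # Unknown properties are ignored in this sketch.
    query = urlencode(sorted(params.items()))  # sorted for stable output
    return base + ("?" + query if query else "")


if __name__ == "__main__":
    jdbc = (
        "jdbc:impala://example-default-impala.env-pkXXXX.dwx.example.site:443/default;"
        "AuthMech=3;transportMode=http;httpPath=cliservice;ssl=1;"
        "UID=luizcarrossoni;PWD=PASSWORD"
    )
    print(jdbc_to_sqlalchemy_uri(jdbc))
```

Properties the sketch does not recognize are silently skipped, so an access URL with a different authentication mechanism would need its own mapping entry.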
After providing the configuration, click the Test button in the SQLAlchemy URI field to check that everything is working properly. If the Virtual Warehouse is in the Stopped state, the test first starts the Warehouse, and then you'll see that the test was successful:
Figure 7: Starting Virtual Warehouse
Figure 8: Connection Successful
Now you can save the connection and start creating your dashboards.
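If the connection test fails, it can help to try the same settings outside Superset using impyla directly. The following is a minimal sketch under stated assumptions: `build_connect_kwargs` and `list_tables` are hypothetical helper names, the host is the placeholder from this article, and the `CDP_USER` and `CDP_PASSWORD` environment variables are an illustrative way to avoid hardcoding credentials. The keyword arguments mirror the query parameters of the SQLAlchemy URI above.

```python
import os


def build_connect_kwargs(host, user, password, database="default"):
    """Keyword arguments for impala.dbapi.connect, mirroring the URI settings."""
    return {
        "host": host,
        "port": 443,
        "database": database,
        "auth_mechanism": "PLAIN",
        "use_ssl": True,
        "use_http_transport": True,
        "http_path": "cliservice",
        "user": user,
        "password": password,
    }


def list_tables(host, user, password):
    """Connect to the Virtual Warehouse and return the table names."""
    # Imported lazily so the helper above works without impyla installed.
    from impala.dbapi import connect

    conn = connect(**build_connect_kwargs(host, user, password))
    try:
        cursor = conn.cursor()
        cursor.execute("SHOW TABLES")
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()


if __name__ == "__main__":
    # Placeholder host from this article; replace with your own access URL host.
    host = "example-default-impala.env-pkXXXX.dwx.example.site"
    print(list_tables(host, os.environ["CDP_USER"], os.environ["CDP_PASSWORD"]))
```

As with the Test button, the first call may take a while if the Virtual Warehouse has to start.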
You can query the data in the Virtual Warehouse using SQL Lab in Superset:
Figure 9: Query Data in SQL Lab
Note: Because the query reads a column that contains PII (ccnumber), the values are returned as hashes. This is because we have the following policy in place for the user:
Figure 10: Masking Policy
To create the Dashboard using Apache Superset in Cloudera Data Platform, do the following: