Community Articles
Find and share helpful community-sourced technical articles
Cloudera Employee

Accessing AWS Cloudera Data Warehouse to query data on Azure Cloudera Data Warehouse

Introduction

 

Cloudera Data Platform enables in a single console work with different public cloud providers, with this you can have a true hybrid environment with only one admin console.

 

Cloudera Data Warehouse is a public cloud service that allows fast analytics in your preferred cloud provider.

 

In this article I'll show how easy it is to connect between two Virtual Warehouses located in different cloud providers using Cloudera Data Warehouse.

Scenario

 

We're using two different cloud providers for Cloudera Data Warehouse, one in AWS with TPC-DS data and another in Azure with the same TPC-DS data but we'll use Hive ACID to update the customer table on Azure and merge it with customer table in AWS.

 

carrossoni_0-1623358146459.png

1.Pre-Requisites

 

1.1 - Cloudera CDP Control Plane Access and Register two environments

 

For this exercise you'll need access to the Cloudera Data Platform more information can be accessed on: https://docs.cloudera.com/cdp/latest/index.html

 

Also since we will use two environments (AWS and Azure), we need to register the environments on the CDP control plane.

 

For AWS: https://docs.cloudera.com/management-console/cloud/environments/topics/mc-environments.html

 

For Azure:

https://docs.cloudera.com/management-console/cloud/environments-azure/topics/mc-azure-environments.h...

carrossoni_1-1623358146510.png

Figure 1: Environments Used.

 

Now we can set up the Virtual Warehouses for each environment that we'll work with the data.

2.1 - Create two Virtual Warehouses

After the environment automatic setup we can activate in Cloudera Data Warehouse:

carrossoni_2-1623358146489.png

Figure 2: Environments activated on Cloudera Data Warehouse experience.

 

And create our virtual warehouses:

carrossoni_3-1623358146503.png

Figure 3: Data Warehouse's created in different Environments.

 

Now we can go to the next part, to start our analysis between those two environments. We'll use the JDBC Storage Handler to communicate between one environment to another.

2.Prepare database/table 

For each Virtual Warehouse we've uploaded on the bucket the TPC-DS data and created the tables.

2.1 - Change address for customer on Azure using ACID features

For the table customer, we want to change their address with a new register in Azure environment and reflect it on AWS environment:

 

First access Hue on Azure Cloudera Data Warehouse on Cloudera Data Warehouse UI and execute the next steps:

carrossoni_4-1623358146566.png

Figure 5: Open Hue in Azure Cloudera Data Warehouse

 

Now we can perform the select for the registry that we want to change:

select c.c_current_addr_sk, ca.ca_street_name, ca.ca_country 

from tpcds.customer c, customer_address ca

where c.c_current_addr_sk = ca.ca_address_sk

and c.c_customer_sk = 11316001;

 

carrossoni_5-1623358146208.png

Figure 6: Address that we want to change.

 

First we'll insert a new register on the customer_address table, for this we need to find the last number registered so we won't collide with any current address:

carrossoni_6-1623358146501.png

Figure 7: Max ID

 

Now that we've this we can insert the new address:

insert into tpcds.customer_address

values (6000001, "AAAAAAAACGICKEAA", "5470", "Great America", "Pkwy", NULL, 

"Santa Clara", "Santa Clara County", "CA", "95054", "United States", -7.00,

"Business");

 

And after this we can update the customer information with the new id and check it with the same query that we've run first:

carrossoni_7-1623358146564.png

Figure 8: Updated Address in Azure

2.2 - Create the External JDBC Table to connect from AWS Cloudera VW to Azure Cloudera VW

 

Now that we've the data on Azure, let's access Cloudera Data Warehouse created on AWS in Hue using the same method that we've accessed Hue in Azure

 

In this example we've already the same schema/tables created in this environment with the data stored in S3 instead ADLS.

carrossoni_0-1623425187513.png

Figure 9: Schema of tables in AWS Cloudera Environment

 

Now in this AWS we want to create the customer and address table pointing to the tables located in the Azure Virtual Warehouse:

Creating Customer Azure External Table in AWS Cloudera VW:

 

For this step we need the Azure Virtual Warehouse JDBC address, we can get in Cloudera Data Warehouse UI in the Copy JDBC URL button:

carrossoni_9-1623358146562.png

Figure 10: Copy JDBC URL from Azure Cloudera Virtual Warehouse

 

Now we can execute the following script to create the JDBC tables (please change the "hive.sql.jdbc.url" value with the JDBC address from your Virtual Warehouse and also the user/password with your CDP user/password :

use tpcds;


CREATE TABLE `customer_azure`(

  `c_customer_sk` int, 

  `c_customer_id` string, 

  `c_current_cdemo_sk` int, 

  `c_current_hdemo_sk` int, 

  `c_current_addr_sk` int, 

  `c_first_shipto_date_sk` int, 

  `c_first_sales_date_sk` int, 

  `c_salutation` string, 

  `c_first_name` string, 

  `c_last_name` string, 

  `c_preferred_cust_flag` string, 

  `c_birth_day` int, 

  `c_birth_month` int, 

  `c_birth_year` int, 

  `c_birth_country` string, 

  `c_login` string, 

  `c_email_address` string, 

  `c_last_review_date` string)

STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'

TBLPROPERTIES (

  "hive.sql.database.type" = "MYSQL",

  "hive.sql.jdbc.driver" = "org.apache.hive.jdbc.HiveDriver",

  "hive.sql.jdbc.url" = "<CHANGEMEwithJDBC>",

  "hive.sql.dbcp.username" = "<CDPUSERNAME",

  "hive.sql.dbcp.password" = "<CDPWORKLOADPASSWORD>",

  "hive.sql.query" = "SELECT c_customer_sk, c_customer_id, c_current_cdemo_sk, c_current_hdemo_sk, c_current_addr_sk, c_first_shipto_date_sk, c_first_sales_date_sk, c_salutation, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_day, c_birth_month, c_birth_year, c_birth_country, c_login, c_email_address, c_last_review_date_sk from tpcds.customer",

  "hive.sql.dbcp.maxActive" = "1"

);


CREATE TABLE `customer_address_azure`(

  `ca_address_sk` int, 

  `ca_address_id` string, 

  `ca_street_number` string, 

  `ca_street_name` string, 

  `ca_street_type` string, 

  `ca_suite_number` string, 

  `ca_city` string, 

  `ca_county` string, 

  `ca_state` string, 

  `ca_zip` string, 

  `ca_country` string, 

  `ca_gmt_offset` float, 

  `ca_location_type` string)

STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'

TBLPROPERTIES (

  "hive.sql.database.type" = "MYSQL",

  "hive.sql.jdbc.driver" = "org.apache.hive.jdbc.HiveDriver",

  "hive.sql.jdbc.url" = "<CHANGEMEwithJDBC>",

  "hive.sql.dbcp.username" = "<CDPUSERNAME",

  "hive.sql.dbcp.password" = "<CDPWORKLOADPASSWORD>",

  "hive.sql.query" = "SELECT ca_address_sk,ca_address_id,ca_street_number,ca_street_name,ca_street_type,ca_suite_number,ca_city,ca_county,ca_state,ca_zip,ca_country,ca_gmt_offset,ca_location_type from tpcds.customer_address",

  "hive.sql.dbcp.maxActive" = "1"

);

 

carrossoni_1-1623425249161.png

Figure 11: Azure JDBC Tables created in AWS Cloudera Virtual Warehouse

 

Now we can query this tables in AWS Cloudera Virtual Warehouse:

select c.c_current_addr_sk, ca.ca_street_name, ca.ca_country 

from tpcds.customer_azure c, tpcds.customer_address_azure ca

where c.c_current_addr_sk = ca.ca_address_sk

and c.c_customer_sk = 11316001;

 

carrossoni_11-1623358146505.png

Figure 12: Results from Azure Cloudera DW environment in AWS Cloudera DW Environment.

 

Just to validate we can run the same query in AWS Cloudera environment with the Original tables we can see that the result is different:

carrossoni_2-1623425302912.png

Figure 13: Results from AWS Cloudera VW.

 

3.Update data on Cloudera AWS Data Warehouse using Cloudera Azure Data Warehouse tables 

 

Now we can go to the last step of this blog and the easier one, we will use HIVE ACID features to refresh the data from the source tables.

3.1 Use ACID with MERGE syntax to upsert the Customer and Customer Address tables:

Cloudera ACID provide a powerful option to perform upsert that can be also used by Slow Changng Dimensions: https://blog.cloudera.com/update-hive-tables-easy-way-2/

 

First we'll update the AWS customer table based on the results of the customer_azure table:

 

use tpcds;


merge into

 tpcds.customer

using

 tpcds.customer_azure as caz

on

 customer.c_customer_sk = caz.c_customer_sk

 and caz.c_customer_sk = 11316001

when matched then

 update set c_current_addr_sk = caz.c_current_addr_sk;

 

carrossoni_13-1623358146560.png

Figure 14: Updating Customer table in Cloudera AWS Data Warehouse using Azure Cloudera Data Warehouse as source 

 

Note that we're not inserting a register in case it's not matched and we're not updating other fields since we only want to demonstrate the address in this example, but this is completely possible. Also in the WHERE clause we're defining the customer_sk to match to one register just for this example.

 

merge into

 tpcds.customer_address

using

 tpcds.customer_address_azure as cadaz

on

 customer_address.ca_address_sk = cadaz.ca_address_sk

 and cadaz.ca_address_sk = 6000001

when not matched then

 insert values (cadaz.ca_address_sk,cadaz.ca_address_id,cadaz.ca_street_number,cadaz.ca_street_name,cadaz.ca_street_type,cadaz.ca_suite_number,cadaz.ca_city,cadaz.ca_county,cadaz.ca_state,cadaz.ca_zip,cadaz.ca_country,cadaz.ca_gmt_offset,cadaz.ca_location_type);

carrossoni_14-1623358146565.png

Figure 15: Insert new data into Customer Address table in Cloudera AWS Data Warehouse using Azure Cloudera Data Warehouse as source 

 

Now that we have updated/insert new data we can check the data on AWS with the same query that we've executed in Azure:

select c.c_current_addr_sk, ca.ca_street_name, ca.ca_country 

from tpcds.customer c, customer_address ca

where c.c_current_addr_sk = ca.ca_address_sk

and c.c_customer_sk = 11316001;

 

carrossoni_3-1623425353594.png

Figure 16: Fresh data into AWS Cloudera Data Warehouse environment with the same view as Azure Cloudera Data Warehouse environment.

4.Conclusion and Going Further 

 

With this we've demonstrated how to access an Azure Cloudera Data Warehouse environment from a AWS Cloudera Data Warehouse environment and use Hive ACID features to upsert the data.

 

Going further this can be used as an hybrid multi cloud strategy where one Cloudera environment can be used for Machine Learning and other for Data Warehouse (Or DEV/PROD strategy), also this data/metadata that we've created can be accessed from other experiences like Data Engineering, Cloudera Machine Learning Data Flows, Data Hubs to have a complete end-to-end scenario

 

We can also extend Cloudera Data Engineering with Airflow to schedule the refresh so this can be periodically done.

https://docs.cloudera.com/data-engineering/cloud/manage-jobs/topics/cde-airflow-dag-pipeline.html

5.Bonus: Using Impala and Cloudera Viz to present the ACID Data 

 

With Cloudera Viz in a Impala Cloudera Virtual Warehouse using the same AWS environment that we've used on the steps above we can create the model:

carrossoni_16-1623358146629.png

Figure 17: Data modelling on Cloudera VIZ under a Impala AWS Data Warehouse.


Creating the Dashboard is pretty easy since we can use the options that Viz show automatically based on data types:

carrossoni_17-1623358146635.png

Figure 18: Dashboard Customer Address Creation.

 

And we can filter the data to see if the City that we've added it's there:

carrossoni_18-1623358146568.png

Figure 19: Updated ACID data read via Impala.

6.Summary 

 

We've passed through a lot of concepts on this blog post like:

 

  • How to operate in a Hybrid Cloud Warehouse scenario using Cloudera Data Platform and Cloudera Data Warehouse;
  • Hive ACID features;
  • Cloudera Viz;
  • Impala ACID read;

 

More details on each feature can be searched on this community, stay tuned for more posts!

90 Views
0 Kudos