Cloudera Data Platform enables you to work with different public cloud providers from a single console. With this, you can have a true hybrid environment with only one admin console.
Cloudera Data Warehouse is a public cloud service that allows fast analytics in your preferred cloud provider.
In this article, I'll show how easy it is to connect two Virtual Warehouses located in different cloud providers using Cloudera Data Warehouse.
We're using two different cloud providers for Cloudera Data Warehouse: one in AWS with TPC-DS data and another in Azure with the same TPC-DS data. We'll use Hive ACID to update the customer table on Azure and merge it with the customer table in AWS.
For this exercise, you will need access to the Cloudera Data Platform. More information can be accessed here.
Also, since we will use two environments (AWS and Azure), we need to register the environments on the CDP control plane.
For AWS, refer to Introduction to AWS environments
For Azure, refer to Introduction to Azure environments
Figure 1: Environments Used.
Now we can set up the virtual warehouses for each environment that will work with the data.
After the automatic environment setup, we can activate them in Cloudera Data Warehouse:
Figure 2: Environments activated on Cloudera Data Warehouse experience.
And create our virtual warehouses:
Figure 3: Data warehouses created in different environments
Now we can move on to the next part: running our analysis across those two environments. We'll use the Hive JDBC Storage Handler to communicate from one environment to the other.
For each Virtual Warehouse, we've uploaded the TPC-DS data to the corresponding bucket and created the tables (a sketch of one such table setup follows).
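As an illustration (not taken from the original environment), a minimal sketch of how one of those tables could be set up: an external table over the raw files in the bucket, then a managed ACID copy that supports the updates and merges performed later. The bucket path, delimiter, and staging-table name are assumptions for this example:
-- External staging table over the raw TPC-DS files uploaded to the bucket (path is hypothetical)
CREATE EXTERNAL TABLE tpcds.customer_address_ext (
ca_address_sk int,
ca_address_id string,
ca_street_number string,
ca_street_name string,
ca_street_type string,
ca_suite_number string,
ca_city string,
ca_county string,
ca_state string,
ca_zip string,
ca_country string,
ca_gmt_offset float,
ca_location_type string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3a://my-tpcds-bucket/tpcds/customer_address/';
-- Managed (ACID) copy used by the INSERT/UPDATE/MERGE statements later in this article
CREATE TABLE tpcds.customer_address STORED AS ORC AS
SELECT * FROM tpcds.customer_address_ext;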
For the customer table, we want to change a customer's address by adding a new record in the Azure environment and reflecting it in the AWS environment:
First, open Hue for the Azure Virtual Warehouse from the Cloudera Data Warehouse UI and execute the next steps:
Figure 5: Open Hue in Azure Cloudera Data Warehouse
Now we can run the select for the record that we want to change:
select c.c_current_addr_sk, ca.ca_street_name, ca.ca_country
from tpcds.customer c, customer_address ca
where c.c_current_addr_sk = ca.ca_address_sk
and c.c_customer_sk = 11316001;
Figure 6: Address that we want to change.
First, we will insert a new record into the customer_address table. For this, we need to find the last address key registered so the new row won't collide with any current address, as sketched below:
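The exact statement isn't shown here, but a simple max() over the address key does the job (column name taken from the TPC-DS schema used throughout this article):
select max(ca_address_sk) as max_id from tpcds.customer_address;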
Figure 7: Max ID
Now that we have this, we can insert the new address:
insert into tpcds.customer_address
values (6000001, "AAAAAAAACGICKEAA", "5470", "Great America", "Pkwy", NULL,
"Santa Clara", "Santa Clara County", "CA", "95054", "United States", -7.00,
"Business");
After this, we can update the customer record with the new address key (a minimal sketch follows) and check it with the same query that we ran first:
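The article doesn't print the update statement itself; a minimal sketch, assuming the new address key 6000001 from the insert above and the customer used in the earlier select, would be:
update tpcds.customer
set c_current_addr_sk = 6000001
where c_customer_sk = 11316001;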
Figure 8: Updated Address in Azure
Now that we have the data on Azure, let's access Hue in the Cloudera Data Warehouse created on AWS, the same way we accessed Hue in Azure.
In this example, we already have the same schema/tables created in this environment, with the data stored in S3 instead of ADLS.
Figure 9: Schema of tables in AWS Cloudera Environment
Now, in this AWS environment, we want to create customer and address tables pointing to the tables located in the Azure Virtual Warehouse.
For this step, we need the Azure Virtual Warehouse JDBC URL, which we can get from the Cloudera Data Warehouse UI with the Copy JDBC URL button:
Figure 10: Copy JDBC URL from Azure Cloudera Virtual Warehouse
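For reference, the copied URL for a Hive Virtual Warehouse typically looks something like the following (the hostname here is purely illustrative):
jdbc:hive2://hs2-my-azure-vw.env-xxxxx.dw.cloudera.site/default;transportMode=http;httpPath=cliservice;ssl=true;retries=3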
Now we can execute the following script to create the JDBC tables (replace the "hive.sql.jdbc.url" value with the JDBC URL from your Virtual Warehouse, and replace the user/password with your CDP workload user/password):
use tpcds;
-- JDBC-backed table that reads tpcds.customer from the Azure Virtual Warehouse
CREATE TABLE `customer_azure`(
`c_customer_sk` int,
`c_customer_id` string,
`c_current_cdemo_sk` int,
`c_current_hdemo_sk` int,
`c_current_addr_sk` int,
`c_first_shipto_date_sk` int,
`c_first_sales_date_sk` int,
`c_salutation` string,
`c_first_name` string,
`c_last_name` string,
`c_preferred_cust_flag` string,
`c_birth_day` int,
`c_birth_month` int,
`c_birth_year` int,
`c_birth_country` string,
`c_login` string,
`c_email_address` string,
`c_last_review_date` string)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "MYSQL",
"hive.sql.jdbc.driver" = "org.apache.hive.jdbc.HiveDriver",
"hive.sql.jdbc.url" = "<CHANGEMEwithJDBC>",
"hive.sql.dbcp.username" = "<CDPUSERNAME",
"hive.sql.dbcp.password" = "<CDPWORKLOADPASSWORD>",
"hive.sql.query" = "SELECT c_customer_sk, c_customer_id, c_current_cdemo_sk, c_current_hdemo_sk, c_current_addr_sk, c_first_shipto_date_sk, c_first_sales_date_sk, c_salutation, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_day, c_birth_month, c_birth_year, c_birth_country, c_login, c_email_address, c_last_review_date_sk from tpcds.customer",
"hive.sql.dbcp.maxActive" = "1"
);
-- JDBC-backed table that reads tpcds.customer_address from the Azure Virtual Warehouse
CREATE TABLE `customer_address_azure`(
`ca_address_sk` int,
`ca_address_id` string,
`ca_street_number` string,
`ca_street_name` string,
`ca_street_type` string,
`ca_suite_number` string,
`ca_city` string,
`ca_county` string,
`ca_state` string,
`ca_zip` string,
`ca_country` string,
`ca_gmt_offset` float,
`ca_location_type` string)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "MYSQL",
"hive.sql.jdbc.driver" = "org.apache.hive.jdbc.HiveDriver",
"hive.sql.jdbc.url" = "<CHANGEMEwithJDBC>",
"hive.sql.dbcp.username" = "<CDPUSERNAME",
"hive.sql.dbcp.password" = "<CDPWORKLOADPASSWORD>",
"hive.sql.query" = "SELECT ca_address_sk,ca_address_id,ca_street_number,ca_street_name,ca_street_type,ca_suite_number,ca_city,ca_county,ca_state,ca_zip,ca_country,ca_gmt_offset,ca_location_type from tpcds.customer_address",
"hive.sql.dbcp.maxActive" = "1"
);
Figure 11: Azure JDBC Tables created in AWS Cloudera Virtual Warehouse
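Before running the cross-cloud join, a quick sanity check (not part of the original walkthrough) confirms that the JDBC connection to Azure is working:
select count(*) from tpcds.customer_address_azure;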
Now we can query these tables in AWS Cloudera Virtual Warehouse:
select c.c_current_addr_sk, ca.ca_street_name, ca.ca_country
from tpcds.customer_azure c, tpcds.customer_address_azure ca
where c.c_current_addr_sk = ca.ca_address_sk
and c.c_customer_sk = 11316001;
Figure 12: Results from Azure Cloudera DW environment in AWS Cloudera DW Environment.
Just to validate, we can run the same query in the AWS Cloudera environment against the original tables and see that the result is different:
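For reference, this is the same query we ran on Azure, just pointing at the local AWS tables:
select c.c_current_addr_sk, ca.ca_street_name, ca.ca_country
from tpcds.customer c, customer_address ca
where c.c_current_addr_sk = ca.ca_address_sk
and c.c_customer_sk = 11316001;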
Figure 13: Results from AWS Cloudera VW.
Now we can go to the last step of this article, which is also the easiest one. We will use Hive ACID features to refresh the data from the source tables.
Hive ACID provides a powerful upsert capability that can also be used for Slowly Changing Dimensions. Refer to Update Hive Tables the Easy Way Part 2.
First, we will update the AWS customer table based on the results of the customer_azure table:
use tpcds;
merge into
tpcds.customer
using
tpcds.customer_azure as caz
on
customer.c_customer_sk = caz.c_customer_sk
and caz.c_customer_sk = 11316001
when matched then
update set c_current_addr_sk = caz.c_current_addr_sk;
Figure 14: Updating Customer table in Cloudera AWS Data Warehouse using Azure Cloudera Data Warehouse as source
Note that we are not inserting a record when there is no match, and we're not updating other fields, since we only want to demonstrate the address change in this example; both are entirely possible (a combined upsert is sketched after Figure 15 below). Also, in the ON clause, we restrict the merge to a single customer_sk, just for this example. Next, we bring the new address record into the customer_address table on AWS:
merge into
tpcds.customer_address
using
tpcds.customer_address_azure as cadaz
on
customer_address.ca_address_sk = cadaz.ca_address_sk
and cadaz.ca_address_sk = 6000001
when not matched then
insert values (cadaz.ca_address_sk,cadaz.ca_address_id,cadaz.ca_street_number,cadaz.ca_street_name,cadaz.ca_street_type,cadaz.ca_suite_number,cadaz.ca_city,cadaz.ca_county,cadaz.ca_state,cadaz.ca_zip,cadaz.ca_country,cadaz.ca_gmt_offset,cadaz.ca_location_type);
Figure 15: Insert new data into Customer Address table in Cloudera AWS Data Warehouse using Azure Cloudera Data Warehouse as source.
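As noted earlier, both branches can be combined into a single MERGE to perform a true upsert. A minimal sketch against the address tables (not part of the original walkthrough, and without the single-key filter used in the examples above):
merge into tpcds.customer_address
using tpcds.customer_address_azure as cadaz
on customer_address.ca_address_sk = cadaz.ca_address_sk
when matched then
update set ca_street_name = cadaz.ca_street_name, ca_city = cadaz.ca_city, ca_country = cadaz.ca_country
when not matched then
insert values (cadaz.ca_address_sk,cadaz.ca_address_id,cadaz.ca_street_number,cadaz.ca_street_name,cadaz.ca_street_type,cadaz.ca_suite_number,cadaz.ca_city,cadaz.ca_county,cadaz.ca_state,cadaz.ca_zip,cadaz.ca_country,cadaz.ca_gmt_offset,cadaz.ca_location_type);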
Now that we have updated/inserted new data, we can check the data on AWS with the same query that we've executed in Azure:
select c.c_current_addr_sk, ca.ca_street_name, ca.ca_country
from tpcds.customer c, customer_address ca
where c.c_current_addr_sk = ca.ca_address_sk
and c.c_customer_sk = 11316001;
Figure 16: Fresh data into AWS Cloudera Data Warehouse environment with the same view as Azure Cloudera Data Warehouse environment.
With this, we've demonstrated how to access an Azure Cloudera Data Warehouse environment from an AWS Cloudera Data Warehouse environment and use Hive ACID features to upsert the data.
Going further, this can be used as part of a hybrid multi-cloud strategy where one Cloudera environment is used for Machine Learning and the other for the Data Warehouse (or a DEV/PROD strategy). Also, the data/metadata that we've created can be accessed from other experiences, like Data Engineering, Cloudera Machine Learning, Data Flows, and Data Hubs, for a complete end-to-end scenario.
We can also use Cloudera Data Engineering with Airflow to schedule the refresh so it runs periodically.
Refer to Automating data pipelines using Apache Airflow in Cloudera Data Engineering
With Cloudera Viz in an Impala Virtual Warehouse in the same AWS environment that we've used in the steps above, we can create the model:
Figure 17: Data modeling on Cloudera VIZ under an Impala AWS Data Warehouse
Creating the dashboard is pretty easy since we can use the options that Viz shows automatically based on data types:
Figure 18: Dashboard Customer Address Creation.
And we can filter the data to see if the city that we've added is there:
Figure 19: Updated ACID data read via Impala.
We've covered a lot of concepts in this blog post: registering AWS and Azure environments, creating Virtual Warehouses, connecting them with the Hive JDBC Storage Handler, upserting data with Hive ACID, and visualizing the results with Cloudera Viz.
More details on each feature can be found in this community. Stay tuned for more posts!