Member since
11-26-2018
20
Posts
5
Kudos Received
0
Solutions
02-28-2022
03:27 PM
4 Kudos
Introduction
The stock market is a perennial topic for data analysis examples, so how can we easily create an ingestion pipeline that delivers quick insights with Cloudera Data Platform?
How about using the new Apache Iceberg table format available in CDP to get new features with everything integrated and secured? See Introducing Apache Iceberg in Cloudera Data Platform.
In this example, I'll show how to use Cloudera Public Cloud to pull data from a REST API and (with only clicks and parameter input) quickly start querying the data, building dashboards, and running advanced analytics.
In this example we will:
Load intraday stock data into the object storage periodically (D-1 data) with Cloudera Dataflow;
Process the data with Spark in Cloudera Data Engineering and schedule it with Airflow. This job checks for newly arrived data and then runs MERGE INTO (new) on the stock Iceberg (new) table; the code is here.
Analyze and query data with Cloudera Data Warehouse/Cloudera Data Visualization;
Perform TIME TRAVEL/SNAPSHOTS (new) queries on the table;
Also, with Iceberg you can perform schema evolution and benefit from open-source optimizations and engine interoperability using open formats (like Parquet).
Although the stock information is D-1, we will schedule the flow to run every 10 minutes to identify whether there is a new stock ticker in the parameters and ingest its data.
Again, all without coding and using only saved templates! The following is the architecture for this example:
Pre-Requisites
To download the stock information, we will use the free API (at the time of this writing) from Alpha Vantage. So first, register to get the API key that will be used and save it.
You will also need the name of the bucket path where you will save the data. Now, with all the information that we need (API key, bucket), let's start!
The following is the list of parameters that we will need to fill in:
Alpha Vantage API Key;
Root Bucket used by CDP;
Workload username;
Workload password;
The components of Cloudera Data Platform that need to be available are:
Cloudera Data Warehouse with Visualization enabled;
Cloudera Dataflow;
Cloudera Data Engineering (Spark 3) with Airflow enabled;
Create the flow to ingest stock data via API to Object Storage
First, download the template file located here and access Cloudera DataFlow, where we can upload the template to start loading data into your object storage.
In the Cloudera DataFlow UI, click "Catalog" → "Import Flow Definition";
Enter a name and description for your flow and select the CDF template file that you've downloaded:
Click Import;
After importing, select your flow and, in the menu, click the blue Deploy button:
Select the CDP Flow Environment that you are using and then continue:
Enter your deployment name and click Next;
Do not change any NiFi configuration and click Next;
Now it will ask for the parameters that we need to provide. Input the following:
CDP Password: Workload password in CDP
CDP User: Workload user in CDP
S3 Path: Subfolder that will be created under the main <bucket>/user/<youruser>/<s3path> (only the folder name, without the s3a:// prefix)
S3 Bucket: The main bucket used by CDP (only the name, without the s3a:// prefix)
API Alpha Key: The API key that will be used (the demo key can only get IBM stock data)
Stock_list: The list of stocks that you want to load; put one ticker per line. Note: It is possible to change the stock list after deployment to ingest data for new tickers. We will do this to demonstrate the Iceberg time travel feature.
Click Next and select the size of the NiFi node and max scaling:
Here you can define KPI indicators; we will leave it as is and click Next
Review and click Deploy
Done! In minutes we will start receiving stock information into our bucket! If you want you can check in your bucket under the path s3a://<cdpbucket>/user/<yourusername>/<cdppath>/new:
Create Iceberg Tables
For this, you can use the script below to create the tables in an Impala Virtual Warehouse connected to your SDX environment in CDP:
CREATE DATABASE stocks;
CREATE TABLE IF NOT EXISTS stocks.stock_intraday_1min
(interv STRING,output_size STRING,time_zone STRING,open DECIMAL(8,4),high DECIMAL(8,4),low DECIMAL(8,4),close DECIMAL(8,4),volume BIGINT )
PARTITIONED BY (ticker STRING,last_refreshed string,refreshed_at string)
STORED AS iceberg;
Go to the Cloudera Data Warehouse UI and open Hue, where we will run the script to create the Iceberg table used by our queries:
Leave Hue open to query data later.
Note: You can change the database name; for this and the next example, I'm using stocks.
Process and Ingest Iceberg using CDE
Now we will use Cloudera Data Engineering to check the files in object storage, determine whether they contain new data, and insert them into the Iceberg table. For this, download the jar, and in the Cloudera CDE UI, go to your Virtual Spark Cluster and click View Jobs.
Click Jobs → Create Job:
Name: Enter the job name, e.g., StockIceberg
File: Upload the jar file stockdatabase_2.11-1.0.jar (create a resource via the drop-down button)
Main Class: com.cloudera.cde.stocks.StockProcessIceberg
Arguments:
<databasename> → (ex: stocks)
<S3 Bucket> → (Same bucket used in Dataflow with the complete path ex: s3a://carrossoni-sa-east/)
<S3 Path> → (Same path used in Dataflow ex: stocks)
<CDP User> → (Same user used in Dataflow ex: carrossoni)
Schedule: Enable it and set it to run every 10 minutes with the crontab expression */10 * * * *, then click Schedule.
This job will run every 10 minutes to check whether there is any new ticker. To run it now for the first time, click the three dots under the job's Actions and click "Run Now":
Click "Job Runs" to check whether the job is done. It will take around 3 minutes to spin up the resources in Kubernetes and execute the pipeline that ingests the new data into the final table.
Also, you can check the cluster:
This application is very simple; it will:
Check for new files in the new directory;
Create a temporary table in Spark, cache it, and identify duplicated rows (in case NiFi loaded the same data again);
MERGE INTO the final table, inserting new rows or updating existing ones (a rough sketch follows this list);
Archive the processed files in the bucket.
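For reference, the MERGE step could look roughly like the Spark SQL sketch below; the staging view name and join keys are assumptions for illustration, and the actual logic lives in the linked jar:
-- Hypothetical staging view (new_stock_data) holding the deduplicated rows read from the "new" directory
MERGE INTO stocks.stock_intraday_1min t
USING new_stock_data s
ON t.ticker = s.ticker AND t.refreshed_at = s.refreshed_at
WHEN MATCHED THEN UPDATE SET *     -- row already exists: update it
WHEN NOT MATCHED THEN INSERT *;    -- row is new: insert it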
After execution, the processed files will be in your bucket but under the "processed"+date directory:
Now let's query data!
Query Iceberg Tables in Hue and Cloudera Data Visualization
Now that the data is ingested, let's go back to Hue and select the data from the table stocks.stock_intraday_1min, for example:
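Any query over the table works at this point; a simple aggregation (a sketch, using the columns defined earlier) could be:
SELECT ticker, max(refreshed_at) AS last_load, sum(volume) AS total_volume
FROM stocks.stock_intraday_1min
GROUP BY ticker;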
In Cloudera Data Visualization, I can also select this table to create a new Dataset called "Stocks" and create visualizations:
For example, stocks by volume:
Also, you can use Cloudera CDP tools to ingest data from other sources and create your own stock analyzer platform!
Iceberg Advanced Features
Apache Iceberg delivers the ability to:
Time Travel
Schema Evolution
Partition Evolution
And many other capabilities you can benefit from. It's also engine-independent and uses the optimizations that each engine already implements natively. A schema change, for example, is a simple DDL statement, as sketched below.
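As an illustration of schema evolution (not part of this example's pipeline), adding a column to the Iceberg table is a metadata-only operation; the column name below is hypothetical:
ALTER TABLE stocks.stock_intraday_1min ADD COLUMNS (exchange STRING);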
Our example loads the intraday stock data daily, since the free API does not provide real-time data, but we can change the Cloudera DataFlow parameter to add one more ticker, and the CDE process is already scheduled to run periodically. After this, we will be able to see the new ticker's information in the dashboard and also perform time travel using Iceberg!
Go to DataFlow, and in the Dashboard, click your deployed flow and then Manage Deployment:
Now click Parameters:
Scroll down and change the stock_list to add the new ticker. I'm adding the NVDA ticker, but you can choose another one; after this, click Apply Changes:
The flow will be redeployed and will keep executing on its schedule, so you can check later whether the new ticker has been processed and loaded into the Iceberg table by our periodically scheduled CDE process.
After some minutes (around 10, which is our schedule), you can check the table snapshots with the following query:
DESCRIBE HISTORY stocks.stock_intraday_1min;
In my case, the Spark process has executed a few times, and I can see the snapshot for each execution:
I'll query the tickers that I had before the last snapshot with the query below. Change the snapshot id to the value that you got from the first query:
SELECT count(*), ticker
FROM stocks.stock_intraday_1min
FOR SYSTEM_VERSION AS OF <snapshotid>
GROUP BY ticker;
Now let's query the table without the snapshot id:
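This is the same aggregation as above, simply without the snapshot clause:
SELECT count(*), ticker
FROM stocks.stock_intraday_1min
GROUP BY ticker;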
We can see that NVDA is reflected in the latest snapshot!
Summary
We've created a full ingestion and processing pipeline in a few clicks. That's the power of Cloudera Data Platform: an end-to-end use case that can be deployed easily by supplying only parameters.
You can extend this work with Cloudera Machine Learning; there's an example in this blog, where a few changes are needed to point to the table that we've created.
Lastly, we've seen a little of the power of Apache Iceberg, already integrated into Cloudera Data Platform in Public Cloud.
... View more
02-01-2022
01:52 PM
Hi @fedesardo, As per the link I provided, Python is supported via Livy (which supports Spark, Spark SQL, PySpark, PySpark3, and SparkR). Note: PySpark and associated libraries require Python version 2.7 or later, or Python version 3.4 or later, installed on all nodes. So you can use Python this way; the standalone Python interpreter isn't supported, but you can still use Python on the platform via Livy. Hope this helps.
... View more
01-31-2022
08:02 PM
Hello, You can take a look at: https://docs.cloudera.com/runtime/7.2.2/using-zeppelin/topics/configuring_and_using_zeppelin_interpreters.html Basically, you need to configure the Python version that you want on the nodes and then set the right configs (like the path) in Zeppelin.
... View more
10-02-2021
09:53 AM
2 Kudos
Introduction
In this article, I'll show how to stream data into CDP Public Cloud using Cloudera Dataflow/Streaming Datahub and query the data using Cloudera Data Warehouse.
Pre-Requisites
For this exercise you'll need Cloudera Data Platform with:
Cloudera Data Warehouse;
Datahub Flow Management;
Datahub Streams Messaging;
This exercise and flow are based on the sensor data pipeline in the edge2ai-workshop, but in a modified version that I'll share in another repository.
1. Create the Streaming Table in Cloudera Data Warehouse
For this exercise we need two virtual warehouses:
A Hive Virtual Warehouse: Used only to perform the compaction process (streaming data ingest). For this example, I've named it latam-hive;
A Unified Analytics Virtual Warehouse: Used for querying data and for visualization. Unified Analytics is a very exciting new feature for CDW customers at no extra fee! You can learn more here. For this example, I've named it latam-vw;
Figure 1: Virtual Warehouses used on this exercise.
Now I can access the Hue interface in latam-hive VW to create the table used to store the sensor data that we'll be streaming:
create database streaming;
CREATE TABLE `streaming`.`sensors`(
`sensor_ts` timestamp,
`sensor_id` double,
`sensor_0` double,
`sensor_1` double,
`sensor_2` double,
`sensor_3` double,
`sensor_4` double,
`sensor_5` double,
`sensor_6` double,
`sensor_7` double,
`sensor_8` double,
`sensor_9` double,
`sensor_10` double,
`sensor_11` double)
CLUSTERED BY (sensor_ts) INTO 32 BUCKETS;
2. Getting the SDX Configuration and Copying It to the NiFi Nodes
Before configuring the flow in NiFi, we'll need to upload some configuration files to the NiFi nodes.
2.1.1. In the CDP Console, go to Environment -> <YourEnvironment> and then click "Data Lake";
2.1.2. Next to the data lake name, in the right menu, click "View Client Configuration URLs";
2.1.3. Download the "Hive Metastore" configuration; this will be a zip file containing the configuration files; unzip it;
2.1.4. Copy (via scp, for example) the files core-site.xml, hdfs-site.xml, and hive-site.xml to the /tmp folder of each NiFi node in your Data Hub environment. Since I have only one node for this example, I just need to do this once (e.g., scp hdfs-site.xml hive-site.xml core-site.xml <cdpworkloadusername>@<publicnifinodeip>:/tmp) and make all files readable on each node (e.g., chmod a=r hdfs-site.xml);
3. Configuring the NiFi Streaming Flow
Now we'll use the NiFi streaming flow to simulate the sensor data and send it via Hive Streaming to the table, using the Hive Metastore located in the SDX platform in CDP.
First, access NiFi in the Flow Management Data Hub:
Figure 2: Open NiFi in Data Hub Cluster
Now we'll upload the NiFi template located on GitHub. This template is based on the edge2ai workshop, but it has been changed to generate the data randomly directly in the flow rather than using MiNiFi. You can get the flow template here.
From the top menu of the NiFi canvas, select "Process Group" and drag and drop it onto the empty canvas. A new menu will appear; select Browse to upload the template that you've just downloaded, and then click ADD.
Figure 3: Process Group Streaming Created
After this, you can double-click the Streaming Process Group and see that there are two more Process Groups inside:
1. IoT Data Generator: Used to simulate sensor data (with random errors) and publish it to a Kafka topic.
2. Kafka to Hive: Used to consume the Kafka topic produced by the first Process Group and stream the data into the table that we've created.
3.1 - Configuring IoT Data Generator Group
Double click in the "IoT Data Generator" group and we'll need to update some configuration to make it work:
3.1.1. In the Operate menu inside the Process Group
click in the engine to configure the "Controller Services":
First, click on the lightning button in "JsonRecordSetWriter" and "JsonTreeReader" controllers and enable both controllers;
There'll be two controllers called "Default NiFi SSL Context Service", but one is on an "Invalid" state. Click in the right on the "Arrow" icon and then click on "Remove" button to remove this invalid service ;
At the end you should have this:
3.1.2 Now, close this screen, and in the IoT Data Generator group, double-click on the " PublishKafkaRecord_2 " Processor and update the following configuration in the Properties tab:
"Kafka Brokers": Change the value to the DNS of your Kafka DNS/port where the data will be sent. Example: messaging-broker:9093, if you're using a Streams Messaging Data Hub, this can be easily located in Streams Messaging Manager;
"Kerberos Principal": The principal of your user, you can obtain it via SSH in a NiFi node using your CDP User/password and perform a kinit/klist. More information is available here.
"Username": Your workload username;
"Password": Your workload password;
"SSL Context Service" : Select "Default NiFi SSL Context Service" in the drop-down menu;
Apply the changes and this will close the configuration;
At the end, your flow may look like this:
Figure 4: IoT Generator flow
Now go back to the initial "Streaming" group using the bottom-left menu,
and we can configure the next Process Group to consume the messages and stream them to our table.
3.2. Configuring Kafka to Hive Group
Double-click in the "Kafka to Hive" group and we'll need to update some configuration to make it work: 3.2.1. In the Operate menu inside the Process Group
click in the engine to configure the 'Controller Services':
Click in the lightning button in "JsonRecordSetWriter" and "JsonTreeReader" controllers and Enable both controllers;
In the end, you should have this:
3.2.2. Now close this screen and, in the Kafka to Hive group, double-click the "ConsumeKafka_2_0" Processor and update the following configuration in the Properties tab:
"Kafka Brokers": Change the value to the DNS name/port of your Kafka broker where the data will be sent, for example, messaging-broker:9093. If you're using a Streams Messaging Data Hub, this can be easily located in Streams Messaging Manager;
"Kerberos Principal": The principal of your user; you can obtain it by SSHing into a NiFi node with your CDP user/password and running kinit/klist. More information at https://github.com/asdaraujo/cdp-examples#using-kerberos-authentication;
"Username": Your workload username;
"Password": Your workload password;
"SSL Context Service" : Select "Default NiFi SSL Context Service" in the drop down menu;
Apply the changes and this will close the configuration;
3.2.3. The last processor to configure is the "PutHive3Streaming " processor, double-click in this processor and configure:
"Hive Metastore URI": Change the value to the DNS of your Data Lake Master Node DNS/port, ex: thrift://master-node:9083, this can be located in CDP UI;
"Hive Configuration Resources": Check if the paths are valid since it can change, for this, you can ssh in a NiFi node and check the configuration;
"Database Name": streaming (or the name of the database that you've chosen to create);
"Table": sensors (or the name of the table that you've defined);
"Kerberos Principal" : Your workload username;
"Kerberos Password" : Your workload password;
Apply the changes and this will close the configuration;
In the end, your flow may look like this:
Figure 5: Kafka to Hive Flow
Leave the group again and start both Process Groups; you can right-click each one and click the Start button.
Figure 6: Flow in Action!
4. Query the Streaming Data in Cloudera Data Warehouse
Now we can simply see the streaming data directly in the Unified Analytics Virtual Warehouse and/or connect Cloudera Data Visualization or a dashboard via JDBC/ODBC to visualize the data:
Figure 7: Query the streaming sensor data in Cloudera Data Warehouse
And we can monitor in real time that the data is increasing, for example with a simple count (see the query after the figures below):
Figure 8: First Count
Figure 9: Second Count
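A count like the sketch below, run repeatedly in Hue against the table created earlier, is enough to watch the row count grow:
SELECT count(*) AS total_rows
FROM streaming.sensors;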
Lastly, we can connect Cloudera Data Visualization directly to the table being ingested and see how quickly we can drive value from this data:
Summary
In this blog post, we've seen how to create:
A flow that generates random sensor data, sends the messages to a Kafka topic, consumes the topic, and streams the data into a table;
A way to query this data using Cloudera Data Warehouse;
More details on each concept covered in this post can be found in:
Streaming: Stream data into HIVE like a Boss using NiFi HiveStreaming - Olympics 1896-2008
Compaction: Data compaction
... View more
07-29-2021
12:10 PM
Hi @duhizjame, As you may have seen, network issues are the most common problem, since the process depends on other variables. After reading your description, I understand that you're stuck in the download/distribute phase; normally this happens because of insufficient disk space, since the process needs to download all parcels and then use them to install. Since the parcels are already in /opt/cloudera/parcel-repo, that part of the process is OK. Do the logs in /var/log/cloudera-scm-server show anything? Regards, Luiz
... View more
06-28-2021
10:16 AM
Hi @Cisco94, Thanks! No, you don't need it, since the process does use the trial repository. After the 60-day trial, it won't be possible to access Cloudera Manager or manage the cluster, but the services/data will still be there. Since it's intended for learning purposes, you can quickly spin up a new trial VM again.
... View more
06-10-2021
02:09 PM
Accessing AWS Cloudera Data Warehouse to query data on Azure Cloudera Data Warehouse
Introduction
Cloudera Data Platform lets you work with different public cloud providers from a single console. With this, you can have a true hybrid environment with only one admin console.
Cloudera Data Warehouse is a public cloud service that allows fast analytics in your preferred cloud provider.
In this article, I'll show how easy it is to connect between two Virtual Warehouses located in different cloud providers using Cloudera Data Warehouse.
Scenario
We're using two different cloud providers for Cloudera Data Warehouse: one in AWS with TPC-DS data and another in Azure with the same TPC-DS data. We'll use Hive ACID to update the customer table on Azure and merge it with the customer table in AWS.
1.Pre-Requisites
1.1 - Cloudera CDP Control Plane Access and Register two environments
For this exercise, you will need access to the Cloudera Data Platform. More information can be accessed here.
Also, since we will use two environments (AWS and Azure), we need to register the environments on the CDP control plane.
For AWS, refer to Introduction to AWS environments
For Azure, refer to Introduction to Azure environments
Figure 1: Environments Used.
Now we can set up the virtual warehouses for each environment that will work with the data.
2.1 - Create two Virtual Warehouses
After the automatic environment setup, we can activate the environments in Cloudera Data Warehouse:
Figure 2: Environments activated on Cloudera Data Warehouse experience.
And create our virtual warehouses
Figure 3: Data warehouses created in different environments
Now we can go to the next part, i.e., start our analysis across those two environments. We'll use the JDBC Storage Handler to communicate from one environment to the other.
2. Prepare database/table
For each Virtual Warehouse, we've uploaded the TPC-DS data to the bucket and created the tables.
2.1 - Change address for customer on Azure using ACID features
For the customer table, we want to change a customer's address with a new record in the Azure environment and reflect it in the AWS environment.
First, access Hue on the Azure Cloudera Data Warehouse from the Cloudera Data Warehouse UI and execute the next steps:
Figure 5: Open Hue in Azure Cloudera Data Warehouse
Now we can run the select for the record that we want to change:
select c.c_current_addr_sk, ca.ca_street_name, ca.ca_country
from tpcds.customer c, customer_address ca
where c.c_current_addr_sk = ca.ca_address_sk
and c.c_customer_sk = 11316001;
Figure 6: Address that we want to change.
First, we will insert a new record into the customer_address table. For this, we need to find the last key used, so we won't collide with any existing address (see the sketch after the figure):
Figure 7: Max ID
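The lookup behind Figure 7 is just a MAX over the surrogate key, something like:
SELECT max(ca_address_sk) AS max_address_sk
FROM tpcds.customer_address;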
Now that we have this, we can insert the new address:
insert into tpcds.customer_address
values (6000001, "AAAAAAAACGICKEAA", "5470", "Great America", "Pkwy", NULL,
"Santa Clara", "Santa Clara County", "CA", "95054", "United States", -7.00,
"Business");
After this, we can update the customer record with the new address id (sketched below) and check it with the same query that we ran first:
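A minimal sketch of that update, assuming the new address id 6000001 inserted above and the customer key 11316001 from the first query:
UPDATE tpcds.customer
SET c_current_addr_sk = 6000001
WHERE c_customer_sk = 11316001;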
Figure 8: Updated Address in Azure
2.2 - Create the External JDBC Table to connect from AWS Cloudera VW to Azure Cloudera VW
Now that we have the data in Azure, let's access Hue in the Cloudera Data Warehouse created on AWS, using the same method we used to access Hue in Azure.
In this example, we already have the same schema/tables created in this environment, with the data stored in S3 instead of ADLS.
Figure 9: Schema of tables in AWS Cloudera Environment
Now, in this AWS environment, we want to create customer and address tables pointing to the tables located in the Azure Virtual Warehouse:
Creating Customer Azure External Table in AWS Cloudera VW:
For this step, we need the Azure Virtual Warehouse JDBC address, which we can get in the Cloudera Data Warehouse UI with the Copy JDBC URL button:
Figure 10: Copy JDBC URL from Azure Cloudera Virtual Warehouse
Now we can execute the following script to create the JDBC tables (please change the "hive.sql.jdbc.url" value to the JDBC address of your Virtual Warehouse, and change the user/password to your CDP user/password):
use tpcds;
CREATE TABLE `customer_azure`(
`c_customer_sk` int,
`c_customer_id` string,
`c_current_cdemo_sk` int,
`c_current_hdemo_sk` int,
`c_current_addr_sk` int,
`c_first_shipto_date_sk` int,
`c_first_sales_date_sk` int,
`c_salutation` string,
`c_first_name` string,
`c_last_name` string,
`c_preferred_cust_flag` string,
`c_birth_day` int,
`c_birth_month` int,
`c_birth_year` int,
`c_birth_country` string,
`c_login` string,
`c_email_address` string,
`c_last_review_date` string)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "MYSQL",
"hive.sql.jdbc.driver" = "org.apache.hive.jdbc.HiveDriver",
"hive.sql.jdbc.url" = "<CHANGEMEwithJDBC>",
"hive.sql.dbcp.username" = "<CDPUSERNAME",
"hive.sql.dbcp.password" = "<CDPWORKLOADPASSWORD>",
"hive.sql.query" = "SELECT c_customer_sk, c_customer_id, c_current_cdemo_sk, c_current_hdemo_sk, c_current_addr_sk, c_first_shipto_date_sk, c_first_sales_date_sk, c_salutation, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_day, c_birth_month, c_birth_year, c_birth_country, c_login, c_email_address, c_last_review_date_sk from tpcds.customer",
"hive.sql.dbcp.maxActive" = "1"
);
CREATE TABLE `customer_address_azure`(
`ca_address_sk` int,
`ca_address_id` string,
`ca_street_number` string,
`ca_street_name` string,
`ca_street_type` string,
`ca_suite_number` string,
`ca_city` string,
`ca_county` string,
`ca_state` string,
`ca_zip` string,
`ca_country` string,
`ca_gmt_offset` float,
`ca_location_type` string)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "MYSQL",
"hive.sql.jdbc.driver" = "org.apache.hive.jdbc.HiveDriver",
"hive.sql.jdbc.url" = "<CHANGEMEwithJDBC>",
"hive.sql.dbcp.username" = "<CDPUSERNAME",
"hive.sql.dbcp.password" = "<CDPWORKLOADPASSWORD>",
"hive.sql.query" = "SELECT ca_address_sk,ca_address_id,ca_street_number,ca_street_name,ca_street_type,ca_suite_number,ca_city,ca_county,ca_state,ca_zip,ca_country,ca_gmt_offset,ca_location_type from tpcds.customer_address",
"hive.sql.dbcp.maxActive" = "1"
);
Figure 11: Azure JDBC Tables created in AWS Cloudera Virtual Warehouse
Now we can query these tables in AWS Cloudera Virtual Warehouse:
select c.c_current_addr_sk, ca.ca_street_name, ca.ca_country
from tpcds.customer_azure c, tpcds.customer_address_azure ca
where c.c_current_addr_sk = ca.ca_address_sk
and c.c_customer_sk = 11316001;
Figure 12: Results from Azure Cloudera DW environment in AWS Cloudera DW Environment.
Just to validate, we can run the same query in the AWS Cloudera environment against the original tables (see below); the result is different:
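That is, the same lookup as before, but against the local tpcds.customer and tpcds.customer_address tables:
select c.c_current_addr_sk, ca.ca_street_name, ca.ca_country
from tpcds.customer c, tpcds.customer_address ca
where c.c_current_addr_sk = ca.ca_address_sk
and c.c_customer_sk = 11316001;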
Figure 13: Results from AWS Cloudera VW.
3. Update Data on the AWS Cloudera Data Warehouse Using the Azure Cloudera Data Warehouse Tables
Now we can move to the last, and easiest, step of this article. We will use Hive ACID features to refresh the data from the source tables.
3.1 Use ACID with MERGE syntax to upsert the Customer and Customer Address tables:
Hive ACID provides a powerful option to perform upserts, which can also be used for Slowly Changing Dimensions; refer to Update Hive Tables the Easy Way Part 2.
First, we will update the AWS customer table based on the results of the customer_azure table:
use tpcds;
merge into
tpcds.customer
using
tpcds.customer_azure as caz
on
customer.c_customer_sk = caz.c_customer_sk
and caz.c_customer_sk = 11316001
when matched then
update set c_current_addr_sk = caz.c_current_addr_sk;
Figure 14: Updating Customer table in Cloudera AWS Data Warehouse using Azure Cloudera Data Warehouse as source
Note that we are not inserting a record when there is no match, and we're not updating other fields, since we only want to demonstrate the address change in this example; both are entirely possible, though. Also, in the ON clause, we're restricting customer_sk so that only one record matches, just for this example.
merge into
tpcds.customer_address
using
tpcds.customer_address_azure as cadaz
on
customer_address.ca_address_sk = cadaz.ca_address_sk
and cadaz.ca_address_sk = 6000001
when not matched then
insert values (cadaz.ca_address_sk,cadaz.ca_address_id,cadaz.ca_street_number,cadaz.ca_street_name,cadaz.ca_street_type,cadaz.ca_suite_number,cadaz.ca_city,cadaz.ca_county,cadaz.ca_state,cadaz.ca_zip,cadaz.ca_country,cadaz.ca_gmt_offset,cadaz.ca_location_type);
Figure 15: Insert new data into Customer Address table in Cloudera AWS Data Warehouse using Azure Cloudera Data Warehouse as source.
Now that we have updated/inserted new data, we can check the data on AWS with the same query that we've executed in Azure:
select c.c_current_addr_sk, ca.ca_street_name, ca.ca_country
from tpcds.customer c, customer_address ca
where c.c_current_addr_sk = ca.ca_address_sk
and c.c_customer_sk = 11316001;
Figure 16: Fresh data into AWS Cloudera Data Warehouse environment with the same view as Azure Cloudera Data Warehouse environment.
4. Conclusion and Going Further
With this, we've demonstrated how to access an Azure Cloudera Data Warehouse environment from an AWS Cloudera Data Warehouse environment and use Hive ACID features to upsert the data.
Going further, this can serve as a hybrid multi-cloud strategy where one Cloudera environment is used for Machine Learning and the other for Data Warehousing (or as a DEV/PROD strategy). Also, the data/metadata that we've created can be accessed from other experiences, like Data Engineering, Cloudera Machine Learning, DataFlow, and Data Hubs, for a complete end-to-end scenario.
We can also use Cloudera Data Engineering with Airflow to schedule the refresh, so it can be done periodically.
Refer to Automating data pipelines using Apache Airflow in Cloudera Data Engineering
5. Bonus: Using Impala and Cloudera Viz to present the ACID Data
With Cloudera Viz in an Impala Virtual Warehouse, using the same AWS environment from the steps above, we can create the data model:
Figure 17: Data modeling on Cloudera VIZ under an Impala AWS Data Warehouse
Creating the dashboard is pretty easy, since we can use the options that Viz shows automatically based on the data types:
Figure 18: Dashboard Customer Address Creation.
And we can filter the data to see if the city that we've added is there:
Figure 19 : Updated ACID data read via Impala.
6. Summary
We've covered a lot of concepts in this blog post, such as:
How to operate in a Hybrid Cloud Warehouse scenario using Cloudera Data Platform and Cloudera Data Warehouse
Hive ACID features
Cloudera Viz
Impala ACID read
More details on each feature can be found in this community. Stay tuned for more posts!
... View more
05-06-2021
08:42 AM
Hi @MiamiDataEng, To access the VM via SSH, you just need to use the Vagrant command: $ vagrant ssh There's no need to enter a username and password. If CM was installed correctly, you'll be able to access it in your browser via the URL http://localhost:7180. The initial user/password is admin/admin. Thanks, Luiz
... View more
03-04-2021
08:37 AM
Hi @Ant5566, Can you take a look at /opt/cloudera/cm-agent/service/hive/hive.sh, especially the TEZ_JARS classpath? Let me know whether the path exists and is correct; if not, it may be necessary to change it to the correct location and then restart the services. Thanks, Luiz
... View more
10-02-2020
09:09 AM
Hi @vijaypabothu, this looks like a network error. Can you check with your network team whether the ports are open?
... View more
09-26-2020
10:51 AM
This seems related to https://community.cloudera.com/t5/Support-Questions/zookeeper-error-Unexpected-exception-causing-shutdown-while/td-p/30914 I would also try to identify which application is causing this. Regards, Luiz
... View more
09-24-2020
07:26 AM
Hi @vijaypabothu, This error means that the node is in an invalid state, but it doesn't show the root cause; more debugging is needed. Do the ZooKeeper logs show anything? I would also look at https://community.cloudera.com/t5/Support-Questions/hiveserver2-alway-shut-down/td-p/155439 Regards, Luiz
... View more
09-23-2020
11:38 AM
Hi @vijaypabothu, First, check whether the Hive Server is running and whether it's listening at the address 10.83.35.142 on port 10000. This message indicates that the Hive Server is down or Hue can't reach it. Luiz
... View more
09-21-2020
01:50 PM
Introduction
The Cloudera Data Warehouse architecture leverages compute/storage separation, which is different from the standard Hadoop architecture.
Figure 1: Cloudera Modern Data Warehouse Architecture
The objective of this post is to show how to import the wide range of functions and code available in the open-source community into Cloudera Data Warehouse using the object storage architecture. For this, we'll use the ESRI Spatial Framework as an example.
Prerequisites
We'll use GitHub to download the ESRI project, and Java and Maven to build the necessary JAR files.
Step 1: Download the files from ESRI Github repository
Download the necessary files from the ESRI Spatial Framework GitHub repository; this can be done using the following command:
$ git clone https://github.com/Esri/spatial-framework-for-hadoop.git
Figure 2: Cloning ESRI project
This will create a directory called "spatial-framework-for-hadoop"; enter this directory to build the project and generate the JAR files that will be used for the functions.
Step 2: Build the project using Maven
To build the project using Apache Maven:
Install Maven from the Maven website according to your OS.
Within the ESRI GitHub project directory, you can run the build with: $ mvn package
After a successful run, you should see something like this:
Figure 3: Building ESRI project
Step 3: Copy the JAR files to the Cloudera Data Warehouse Object Storage
After creating the JAR files containing the functions that will be used, copy them to the object storage that is being used. In this example, we're using AWS S3.
You can use the same bucket that is being used by Cloudera Data Warehouse for External Data or add in another bucket. For more information, see Adding access to external S3 buckets for Cloudera Data Warehouse clusters on AWS.
The build creates the JAR file that needs to be uploaded to object storage: spatial-sdk-hive-2.1.1-SNAPSHOT.jar, located in <path/to/githubproject>/spatial-framework-for-hadoop/hive/target.
In my example, I've created a jars folder in my bucket and uploaded the file using the AWS S3 Console upload tool.
Upload the JAR to the object storage bucket:
Figure 4: Upload JAR File into the object storage.
File uploaded:
Figure 5: JAR uploaded in the object storage.
Step 4: Create the Functions
Now that the JAR file is in object storage, you just need to create the functions inside Cloudera Data Warehouse pointing to the uploaded JAR.
In the Virtual Warehouse, in DAS or Hue, you can use the following syntax to create the functions (this example creates the ST_Geometry function): CREATE FUNCTION ST_Geometry AS 'com.esri.hadoop.hive.ST_Geometry' USING JAR 's3a://<BucketName>/warehouse/tablespace/external/jars/spatial-sdk-hive-2.1.1-SNAPSHOT.jar';
For more CREATE FUNCTION statements for ESRI, you can visit my GitHub link.
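For instance, the ST_Point and ST_AsText functions used in the test below can be registered the same way; the class names here follow the project's com.esri.hadoop.hive.* convention, so check the repository if they differ in your build:
CREATE FUNCTION ST_Point AS 'com.esri.hadoop.hive.ST_Point' USING JAR 's3a://<BucketName>/warehouse/tablespace/external/jars/spatial-sdk-hive-2.1.1-SNAPSHOT.jar';
CREATE FUNCTION ST_AsText AS 'com.esri.hadoop.hive.ST_AsText' USING JAR 's3a://<BucketName>/warehouse/tablespace/external/jars/spatial-sdk-hive-2.1.1-SNAPSHOT.jar';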
Step 5: Test the Functions
Now the functions are ready to be used. Run the following to test that everything is working: SELECT ST_AsText(ST_Point(1, 2));
Figure 6: Functions working
Summary
In this article, we saw how easy it is to import the vast ecosystem of open-source functions into Cloudera Data Warehouse, using the ESRI Spatial functions specifically.
For more information on how to use ESRI functions in Cloudera Data Platform, you can check Geo-spatial Queries with Hive using ESRI Geometry and Spatial Framework for Hadoop or Esri/gis-tools-for-hadoop.
... View more
08-08-2020
09:14 AM
Hi @AkhilTech, The primary problem is that the host isn't in a healthy state; this is evaluated before provisioning the cluster. This can happen if there aren't enough resources to provision CDP. If there are enough resources, you can check whether the cloudera-scm-agent service is configured properly:
- Check whether the file /etc/cloudera-scm-agent/config.ini has the lines server_host= and listening_ip= matching the results of the commands "hostname" and "host cloudera".
- Check whether the cloudera-scm-agent process is running: "sudo service cloudera-scm-agent status".
- Restart the cloudera-scm-agent with "sudo service cloudera-scm-agent restart".
After this, you can check the logs for error messages inside the VM:
- /var/log/cloudera-scm-server/cloudera-scm-server.log
- /var/log/cloudera-scm-agent/cloudera-scm-agent.log
Especially in the cloudera-scm-agent log, look at the last lines, which should contain the error messages explaining why your host isn't healthy. You can also use Cloudera Manager to identify what's happening and restart the agent services: if you click the top-left Cloudera logo, you should see the initial page, and you can go to "Hosts" to see what's happening. Click the top-left Cloudera Manager icon, click Hosts --> All Hosts, and click on the host cloudera; if everything is healthy, something like this should appear: If not, first try to restart the Cloudera Management Service (Actions -> Restart) and wait to see the results. If there are errors, you can follow the messages to the logs and see what may be causing them. If it comes up, you can run the provisioning Python script again. Let me know if this helps, Thanks, Luiz
... View more
08-07-2020
06:17 AM
Hi Akhil, It seems that the download wasn't completed and the host isn't in a healthy state. Can you log in to http://localhost:7180 with admin/admin? If yes, can you check under Hosts what the status of the host is? Thanks, Luiz
... View more
08-06-2020
11:58 AM
If you have audit logs configured, you can check in Cloudera Navigator; more information here. It looks like something was removed from the directory. I recommend checking in the logs whether this is the case, or checking what happened to the file that's giving the error: /app/abc/footable/tablename/account/orgid=abcd/batch_id=NWMISSPAYWRADJ/aa4fbef1c0bb3fd5-85012b8600000018_1953707135_data.0.parq Check whether this file exists in HDFS; if it was removed externally, that's the cause of the error, and you can resolve it by restarting the Impala service when possible.
... View more
08-06-2020
10:02 AM
What version do you have? Can you see whether something was deleted in the audit logs? Also, try restarting the Impala services to see whether this resolves the issue.
... View more
08-03-2020
05:41 PM
8 Kudos
Introduction
Cloudera Data Platform Base doesn't have a Quickstart/Sandbox VM like the ones for the CDH/HDP releases, which helped a lot of people (including me) learn more about the open-source components and see the improvements from the community in CDP Runtime.
The objective of this tutorial is to create a VM from scratch via some automation (shell script and Cloudera template) that can help whoever wants to use and/or learn Cloudera CDP in a Sandbox/Quickstart-like environment on your own machine.
Pre-Requisites
This exercise is performed on macOS, but you can install Vagrant/VirtualBox on Windows/Linux machines (https://www.vagrantup.com/docs/installation). The versions below were tested at the moment of writing this blog and may change in the future. The machine needs to have at least:
80 GB of free disk space;
12 GB of free RAM;
8 free VCPUs;
A good internet connection;
Install Virtualbox and Vagrant
These are the software packages that we'll use to run our virtualized environment. To download and install VirtualBox and Vagrant, execute the following commands on your host machine:
For Mac
$ brew cask install virtualbox
$ brew cask install vagrant
$ brew cask install vagrant-manager
The manager is optional and can be used to manage your virtual machines from the menu bar.
For Windows
Download Virtualbox here and Vagrant here and install the files. Also, take a look at this instruction regarding the hypervisor.
For Linux
Follow the Virtualbox and Vagrant instructions to install them on your Linux version.
Step 1: Vagrant Centos 7 Virtual Machine Setup with CDP
Download the CentOS VM and the files necessary for setup into an empty folder. In this example, I'll download into the "~/cdpvm/" folder. On your host machine, execute the following commands:
$ cd ~
$ mkdir cdpvm
$ cd cdpvm
$ wget https://cloud.centos.org/centos/7/vagrant/x86_64/images/CentOS-7-x86_64-Vagrant-2004_01.VirtualBox.box
$ wget https://raw.githubusercontent.com/carrossoni/CDPDCTrial/master/scripts/VMSetup.sh
Go to the folder where you've downloaded your VM file (cd ~/cdpvm) and initialize the Virtual Machine using the following commands:
$ vagrant box add CentOS-7-x86_64-Vagrant-2004_01.VirtualBox.box --name centos7
$ vagrant plugin install vagrant-disksize
$ vagrant init centos7
After this step, you should have a file called "Vagrantfile" in the same directory. Open the file with an editor (vim, for example) and, below the line config.vm.box = "centos7", add the following:
config.vm.network "public_network"
config.vm.network :forwarded_port, guest: 7180, host: 7180
config.vm.network :forwarded_port, guest: 8889, host: 8889
config.vm.network :forwarded_port, guest: 9870, host: 9870
config.vm.network :forwarded_port, guest: 6080, host: 6080
config.vm.network :forwarded_port, guest: 21050, host: 21050
config.vm.hostname = "localhost"
config.disksize.size = "80GB"
config.vm.provision "shell", path: "VMSetup.sh"
config.vm.provider "virtualbox" do |vb|
# Display the VirtualBox GUI when booting the machine
vb.gui = true
# Customize the amount of memory on the VM:
vb.memory = "12024"
vb.cpus = "8"
end
Save the file, and now we can bring up the VM:
$ vagrant up
Now it'll ask to bridge to your public network (only the first time); normally it's the one that you're connected to the internet with, in my case en0.
After this, the VM will be provisioned and the automated CDP process will start. This will take up to one hour depending on your connection, since it'll configure the VM and install all the components for Cloudera Manager and the services in an automated process located in https://github.com/carrossoni/CDPDCTrial/
The template and the cluster created at the end will contain the following services:
HUE
HDFS
Hive Metastore
Impala
Ranger
Zookeeper
After the install, you can add more services like NiFi, Kafka, etc., depending on the amount of resources that you've reserved for the VM.
After the execution, you should see the output below (this will take about 30 minutes to one hour depending on your connection, since it'll download all the packages and parcels necessary for provisioning CDP Runtime):
After this, the VM will reboot for a fresh start. Wait around 5 minutes for the services to spin up and go to the next step.
Troubleshooting:
If the install process failed, it's likely a problem during the VM configuration. If CM was installed, you can try going to http://localhost:7180 directly and finish the install process manually via the Cloudera Manager UI.
To SSH into the VM, there are two options. The easy one is to simply go to the directory where the Vagrantfile is located (the one you used to set up the VM) and type:
$ vagrant ssh
The other option is to configure your VM in the VirtualBox UI to attach a USB device and copy the clouderakey.pem file that was created during the automation process. Then you can SSH into the machine via "ssh -i clouderakey.pem vagrant@cloudera".
After SSHing in (in either scenario), you can sudo on the box and start inspecting the machine; check whether the hostname and IP in /etc/hosts are configured properly (the most common issue, since it depends on your machine's network).
If you get an error message after the template import, Cloudera Manager can show what's happening; work on the error and then resume the import cluster template process in the running commands tab. If you are at this step, it's normally a matter of viewing logs and/or checking whether resources are available; in the end, you can restart the cluster to see if something was simply stuck. This is normal, since we are working in a constrained environment.
Step 2: Cloudera Data Platform Access
After the automated process, our CDP Runtime is ready (we've actually provisioned it in only one step)! In your machine's browser, you can connect to CM with the following URL: http://localhost:7180
The user/password will be admin/admin. After the first login, you can choose the 60-day trial option and click "Continue". The Welcome page appears; click the Cloudera logo on the top left, since we've already added a new cluster with the automated process.
At this point, all the services are started. Some errors may happen since we are working in a resource-constrained environment; just follow the logs, as it's easy to see in Cloudera Manager what's happening, and you can also suppress warning messages if they're not critical. We have our environment ready to work and learn more about CDP!
HUE and Data Access
You can log in to Hue at the URL http://localhost:8889/hue; the first time, we will use the user admin/admin, which will be the admin user for HUE.
For example, I'll upload data from the California COVID-19 Testing dataset that I've downloaded to my machine. In HUE, go to the left panel and choose "Importer" → Type = File, choose the /user/admin directory, click "Upload a file", choose your file (statewide_testing.csv), and then "Open". Now click the file that you've uploaded, and this will go to the next step.
Click Next, and HUE will infer the table name, field types, etc.; you can change them or leave them as is and click "Submit".
At the end, you should see the job succeed; close the job status window and click the Query button. Now that we have our data, we can query it with Impala SQL!
(Optional) Ranger Security Masking with Impala Example
To start using/querying the environment with the system user/password that we've created (cloudera/cloudera), first we need to go into Ranger and allow access to this user. Click the Ranger service and then Ranger Admin Web UI.
Now we have the initial Ranger screen. Log in with the user/password admin/cloudera123.
In the HADOOP SQL section, click the Hadoop SQL link. We will create a new policy to allow access to the new table, but with the tested column shown in masked format with null results. For that, click the Masking tab and then Add New Policy with the following values:
Click the Add button, and now go back to the Access tab and click the Add New Policy button with the following parameters:
Click the Add button, and now our user should be ready to select only the data in this table, with the masked values.
First, we'll configure the user in HUE: in the left panel, click the initial button and then "Manage Users". Click "Add User", and in username put cloudera with the password cloudera; you can skip steps 2 and 3 by clicking directly on Add user.
Log out from HUE and log in with our newly created user, go to the query editor, and select the data again. You should see the masking policy in action!
Summary
In this blog we've learned how to:
Set up a Vagrant CentOS 7 machine with VirtualBox and CDP packages
Configure CDP-DC for the first run
Configure data access
Set up simple security policies with the masking feature
You can play with the services, install other parcels like Kafka/NiFi/Kudu to create a streaming ingestion pipeline, and query in real time with Spark/Impala. Of course, for that you'll need more resources, and this can be changed at the beginning during the VM configuration.
... View more
05-17-2020
02:58 PM
2 Kudos
Introduction
How do we quickly gain insight and start working with data in a secure, governed, and scalable environment in the cloud?
This article explains how to achieve this using the Cloudera Data Warehouse platform connected with Apache Superset.
Cloudera Data Warehouse in CDP (Cloudera Data Platform) is an enterprise solution for modern analytics. It's an auto-scaling, highly concurrent, and cost-effective hybrid, a multi-cloud analytics solution that ingests data anywhere, at massive scale, from structured, unstructured, and edge sources.
Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application.
Pre-Requisites
This exercise is performed on macOS. The versions below were tested at the time of writing this article and may change in the future:
Python
Python 3.7.5
pip 20.0.2
After Python/pip installation, install the following packages/versions in Python (we recommend using venv before this step):
impyla==0.16.1
thrift==0.13.0
thrift_sasl==0.2.1
Apache Superset Configuration
Apache Superset can be installed on your machine or executed in a Docker environment. In this example, we will use the steps provided for installing in a Python virtualenv, and the version is:
apache-superset==0.999.0.dev0
After setting up the environment, you can access the Superset UI at the following address:
http://127.0.0.1:8088/
Figure 1: Welcome to Apache Superset
The default username/password is admin/admin.
Cloudera Data Warehouse
If you don't have an Impala Virtual Warehouse (used in this example), you need to create one that will connect to the Database Catalog. This is a very simple step and can be done in minutes. Once you have created a Virtual Warehouse, if your Database Catalog already has the tables, security, and metadata definitions to be used, you or the user/application (in our case, Apache Superset) can start using the platform. More information can be obtained at this link.
Figure 2: Cloudera Data Warehouse
Here, we will be using the "default-impala" Virtual Warehouse. Since the environment is not running and nobody is using it, it is not consuming any resources. After creating the Virtual Warehouse, you will need to collect the URL to connect to your environment, as in the following example:
Figure 3: Getting Access URL in Cloudera Data Warehouse
Once you save the access URL, you can configure the Dashboard in Apache Superset.
Configure Cloudera Data Warehouse as Source Database
After the prerequisites, we'll configure the connection in Apache Superset. To start creating the dashboard on Cloudera Data Warehouse, perform the following:
Click Source > Database in the top left menu:
Figure 4: Configuring Source Database
In the top right corner, click the "Add new record" button:
Figure 5: Add new database button
Now, we need to put the configuration in the following screen:
Figure 6: Configuring Database
Database Name: Choose a name, for example, "ClouderaDataPlatform"
SQLAlchemy URI: We'll use the access URL that we got in the Cloudera console; we need to customize the URI in order to use impyla and a URL supported by SQLAlchemy:
From: jdbc:impala://example-default-impala.env-pkXXXX.dwx.example.site:443/default;AuthMech=3;transportMode=http;httpPath=cliservice;ssl=1;UID=luizcarrossoni;PWD=PASSWORD
To: impala://example-default-impala.env-pkXXXX.dwx.example.site:443/default?auth_mechanism=PLAIN&http_path=cliservice&use_http_transport=True&use_ssl=True
Expose in SQL Lab: Checked
Allow Multi Schema Metadata Fetch: Checked
Extra: Here, we'll pass our Cloudera Data Platform access credentials (there are other, more secure ways to do this in Apache Superset):
{ "metadata_params": {}, "engine_params": { "connect_args": { "user": "<cdpuser>", "password": "<password>" } } }
After providing the configuration, click the Test button in the SQLAlchemy URI field to check whether everything is working properly. If the Virtual Warehouse is in a Stopped state, it will first start the Warehouse, and then you'll see that the test was successful:
Figure 7: Starting Virtual Warehouse
Figure 8: Connection Successful
Now you can save the connection and start creating your dashboards.
Query Data through SQL Lab
You can query the data in the Virtual Warehouse using SQL Lab in Superset:
Figure 9: Query Data in SQL Lab
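For example, a quick check against the dataset used later in this article (the ww_customers_data table; only the ccnumber column is referenced in the note below, so treat the query as a sketch):
SELECT ccnumber
FROM ww_customers_data
LIMIT 10;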
Note: Since the table supposedly contains PII information (ccnumber), the data comes back as hashes. This is because we have the following policy in place for the user:
Figure 10: Masking Policy
Create your Dashboard
To create the Dashboard using Apache Superset in Cloudera Data Platform, do the following:
Add the table as a source in the following menu:
Figure 11: Add Table Source
Add the ww_customers_data table to start creating the dashboard:
Figure 12: Create Source Table
Create charts using the source table that was created and use the charts in a dashboard:
... View more