About carrossoni

OlivierT · ‎10-12-2022

Nice article, but it no longer works. It fails to find the MariaDB repo because MariaDB 10 was archived.

steven-matison · ‎08-30-2022

Excellent work @carrossoni . I really see value in the time travel feature!!

carrossoni · ‎02-01-2022

Hi @fedesardo As per the link that I've provided, Python is supported via Livy: "Livy (supports Spark, Spark SQL, PySpark, PySpark3, and SparkR)". And: "Note: PySpark and associated libraries require Python version 2.7 or later, or Python version 3.4 or later, installed on all nodes." So the standalone Python interpreter indeed isn't supported, but you can use Python in the platform through Livy. Hope this helps.

carrossoni · ‎10-02-2021

Introduction

In this article, I'll show how to stream data into CDP Public Cloud using Cloudera Dataflow/Streaming Datahub and query the data using Cloudera Data Warehouse.

Pre-Requisites

For this exercise you'll need Cloudera Data Platform with:
Cloudera Data Warehouse;
Datahub Flow Management;
Datahub Streams Messaging.

This exercise and flow are based on the sensor data/pipeline in edge2ai-workshop, but in a modified version that I'll share in another repository.

1. Create the Streaming Table in Cloudera Data Warehouse

For this exercise we need two virtual warehouses:
A Hive virtual warehouse: used only to perform the compaction process (streaming data ingest). For this example, I've named it latam-hive;
A Unified Analytics virtual warehouse: used to query and visualize the data. Unified Analytics is a new feature available to CDW customers at no extra fee. You can learn more here. For this example, I've named it latam-vw.

Figure 1: Virtual Warehouses used in this exercise.

Now I can access the Hue interface in the latam-hive VW to create the table used to store the sensor data that we'll be streaming:

create database streaming;
CREATE TABLE `streaming`.`sensors`(
  `sensor_ts` timestamp,
  `sensor_id` double,
  `sensor_0` double,
  `sensor_1` double,
  `sensor_2` double,
  `sensor_3` double,
  `sensor_4` double,
  `sensor_5` double,
  `sensor_6` double,
  `sensor_7` double,
  `sensor_8` double,
  `sensor_9` double,
  `sensor_10` double,
  `sensor_11` double)
CLUSTERED BY (sensor_ts) INTO 32 BUCKETS;

2. Get the SDX Configuration and Copy It to the NiFi Nodes

Before configuring the flow in NiFi, we'll need to upload some configuration files to the NiFi nodes.

2.1.1. In the CDP Console, go to Environment -> <YourEnvironment> and then click on "Data Lake";
2.1.2. Next to the data lake name, in the menu on the right, click on "View Client Configuration URLs";
2.1.3. Download the "Hive Metastore" configuration. This is a zip file containing the configuration files; unzip it;
2.1.4. Copy (via scp, for example) the files core-site.xml, hdfs-site.xml, and hive-site.xml to the /tmp folder of each NiFi node in your Datahub environment. Since I have only one node in this example, I need to do this just once (ex: scp hdfs-site.xml hive-site.xml core-site.xml <cdpworkloadusername>@<publicnifinodeip>:/tmp). Then make all the files readable on each node (ex: chmod a=r hdfs-site.xml).

3. Configuring the NiFi Streaming Flow

Now we'll use the NiFi Streaming Flow to simulate the sensor data and send the data via streaming to the Hive Metastore located in the SDX platform in CDP.

First, access NiFi in the Data Flow Datahub:

Figure 2: Open NiFi in Data Hub Cluster

Now we'll upload the NiFi template located on GitHub. This template is based on the edge2ai workshop, but it has been changed to create the data randomly, directly in the flow, without using MiNiFi. You can get the flow template here.

In the NiFi canvas, select "Process Group" in the top menu and drag and drop it onto the empty canvas. A new menu will appear. Select browse to upload the template that you've just downloaded, and then click ADD.

Figure 3: Process Group Streaming Created

After this, you can double-click the Streaming Process Group and see that it contains two more Process Groups:
1. IoT Data Generator: Used to simulate sensor data and random errors, and to put the data in a Kafka topic.
2. Kafka to Hive: Used to consume the Kafka topic written by the first Process Group and send the data via streaming to the table that we've created.

3.1 - Configuring IoT Data Generator Group

Double-click the "IoT Data Generator" group; we'll need to update some configuration to make it work:

3.1.1. In the Operate menu inside the Process Group, click the gear icon to configure the "Controller Services":
First, click the lightning button in the "JsonRecordSetWriter" and "JsonTreeReader" controllers and enable both controllers;
There will be two controllers called "Default NiFi SSL Context Service", but one of them is in an "Invalid" state.
Click the "Arrow" icon on the right and then click the "Remove" button to remove this invalid service;

At the end you should have this:

3.1.2. Now close this screen and, in the IoT Data Generator group, double-click the "PublishKafkaRecord_2" processor and update the following configuration in the Properties tab:
"Kafka Brokers": Change the value to the DNS name/port of the Kafka broker where the data will be sent. Example: messaging-broker:9093. If you're using a Streams Messaging Data Hub, this can be easily located in Streams Messaging Manager;
"Kerberos Principal": The principal of your user. You can obtain it by connecting via SSH to a NiFi node with your CDP user/password and running kinit/klist. More information is available here.
"Username": Your workload username;
"Password": Your workload password;
"SSL Context Service": Select "Default NiFi SSL Context Service" in the drop-down menu;
Apply the changes; this will close the configuration.

At the end, your flow may look like this:

Figure 4: IoT Generator flow

Now go back to the initial "Streaming" group using the bottom-left menu, so we can configure the next Process Group to consume the messages and stream them to our table.

3.2. Configuring Kafka to Hive Group

Double-click the "Kafka to Hive" group; we'll need to update some configuration to make it work:

3.2.1. In the Operate menu inside the Process Group, click the gear icon to configure the "Controller Services":
Click the lightning button in the "JsonRecordSetWriter" and "JsonTreeReader" controllers and enable both controllers.

In the end, you should have this:

3.2.2. Now close this screen and, still in the "Kafka to Hive" group, double-click the "ConsumeKafka_2_0" processor and update the following configuration in the Properties tab:
"Kafka Brokers": Change the value to the DNS name/port of the Kafka broker that the data will be consumed from.
Example: messaging-broker:9093. If you're using a Streams Messaging Data Hub, this can be easily located in Streams Messaging Manager;
"Kerberos Principal": The principal of your user. You can obtain it by connecting via SSH to a NiFi node with your CDP user/password and running kinit/klist. More information in https://github.com/asdaraujo/cdp-examples#using-kerberos-authentication;
"Username": Your workload username;
"Password": Your workload password;
"SSL Context Service": Select "Default NiFi SSL Context Service" in the drop-down menu;
Apply the changes; this will close the configuration.

3.2.3. The last processor to configure is "PutHive3Streaming". Double-click this processor and configure:
"Hive Metastore URI": Change the value to the DNS name/port of your Data Lake master node, ex: thrift://master-node:9083. This can be located in the CDP UI;
"Hive Configuration Resources": Check that the paths are valid, since they can change. To do this, you can SSH into a NiFi node and check the configuration;
"Database Name": streaming (or the name of the database that you've chosen to create);
"Table": sensors (or the name of the table that you've defined);
"Kerberos Principal": Your workload username;
"Kerberos Password": Your workload password;
Apply the changes; this will close the configuration.

In the end, your flow may look like this:

Figure 5: Kafka to Hive Flow

Leave the group again and start both Process Groups; you can right-click each one and click the Start button.

Figure 6: Flow in Action!

4. Query the Streaming Data in Cloudera Data Warehouse

Now we can see the streaming data directly in the Unified Analytics Virtual Warehouse and/or connect Cloudera Data Visualization or a dashboard via JDBC/ODBC to visualize the data:

Figure 7: Query the streaming sensor data in Cloudera Data Warehouse

And we can monitor in real time that the data is increasing:

Figure 8: First Count

Figure 9: Second Count

Lastly, we can connect Cloudera Data Visualization directly to the table that is being ingested and see how we can quickly drive value from this data:

Summary

In this blog post, we've seen how to:
Create a flow that generates random sensor data, sends the messages to a topic, consumes this topic, and streams the data to a table;
Query this data using Cloudera Data Warehouse.

More details on each concept that we've seen in this post can be found in:
Streaming: Stream data into HIVE like a Boss using NiFi HiveStreaming - Olympics 1896-2008
Compaction: Data compaction
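To make the shape of the simulated data more concrete, here is a minimal Python sketch of what one generated record could look like. It is not the NiFi template itself: the function name, the value range, and the timestamp layout are assumptions for illustration, and only the field names come from the streaming.sensors table defined above.

```python
import json
import random
from datetime import datetime, timezone

SENSOR_COUNT = 12  # sensor_0 .. sensor_11, as in the streaming.sensors table


def make_sensor_record(sensor_id, rng, now=None):
    """Build one record whose keys match the columns of streaming.sensors."""
    now = now or datetime.now(timezone.utc)
    record = {
        # Timestamp layout is an assumption; use whatever your record reader expects.
        "sensor_ts": now.strftime("%Y-%m-%d %H:%M:%S"),
        "sensor_id": float(sensor_id),
    }
    for i in range(SENSOR_COUNT):
        # Value range 0..100 is hypothetical, chosen only for the example.
        record[f"sensor_{i}"] = round(rng.uniform(0.0, 100.0), 2)
    return record


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so repeated runs produce the same values
    print(json.dumps(make_sensor_record(1, rng)))
```

A JSON document like this is the kind of message a JSON record reader can parse before the rows are streamed into the table.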

carrossoni · ‎06-10-2021

Accessing AWS Cloudera Data Warehouse to query data on Azure Cloudera Data Warehouse

Introduction

Cloudera Data Platform lets you work with different public cloud providers from a single console. With this, you can have a true hybrid environment with only one admin console. Cloudera Data Warehouse is a public cloud service that allows fast analytics in your preferred cloud provider. In this article, I'll show how easy it is to connect two Virtual Warehouses located in different cloud providers using Cloudera Data Warehouse.

Scenario

We're using two different cloud providers for Cloudera Data Warehouse: one in AWS with TPC-DS data and another in Azure with the same TPC-DS data. We'll use Hive ACID to update the customer table on Azure and merge it with the customer table in AWS.

1. Pre-Requisites

1.1 - Cloudera CDP Control Plane Access and Register two environments

For this exercise, you will need access to the Cloudera Data Platform. More information can be accessed here. Also, since we will use two environments (AWS and Azure), we need to register the environments on the CDP control plane.
For AWS, refer to Introduction to AWS environments
For Azure, refer to Introduction to Azure environments

Figure 1: Environments Used.

Now we can set up the virtual warehouses for each environment that will work with the data.

1.2 - Create two Virtual Warehouses

After the automatic environment setup, we can activate the environments in Cloudera Data Warehouse:

Figure 2: Environments activated on Cloudera Data Warehouse experience.

And create our virtual warehouses:

Figure 3: Data warehouses created in different environments

Now we can move on to the next part and start our analysis across those two environments. We'll use the JDBC Storage Handler to communicate from one environment to the other.

2. Prepare database/table

For each Virtual Warehouse, we've uploaded the TPC-DS data to the bucket and created the tables.
2.1 - Change address for customer on Azure using ACID features

For the customer table, we want to change a customer's address to a new record in the Azure environment and reflect the change in the AWS environment.

First, access Hue on Azure Cloudera Data Warehouse from the Cloudera Data Warehouse UI and execute the next steps:

Figure 5: Open Hue in Azure Cloudera Data Warehouse

Now we can run a select for the record that we want to change:

select c.c_current_addr_sk, ca.ca_street_name, ca.ca_country from tpcds.customer c, customer_address ca where c.c_current_addr_sk = ca.ca_address_sk and c.c_customer_sk = 11316001;

Figure 6: Address that we want to change.

First, we will insert a new record into the customer_address table. For this, we need to find the highest ID already registered so we won't collide with any current address:

Figure 7: Max ID

Now that we have this, we can insert the new address:

insert into tpcds.customer_address values (6000001, "AAAAAAAACGICKEAA", "5470", "Great America", "Pkwy", NULL, "Santa Clara", "Santa Clara County", "CA", "95054", "United States", -7.00, "Business");

After this, we can update the customer information with the new ID and check it with the same query that we ran first:

Figure 8: Updated Address in Azure

2.2 - Create the External JDBC Table to connect from AWS Cloudera VW to Azure Cloudera VW

Now that we have the data on Azure, let's access the Cloudera Data Warehouse created on AWS in Hue, using the same method that we used to access Hue in Azure. In this example, we already have the same schema/tables created in this environment, with the data stored in S3 instead of ADLS.
Figure 9: Schema of tables in AWS Cloudera Environment

Now, in this AWS environment, we want to create the customer and address tables pointing to the tables located in the Azure Virtual Warehouse.

Creating Customer Azure External Table in AWS Cloudera VW:

For this step, we need the Azure Virtual Warehouse JDBC address, which we can get in the Cloudera Data Warehouse UI with the Copy JDBC URL button:

Figure 10: Copy JDBC URL from Azure Cloudera Virtual Warehouse

Now we can execute the following script to create the JDBC tables (please change the "hive.sql.jdbc.url" value to the JDBC address of your Virtual Warehouse, and change the user/password to your CDP user/password):

use tpcds;
CREATE TABLE `customer_azure`(
  `c_customer_sk` int,
  `c_customer_id` string,
  `c_current_cdemo_sk` int,
  `c_current_hdemo_sk` int,
  `c_current_addr_sk` int,
  `c_first_shipto_date_sk` int,
  `c_first_sales_date_sk` int,
  `c_salutation` string,
  `c_first_name` string,
  `c_last_name` string,
  `c_preferred_cust_flag` string,
  `c_birth_day` int,
  `c_birth_month` int,
  `c_birth_year` int,
  `c_birth_country` string,
  `c_login` string,
  `c_email_address` string,
  `c_last_review_date` string)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MYSQL",
  "hive.sql.jdbc.driver" = "org.apache.hive.jdbc.HiveDriver",
  "hive.sql.jdbc.url" = "<CHANGEMEwithJDBC>",
  "hive.sql.dbcp.username" = "<CDPUSERNAME>",
  "hive.sql.dbcp.password" = "<CDPWORKLOADPASSWORD>",
  "hive.sql.query" = "SELECT c_customer_sk, c_customer_id, c_current_cdemo_sk, c_current_hdemo_sk, c_current_addr_sk, c_first_shipto_date_sk, c_first_sales_date_sk, c_salutation, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_day, c_birth_month, c_birth_year, c_birth_country, c_login, c_email_address, c_last_review_date_sk from tpcds.customer",
  "hive.sql.dbcp.maxActive" = "1"
);
CREATE TABLE `customer_address_azure`(
  `ca_address_sk` int,
  `ca_address_id` string,
  `ca_street_number` string,
  `ca_street_name` string,
  `ca_street_type` string,
  `ca_suite_number` string,
  `ca_city` string,
  `ca_county` string,
  `ca_state` string,
  `ca_zip` string,
  `ca_country` string,
  `ca_gmt_offset` float,
  `ca_location_type` string)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MYSQL",
  "hive.sql.jdbc.driver" = "org.apache.hive.jdbc.HiveDriver",
  "hive.sql.jdbc.url" = "<CHANGEMEwithJDBC>",
  "hive.sql.dbcp.username" = "<CDPUSERNAME>",
  "hive.sql.dbcp.password" = "<CDPWORKLOADPASSWORD>",
  "hive.sql.query" = "SELECT ca_address_sk,ca_address_id,ca_street_number,ca_street_name,ca_street_type,ca_suite_number,ca_city,ca_county,ca_state,ca_zip,ca_country,ca_gmt_offset,ca_location_type from tpcds.customer_address",
  "hive.sql.dbcp.maxActive" = "1"
);

Figure 11: Azure JDBC Tables created in AWS Cloudera Virtual Warehouse

Now we can query these tables in the AWS Cloudera Virtual Warehouse:

select c.c_current_addr_sk, ca.ca_street_name, ca.ca_country from tpcds.customer_azure c, tpcds.customer_address_azure ca where c.c_current_addr_sk = ca.ca_address_sk and c.c_customer_sk = 11316001;

Figure 12: Results from Azure Cloudera DW environment in AWS Cloudera DW Environment.

Just to validate, we can run the same query in the AWS Cloudera environment against the original tables. We can see that the result is different:

Figure 13: Results from AWS Cloudera VW.

3. Update data on Cloudera AWS Data Warehouse using Cloudera Azure Data Warehouse tables

Now we can go to the last step of this article, and the easiest one. We will use Hive ACID features to refresh the data from the source tables.

3.1 Use ACID with MERGE syntax to upsert the Customer and Customer Address tables:

Cloudera ACID provides a powerful option to perform upserts, which can also be used for Slowly Changing Dimensions. Refer to Update Hive Tables the Easy Way Part 2.
First, we will update the AWS customer table based on the results of the customer_azure table:

use tpcds;
merge into tpcds.customer using tpcds.customer_azure as caz on customer.c_customer_sk = caz.c_customer_sk and caz.c_customer_sk = 11316001 when matched then update set c_current_addr_sk = caz.c_current_addr_sk;

Figure 14: Updating Customer table in Cloudera AWS Data Warehouse using Azure Cloudera Data Warehouse as source

Note that we are not inserting a record when there is no match, and we're not updating other fields, since we only want to demonstrate the address change in this example; both are completely possible. Also, in the ON clause, we're restricting customer_sk so that it matches a single record, just for this example.

merge into tpcds.customer_address using tpcds.customer_address_azure as cadaz on customer_address.ca_address_sk = cadaz.ca_address_sk and cadaz.ca_address_sk = 6000001 when not matched then insert values (cadaz.ca_address_sk,cadaz.ca_address_id,cadaz.ca_street_number,cadaz.ca_street_name,cadaz.ca_street_type,cadaz.ca_suite_number,cadaz.ca_city,cadaz.ca_county,cadaz.ca_state,cadaz.ca_zip,cadaz.ca_country,cadaz.ca_gmt_offset,cadaz.ca_location_type);

Figure 15: Insert new data into Customer Address table in Cloudera AWS Data Warehouse using Azure Cloudera Data Warehouse as source.

Now that we have updated/inserted the new data, we can check the data on AWS with the same query that we executed in Azure:

select c.c_current_addr_sk, ca.ca_street_name, ca.ca_country from tpcds.customer c, customer_address ca where c.c_current_addr_sk = ca.ca_address_sk and c.c_customer_sk = 11316001;

Figure 16: Fresh data in the AWS Cloudera Data Warehouse environment, with the same view as the Azure Cloudera Data Warehouse environment.

4. Conclusion and Going Further

With this, we've demonstrated how to access an Azure Cloudera Data Warehouse environment from an AWS Cloudera Data Warehouse environment and use Hive ACID features to upsert the data.
Going further, this can be used as a hybrid multi-cloud strategy where one Cloudera environment is used for Machine Learning and the other for Data Warehouse (or as a DEV/PROD strategy). Also, the data/metadata that we've created can be accessed from other experiences such as Data Engineering, Cloudera Machine Learning, Data Flows, and Data Hubs to build a complete end-to-end scenario. We can also extend this with Airflow in Cloudera Data Engineering to schedule the refresh, so that it runs periodically. Refer to Automating data pipelines using Apache Airflow in Cloudera Data Engineering.

5. Bonus: Using Impala and Cloudera Viz to present the ACID Data

With Cloudera Viz in an Impala Cloudera Virtual Warehouse, using the same AWS environment that we used in the steps above, we can create the model:

Figure 17: Data modeling on Cloudera VIZ under an Impala AWS Data Warehouse

Creating the dashboard is pretty easy, since we can use the options that Viz shows automatically based on the data types:

Figure 18: Dashboard Customer Address Creation.

And we can filter the data to see if the city that we've added is there:

Figure 19: Updated ACID data read via Impala.

6. Summary

We've covered a lot of concepts in this blog post, such as:
How to operate in a Hybrid Cloud Warehouse scenario using Cloudera Data Platform and Cloudera Data Warehouse
Hive ACID features
Cloudera Viz
Impala ACID read

More details on each feature can be found in this community. Stay tuned for more posts!
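The upsert idea behind the two MERGE statements (update rows whose key already exists, insert rows whose key is missing) can be sketched in plain Python on in-memory rows. This only illustrates the semantics; it is not how Hive executes MERGE. The upsert function is hypothetical, and the old address key 123 in the sample data is a made-up placeholder.

```python
def upsert(target, source, key, update_columns):
    """Update rows in target whose key matches a source row; insert the rest.

    Rows are plain dicts. Returns (updated_count, inserted_count).
    """
    index = {row[key]: row for row in target}
    updated = inserted = 0
    for src in source:
        existing = index.get(src[key])
        if existing is not None:
            # Counterpart of: WHEN MATCHED THEN UPDATE SET ...
            for col in update_columns:
                existing[col] = src[col]
            updated += 1
        else:
            # Counterpart of: WHEN NOT MATCHED THEN INSERT VALUES (...)
            new_row = dict(src)
            target.append(new_row)
            index[src[key]] = new_row
            inserted += 1
    return updated, inserted


if __name__ == "__main__":
    # 123 is a placeholder for the customer's previous address key.
    aws_customer = [{"c_customer_sk": 11316001, "c_current_addr_sk": 123}]
    azure_customer = [{"c_customer_sk": 11316001, "c_current_addr_sk": 6000001}]
    counts = upsert(aws_customer, azure_customer, "c_customer_sk", ["c_current_addr_sk"])
    print(counts, aws_customer)
```

In the article the two halves are issued as separate statements, one per table; the sketch combines them only to show both branches side by side.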

carrossoni · ‎03-04-2021

Hi @Ant5566 Can you take a look at /opt/cloudera/cm-agent/service/hive/hive.sh, especially the TEZ_JARS classpath? Let me know whether the path exists and is correct. If not, you may need to change it to the correct location and then restart the services. Thanks, Luiz

tusharkathpal · ‎10-06-2020

Hello @syedshakir Are you inserting data into the table from outside Impala (that is, via Hive, Sqoop, Spark, etc.)? If yes, Impala may not be aware of the newly added files, and running INVALIDATE METADATA or REFRESH on the table may fix your issue.

carrossoni · ‎09-21-2020

Introduction

Cloudera Data Warehouse architecture leverages compute/storage separation, which is different from the standard Hadoop architecture.

Figure 1: Cloudera Modern Data Warehouse Architecture

The objective of this post is to show how to import the wide range of functions and code available in the open-source community into Cloudera Data Warehouse using the object storage architecture. For this, we'll use the ESRI Spatial Framework as an example.

Prerequisites

We'll use git to download the ESRI project from GitHub, and Java and Maven to build the necessary JAR files.

Step 1: Download the files from ESRI Github repository

Download the necessary files from the ESRI Spatial Framework GitHub repository. This can be done using the following command:

$ git clone https://github.com/Esri/spatial-framework-for-hadoop.git

Figure 2: Cloning ESRI project

This will create a directory called "spatial-framework-for-hadoop". Enter this directory to build the project and generate the JAR files that will be used for the functions.

Step 2: Build the project using Maven

To build the project using Apache Maven, download it from the Maven website and install it according to your OS. Within the ESRI GitHub project directory, you can perform the build using the following:

$ mvn package

After a successful run you should see something like this:

Figure 3: Building ESRI project

Step 3: Copy the JAR files to the Cloudera Data Warehouse Object Storage

After creating the JAR files containing the functions that will be used, copy them to the object storage that is being used. In this example, we're using AWS S3. You can use the same bucket that Cloudera Data Warehouse uses for External Data, or add another bucket. For more information, see Adding access to external S3 buckets for Cloudera Data Warehouse clusters on AWS.
The build will create the JAR file that needs to be uploaded to the object storage:

spatial-sdk-hive-2.1.1-SNAPSHOT.jar -> Located in <path/to/githubproject>/spatial-framework-for-hadoop/hive/target

In my example, I've created a jars folder in my bucket and uploaded the file using the AWS S3 Console upload tool.

Upload the JAR to the object storage bucket:

Figure 4: Upload JAR File into the object storage.

File uploaded:

Figure 5: JAR uploaded in the object storage.

Step 4: Create the Functions

Now that the JAR file is in the object storage, you just need to create the functions inside Cloudera Data Warehouse, pointing to the JAR that was uploaded. In the Virtual Warehouse DAS or HUE, you can use the following syntax to create the functions (this example creates the ST_Geometry function):

CREATE FUNCTION ST_Geometry AS 'com.esri.hadoop.hive.ST_Geometry' USING JAR 's3a://<BucketName>/warehouse/tablespace/external/jars/spatial-sdk-hive-2.1.1-SNAPSHOT.jar';

For more CREATE FUNCTION statements for ESRI, you can visit my Github link.

Step 5: Test the Functions

Now the functions are ready to be used. Submit the following query to test that they are working:

SELECT ST_AsText(ST_Point(1, 2));

Figure 6: Functions working

Summary

In this article, we saw how easy it is to import the vast ecosystem of functions from the open-source community into Cloudera Data Warehouse, using the ESRI Spatial functions as a specific example. For more information on how to use ESRI functions in Cloudera Data Platform, you can check Geo-spatial Queries with Hive using ESRI Geometry and Spatial Framework for Hadoop or Esri/gis-tools-for-hadoop.
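Since every function needs its own CREATE FUNCTION statement, a small script can generate them. The following Python sketch follows the pattern of the ST_Geometry statement in Step 4. It assumes that each function's class lives in the com.esri.hadoop.hive package under the same name as the function, which you should verify against the JAR you built; the helper name is hypothetical.

```python
def create_function_statements(function_names, jar_uri, package="com.esri.hadoop.hive"):
    """Return one CREATE FUNCTION statement per function name.

    Assumes the implementing class is named exactly like the function.
    """
    return [
        f"CREATE FUNCTION {name} AS '{package}.{name}' USING JAR '{jar_uri}';"
        for name in function_names
    ]


if __name__ == "__main__":
    # <BucketName> is the same placeholder used in the article.
    jar = (
        "s3a://<BucketName>/warehouse/tablespace/external/jars/"
        "spatial-sdk-hive-2.1.1-SNAPSHOT.jar"
    )
    for statement in create_function_statements(["ST_Geometry", "ST_Point", "ST_AsText"], jar):
        print(statement)
```

The printed statements can then be pasted into DAS or HUE and run one by one.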

carrossoni · ‎05-17-2020

Introduction

How do we quickly gain insight and start working with data in a secure, governed, and scalable environment in the cloud? This article explains how to achieve this using the Cloudera Data Warehouse platform connected with Apache Superset.

Cloudera Data Warehouse in CDP (Cloudera Data Platform) is an enterprise solution for modern analytics. It's an auto-scaling, highly concurrent, and cost-effective hybrid, multi-cloud analytics solution that ingests data anywhere, at massive scale, from structured, unstructured, and edge sources.

Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application.

Pre-Requisites

This exercise was performed on macOS. The versions below were tested at the time of writing this article and may change in the future:

Python
Python 3.7.5
pip 20.0.2

After the Python/pip installation, install the following packages/versions in Python (we recommend using venv before this step):

impyla==0.16.1
thrift==0.13.0
thrift_sasl==0.2.1

Apache Superset Configuration

Apache Superset can be installed on your machine or executed in a Docker environment. In this example, we will use the steps provided in Python Virtualenv, and the version is:

apache-superset==0.999.0.dev0

After setting up the environment, you can access the Superset UI at the following address: http://127.0.0.1:8088/

Figure 1: Welcome to Apache Superset

The default username/password is admin/admin.

Cloudera Data Warehouse

If you don't have an Impala Virtual Warehouse (used in this example), you need to create one that will connect to the Database Catalog. This is a very simple step and can be done in minutes. Once you have created a virtual warehouse, if your Database Catalog already has the Tables, Security, and Metadata Definitions to be used, you or the user/application (in our case Apache Superset) can start using the platform. More information can be obtained in this link.
Figure 2: Cloudera Data Warehouse

Here, we will be using the "default-impala" Virtual Warehouse. Since the environment is not running and nobody is using it, it is not consuming any resources. After the Virtual Warehouse creation, you will need to collect the URL to connect to your environment, as in the following example:

Figure 3: Getting Access URL in Cloudera Data Warehouse

Once you save the access URL, you can configure the Dashboard in Apache Superset.

Configure Cloudera Data Warehouse as Source Database

After the prerequisites, we'll configure the connection in Apache Superset. To start creating the dashboard in Cloudera Data Warehouse, perform the following:

Click Source > Database in the top left menu:

Figure 4: Configuring Source Database

In the top right corner, click the "Add new record" button:

Figure 5: Add new database button

Now we need to enter the configuration in the following screen:

Figure 6: Configuring Database

Database Name: Choose a name, for example "ClouderaDataPlatform".
SQLAlchemy URI: We'll use the Access URL that we got in the Cloudera Console. We need to customize the URI in order to use impyla and the URL format supported by SQLAlchemy:
From: jdbc:impala://example-default-impala.env-pkXXXX.dwx.example.site:443/default;AuthMech=3;transportMode=http;httpPath=cliservice;ssl=1;UID=luizcarrossoni;PWD=PASSWORD
To: impala://example-default-impala.env-pkXXXX.dwx.example.site:443/default?auth_mechanism=PLAIN&http_path=cliservice&use_http_transport=True&use_ssl=True
Expose in SQL Lab: Checked
Allow Multi Schema Metadata Fetch: Checked
Extra: Here, we'll pass our Cloudera Data Platform access credentials (there are other, more secure ways to do this in Apache Superset):
{ "metadata_params": {}, "engine_params": { "connect_args": { "user" : "<cdpuser>", "password" : "<password>" } } }

After providing the config information, click the Test button in the SQLAlchemy URI field to see if everything is working properly.
If the Virtual Warehouse is in the Stopped state, the test will first start the Warehouse, and then you'll see that the test was successful:

Figure 7: Starting Virtual Warehouse

Figure 8: Connection Successful

Now you can save the connection and start creating your dashboards.

Query Data through SQL Lab

You can query the data in the Virtual Warehouse using SQL Lab in Superset:

Figure 9: Query Data in SQL Lab

Note: Since the query reads data that supposedly contains PII (ccnumber), the data comes back as hashes. This is because we have the following policy in place for the user:

Figure 10: Masking Policy

Create your Dashboard

To create the Dashboard using Apache Superset in Cloudera Data Platform, do the following:

Add the table as a source in the following menu:

Figure 11: Add Table Source

Add the ww_customers_data table to start creating the dashboard:

Figure 12: Create Source Table

Create Charts using the source table that was created and use the charts in a Dashboard:
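The From/To rewrite of the access URL in the database configuration step can also be done programmatically. The following Python sketch is a hypothetical helper, not part of Superset or impyla. The option mapping (AuthMech=3 to auth_mechanism=PLAIN, and so on) is inferred only from the From/To example in this article, and the UID/PWD values are dropped on purpose because the credentials go into the Extra field instead.

```python
from urllib.parse import urlencode


def jdbc_to_sqlalchemy_uri(jdbc_url):
    """Convert a jdbc:impala:// access URL into an impala:// SQLAlchemy URI."""
    prefix = "jdbc:impala://"
    if not jdbc_url.startswith(prefix):
        raise ValueError("expected a jdbc:impala:// URL")
    # The first segment is host:port/database; the rest are key=value options.
    location, *options = jdbc_url[len(prefix):].split(";")
    props = dict(opt.split("=", 1) for opt in options if "=" in opt)

    params = {}
    if props.get("AuthMech") == "3":
        params["auth_mechanism"] = "PLAIN"
    if "httpPath" in props:
        params["http_path"] = props["httpPath"]
    if props.get("transportMode") == "http":
        params["use_http_transport"] = "True"
    if props.get("ssl") == "1":
        params["use_ssl"] = "True"

    query = urlencode(params)
    return f"impala://{location}" + (f"?{query}" if query else "")


if __name__ == "__main__":
    jdbc = (
        "jdbc:impala://example-default-impala.env-pkXXXX.dwx.example.site:443/default;"
        "AuthMech=3;transportMode=http;httpPath=cliservice;ssl=1;"
        "UID=luizcarrossoni;PWD=PASSWORD"
    )
    print(jdbc_to_sqlalchemy_uri(jdbc))
```

Options that the helper does not recognize are ignored, so check the result against your own access URL before pasting it into the SQLAlchemy URI field.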

Last Visited	‎06-03-2022 12:41 PM

Member Since	‎11-26-2018 06:01 AM
Posts	20
Kudos received	8

Cloudera Community

Re: How to create a Centos7 CDP-DC Base VM for san...

Re: Using Iceberg Table Format in CDP Public Cloud...

Re: Install python interpreter in CDP Public Cloud

Streaming/Query data to CDP Public Cloud Using Clo...

Accessing Cloudera Data Warehouse Created on Diffe...

Re: Please advise-Getting this error in Hue - Cou...

Re: Failed to open HDFS file

Import ESRI Spatial Framework functions in Clouder...

How to connect Apache Superset (incubating) in Clo...