Member since: 02-04-2022 | Posts: 9 | Kudos Received: 6 | Solutions: 0
09-20-2024
11:48 AM
Anshul, I am looking forward to continuing this story at IBM TechXchange 2024 as we talk about "Process unstructured data in real-time with GenAI while ensuring compliance with AI regulations".
11-09-2023
03:00 PM
2 Kudos
Overview

The purpose of this article is to explain a few ways to access the metadata of data objects in Cloudera's Hive & Impala data warehouses. Before we get into the specifics, let's understand the typical method of accessing metadata and data objects in Cloudera Data Platform (CDP):

The Hive Metastore (HMS) database holds the metadata of all Hive & Impala data objects. When a user or application is outside the Cloudera cluster, the thrift protocol is used to access it.
The Hive & Impala data warehouses hold the data objects themselves. When a user or application is outside the Cloudera cluster, Cloudera's JDBC/ODBC drivers are used to access them.

Cloudera excels at hybrid architecture, and there is a consistent method of accessing metadata and data across the different deployment models. If you're not already familiar with the CDP deployment models, it's highly recommended to check out the following links: Base, Datahub, CDW Data Service in Private Cloud and CDW Data Service in Public Cloud.

A single HMS database is available per environment (an environment refers to a CDP On-Premise or Public Cloud installation) to hold information about all Hive/Impala objects. In other words, regardless of where Hive/Impala objects are created (for example, in the CDW Data Service), their metadata is available in the single HMS instance associated with that environment. To access the HMS database from outside the cluster, in both CDP On-Prem and Public Cloud, the thrift protocol is used along with Kerberos authentication. Cross-realm Kerberos trust must be configured between the user/application and the Cloudera environment so that they can communicate with each other.

Ways to access metadata

Below are a few ways to access the metadata of data objects in Cloudera's Hive & Impala data warehouses.

Hive Metastore (HMS)

The HMS database holds the metadata of all Hive & Impala data objects. If you're accessing it from outside the Cloudera cluster, use the thrift protocol and ensure cross-realm Kerberos trust is configured between the client (user/application) and the Cloudera cluster; this is typically done by adding krb5.conf to the client. If you're accessing it from within the Cloudera cluster, use the HMS instance that you set up during CDP On-Premise installation, or, in Public Cloud, use the instructions here.

sys database

The sys database mirrors the HMS database and resides in the Hive data warehouse, which is accessed using Cloudera's JDBC/ODBC drivers. Note that even though the sys database is only available through the Hive data warehouse, it contains the metadata of both Hive & Impala data objects. For reference, see the example query below.

-- See a few details about a subset of tables
select b.name, a.tbl_name, a.owner, a.tbl_type, c.location, a.view_original_text, a.view_expanded_text
from sys.tbls a
join sys.dbs b on b.db_id = a.db_id
join sys.sds c on c.sd_id = a.sd_id
where tbl_name like '%tmp_%';

Tip: keep this HMS schema handy when querying the sys database.

Apache Atlas API

Apache Atlas is a governance tool that lets you see the metadata of all data assets, data lineage, and much more. REST APIs are exposed so the necessary information can be retrieved remotely, and a Python library for Apache Atlas is also available on PyPI. Below is a common curl example of accessing Atlas in Cloudera using the REST API (a Python version of the same call is sketched at the end of this article):

curl -k -u <usr>:<pwd> https://<cloudera_datalake_host>:443/<env>/cdp-proxy-api/atlas/api/atlas/v2/types/typedefs/

Note that the full Atlas API URL is available in the Endpoints section of the datalake. See below for reference.

Summary

Glad you made it this far. Even with the complexity that comes with a hybrid architecture, Cloudera has made it simple to access the metadata of data objects in its data warehouses. There are multiple options available, as discussed above, that customers & partners can choose from based on their requirements.
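As referenced above, here is a minimal Python sketch of the same Atlas typedefs call using the requests library. The host, environment name, and credentials are placeholders, and basic authentication with workload credentials through the Knox cdp-proxy-api endpoint is assumed; adjust the URL and authentication to match your environment.

import requests

# Placeholders -- substitute your own datalake host, environment name, and workload credentials.
ATLAS_BASE = "https://<cloudera_datalake_host>:443/<env>/cdp-proxy-api/atlas/api/atlas/v2"
USER, PWD = "<usr>", "<pwd>"

# List all Atlas type definitions (same call as the curl example above).
resp = requests.get(
    f"{ATLAS_BASE}/types/typedefs",
    auth=(USER, PWD),
    verify=False,  # equivalent to curl -k; use a proper CA bundle in practice
)
resp.raise_for_status()

# Print the names of the entity type definitions that were returned.
for entity_def in resp.json().get("entityDefs", []):
    print(entity_def["name"])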
09-30-2022
04:16 PM
Use Case

As a healthcare provider / public health official, I want to respond equitably to the COVID-19 pandemic as quickly as possible and serve all the communities that are adversely impacted in the state of California. I want to use health equity data reported by the California Department of Public Health (CDPH) to identify impacted members and accelerate the launch of outreach programs.

Design

Collect - Ingest data from https://data.chhs.ca.gov/dataset/covid-19-equity-metrics using NiFi.
Enrich - Transform the dataset using Spark and load Hive tables.
Report - Gather insights using the Hue editor.

Implementation

Prerequisites

An existing CDP Private Cloud environment with the following services: Hadoop Distributed File System (HDFS), Atlas, NiFi, Spark, Hive on Tez, Hue, HBase.
Download the GitHub code here, for easy reference.
Add covid-19-equity-metrics-data-dictionary.csv to your storage directory.
Add member_profile.csv to your storage directory.

The steps to create this data pipeline are as follows. Please note that this data pipeline's documentation is in accordance with CDP Runtime Version 7.1.7.

Step #1 - Setup NiFi Flow

Go to the NiFi user interface and upload NiFi-CDPH.xml as a template. NiFi-CDPH.xml uses the PutHDFS processor to connect to an existing HDFS directory; please change the properties in this processor to use your own HDFS directory. Execute the flow and ensure the InvokeHTTP processors are able to get covid19case_rate_by_social_det.csv and covid19demographicratecumulative.csv. Verify that these files are added to your storage directory. For reference, here's a picture of the flow in the NiFi user interface.

Step #2 - Setup PySpark Job

SSH in to the cluster and make the enrich.py program available in any directory. Please change the fs variable in enrich.py to point to your HDFS directory. Execute the following command - /bin/spark-submit enrich.py - and monitor the logs to ensure it finishes successfully. It takes approximately 4 minutes to finish. The following Hive tables are created by this job (an illustrative PySpark sketch of this kind of job appears at the end of this article):

cdph.data_dictionary
cdph.covid_rate_by_soc_det
cdph.covid_demo_rate_cumulative
member.member_profile
member.target_mbrs_by_income
member.target_mbrs_by_age_group

Step #3 - Identify impacted members in Hue editor

Open the Hue editor and explore the Hive tables created by the PySpark job.

-- Raw Data
select * from cdph.data_dictionary a;
select * from cdph.covid_rate_by_soc_det a;
select * from cdph.covid_demo_rate_cumulative a;
select * from member.member_profile a;

Let's analyze this data to identify the income groups that are most impacted by COVID-19.

select
a.social_det,
a.social_tier,
a.priority_sort,
avg(a.case_rate_per_100k) as avg_rate
from cdph.covid_rate_by_soc_det a
where a.social_det = 'income'
group by a.social_det, a.social_tier, a.priority_sort
order by a.priority_sort;

a.social_det    a.social_tier    a.priority_sort    avg_rate
income          above $120K      0.0                10.7511364396
income          $100k - $120k    1.0                14.9906347703
income          $80k - $100k     2.0                17.6407007434
income          $60k - $80k      3.0                21.9569391068
income          $40k - $60k      4.0                25.964262668
income          below $40K       5.0                28.9420371609

Let's do one more exercise and identify the age groups that are most impacted by COVID-19 in the last 30 days. This query uses the metric_value_per_100k_delta_from_30_days_ago column, which is the difference between the most recent 30-day rate and the previous 30-day rate.

select
a.demographic_set,
a.demographic_set_category,
a.metric,
avg(a.metric_value_per_100k_delta_from_30_days_ago) as avg_rate
from cdph.covid_demo_rate_cumulative a
where
demographic_set in ('age_gp4')
and metric in ('cases')
and a.demographic_set_category is not null
group by a.demographic_set, a.demographic_set_category, a.metric
order by avg_rate desc;

a.demographic_set    a.demographic_set_category    a.metric    avg_rate
age_gp4              18-49                         cases       211.566164197452
age_gp4              50-64                         cases       159.875170602457
age_gp4              0-17                          cases       112.383263616547
age_gp4              65+                           cases       96.0969276083687

Based on the above results, below $40K is the most impacted income group in terms of COVID-19 cases, and 18-49 is the most impacted age group in terms of COVID-19 cases in the last 30 days. You can now use this information to filter the members that are in these categories. Execute the following queries to get the impacted members:

select * from member.target_mbrs_by_income a where social_tier = 'below $40K';
select * from member.target_mbrs_by_age_group a where demographic_set_category = '18-49';

Step #4 - View Hive tables in Atlas

Go to the Atlas user interface, select any Hive table created in this exercise, and see its lineage, schema, audits, etc. Here's a snapshot of the covid_rate_by_soc_det table in Atlas.
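The enrich.py program from the GitHub repository does the actual transformation work; as a rough illustration of the pattern it follows, here is a minimal PySpark sketch that reads one of the staged CSV files from HDFS and writes it out as a Hive table. The paths, database, and column handling are assumptions for illustration only - refer to the repository for the real program.

from pyspark.sql import SparkSession

# Illustrative only -- the real logic lives in enrich.py from the GitHub repository.
# The HDFS path below is a placeholder; point it at your own staging directory.
fs = "hdfs://<namenode_host>:8020/user/<your_user>/cdph"

spark = (
    SparkSession.builder
    .appName("covid-equity-enrich-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the staged CDPH file that the NiFi flow dropped into HDFS.
covid_rate = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(f"{fs}/covid19case_rate_by_social_det.csv")
)

# Persist it as a managed Hive table so it can be queried from Hue.
spark.sql("CREATE DATABASE IF NOT EXISTS cdph")
covid_rate.write.mode("overwrite").saveAsTable("cdph.covid_rate_by_soc_det")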
09-30-2022
02:58 PM
Use Case

As a healthcare provider / public health official, I want to respond equitably to the COVID-19 pandemic as quickly as possible and serve all the communities that are adversely impacted in the state of California. I want to use health equity data reported by the California Department of Public Health (CDPH) to identify impacted members and accelerate the launch of outreach programs.

Design

Collect - Ingest data from https://data.chhs.ca.gov/dataset/covid-19-equity-metrics using NiFi.
Enrich - Transform the dataset using Spark and load Hive tables.
Report - Gather insights using Hive tables and Data Visualization.
Predict - Connect to Hive tables and build Machine Learning (ML) models of your choice.

Implementation

Prerequisites

An existing CDP Public Cloud environment and knowledge of its basic functions.
Download the GitHub code here, for easy reference.
Add covid-19-equity-metrics-data-dictionary.csv to your storage bucket.
Add member_profile.csv to your storage bucket.

The steps to create this data pipeline are as follows. Please note that this data pipeline's documentation is in accordance with CDP Runtime Version 7.2.12.

Step #1 - Setup NiFi Flow

Create or use a Data Hub Cluster with NiFi. The following Data Hub Cluster type can be used for this exercise - "7.2.12 - Flow Management Light Duty with Apache NiFi, Apache NiFi Registry". Go to the NiFi user interface and upload NiFi-CDPH.xml as a template. NiFi-CDPH.xml uses the PutS3Object processor to connect to an existing Amazon S3 bucket; please change the properties in this processor to use your own bucket. If you don't use Amazon S3 storage, replace the PutS3Object processor with a processor of your own choice (refer to the NiFi docs for details). For quick reference, here are the frequently used processors to write to a file system:

PutAzureDataLakeStorage for Azure Data Lake Storage
PutGCSObject for Google Cloud Storage
PutHDFS for Hadoop Distributed File System (HDFS)

Execute the flow and ensure the InvokeHTTP processors are able to get covid19case_rate_by_social_det.csv and covid19demographicratecumulative.csv. Verify that these files are added to your storage bucket. Once you're satisfied with the functions of this NiFi flow, download the flow definition. For reference, here's a picture of the flow in the NiFi user interface.

Step #2 - Setup Cloudera DataFlow (CDF)

Now that the NiFi flow is ready, it's time to deploy it in your CDP environment. Go to the CDF user interface and ensure the CDF service is enabled in your CDP environment. Import the flow definition. Select the imported flow, click Deploy, and follow the wizard to complete the deployment. Please note that the Extra Small NiFi node size is enough for this data ingestion. After the deployment is done, you will see the flow in the Dashboard, where you can manage the deployment and perform functions like starting/terminating the flow, viewing the flow, changing the runtime, viewing KPIs, viewing alerts, etc. In Step #1, you've already executed the NiFi flow to add the source files to your storage bucket, so you don't need to execute it again from CDF. But even if you do, it's just going to overwrite the files and not hurt anything.

Step #3 - Setup Cloudera Data Engineering (CDE)

Go to the CDE user interface and ensure the CDE service is enabled in your CDP environment and a virtual cluster is available for use. Create a Spark job. In the wizard, upload the enrich.py program and leave the other options as default. Please change the fs variable in enrich.py to point to your bucket.
Execute the job and monitor the logs to ensure it finishes successfully. It takes approximately 4 minutes to finish. The following Hive tables are created by this job:

cdph.data_dictionary
cdph.covid_rate_by_soc_det
cdph.covid_demo_rate_cumulative
member.member_profile
member.target_mbrs_by_income
member.target_mbrs_by_age_group

Step #4 - Setup Cloudera Data Warehouse (CDW)

Go to the CDW user interface. Ensure the CDW service is activated in your CDP environment, and that a Database Catalog and a Virtual Warehouse compute cluster are available for use. Open the Hue editor and explore the Hive tables created by the CDE job.

-- Raw Data
select * from cdph.data_dictionary a;
select * from cdph.covid_rate_by_soc_det a;
select * from cdph.covid_demo_rate_cumulative a;
select * from member.member_profile a;

Step #5 - Setup Cloudera Data Visualization (Data VIZ) Dashboard

Go to the Data VIZ user interface.

Under the DATA tab, create the first Dataset - COVID Rate by Social Determinants. Dataset Details: Update Dimensions & Measures to look like below:

Under the DATA tab, create the second Dataset - COVID Demographic Rate Cumulative. Dataset Details: Update Dimensions & Measures to look like below:

Once the Datasets are available, go to the VISUALS tab and create a new dashboard.

Let's create the first visual in the dashboard, to show COVID-19 cases by income group. Select Default Hive VW and COVID Rate by Social Determinants from the drop-down menus, and create a new visual. Set the following parameters:

Visual Type - Combo (Combined Bar/Line)
Dimension - priority_sort
Bar Measure - avg(case_rate_per_100k)
Tooltips - max(social_tier)
Filters - social_det in ('income')

Let's create the second visual in the dashboard, to show COVID-19 related deaths by age group. Select Default Hive VW and COVID Demographic Rate Cumulative from the drop-down menus, and create a new visual. Set the following parameters:

Visual Type - Lines
X Axis - demographic_set_category. Go to Field Properties, and select "Ascending" under "Order and Top K".
Y Axis - avg(metric_value_per_100k)
Filters - demographic_set in ('age_gp4'), metric in ('deaths'), county in ('Alameda', 'Contra Costa', 'Los Angeles', 'San Diego', 'San Francisco', 'Ventura'). To see data for all counties in California, USA, remove the county filter.

For reference, here's the complete dashboard:

Step #6 - Identify impacted members in Hue editor

As you can see in the visuals, below $40K is the most impacted income group in terms of COVID-19 cases, and 65+ is the most impacted age group in terms of COVID-19 related deaths. You can now use this information to filter the members that are in these categories. Open the Hue editor and execute the following queries to get the impacted members:

select * from member.target_mbrs_by_income a where social_tier = 'below $40K';
select * from member.target_mbrs_by_age_group a where demographic_set_category = '65+';

Step #7 - View Hive tables in Cloudera Data Catalog

Go to the Data Catalog user interface. Select any Hive table created in this exercise and see its lineage, schema, audits, etc.

Step #8 - Setup Cloudera Machine Learning (CML)

Go to the CML user interface. Under the ML Workspaces menu item, provision a new workspace. While provisioning, enable Advanced Options and check "Enable Public IP Address for Load Balancer". This could take ~45 minutes to finish.

Once the workspace is available, create a New Project. Under Initial Setup, the Template tab is selected by default, which works for most users, but you also have options to start from scratch (Blank), use existing Applied Machine Learning Prototypes (AMPs - see the AMPs navigation menu item for details), use local files (Local Files), or use a Git repository (Git). Download covid_outreach.ipynb and upload it into your project. If multiple people are going to work on this project, add them as collaborators with the right role under the Collaborators menu item.

Once you have the project set up, start a New Session. Select JupyterLab in the Editor dropdown and check Enable Spark. In your session, select the covid_outreach.ipynb notebook. Please replace YOUR_USERNAME and YOUR_PASSWORD in the notebook with your workload credentials (a connection sketch follows at the end of this article); if you're not sure how to set this up, refer to Setting the workload password. Execute the notebook and see the data in the Hive tables.

Now you're ready to play around with the datasets and build your ML models. Run an Experiment under the Experiments menu item when you have a draft model ready. When you're ready to deploy the model, go to the Models menu item and select New Model. Once you're satisfied with the results of your model, create a New Job under the Jobs menu item to set up arguments, schedule, notifications, and so on. For reference, please see the menu items highlighted in the blue box, referenced in the points above.
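To give a flavor of what a notebook like covid_outreach.ipynb has to do, here is a minimal sketch that connects to a CDW Hive Virtual Warehouse using workload credentials and the impyla package. The hostname, HTTP path, and port are assumptions - take the actual host details from your Virtual Warehouse's connection settings, and note that the real notebook may connect differently.

from impala.dbapi import connect

# Assumed connection details -- copy the real hostname from your Virtual Warehouse settings.
HOST = "<your-virtual-warehouse-host>"
USER = "YOUR_USERNAME"        # CDP workload user name
PASSWORD = "YOUR_PASSWORD"    # CDP workload password

# CDW Hive Virtual Warehouses are typically reached over HTTPS (port 443) with LDAP auth.
conn = connect(
    host=HOST,
    port=443,
    auth_mechanism="LDAP",
    user=USER,
    password=PASSWORD,
    use_ssl=True,
    use_http_transport=True,
    http_path="cliservice",
)

cursor = conn.cursor()
cursor.execute(
    "select * from member.target_mbrs_by_age_group "
    "where demographic_set_category = '65+' limit 10"
)
for row in cursor.fetchall():
    print(row)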
09-20-2022
10:25 PM
1 Kudo
Overview

In this article, let's build a real-time data visualization to analyze Twitter feeds using Cloudera Data Platform.

Design

Explanation:

The NiFi flow invokes the Twitter API v2 (every 15 seconds) and stages all tweets in an AWS S3 bucket.
Hive external tables point to the staging location (i.e. the AWS S3 bucket).
Data Visualization uses the Hive external tables as its data source to create visuals. All visuals are refreshed every 20 seconds.

Implementation

Prerequisites

A Cloudera Data Platform (CDP) Public Cloud environment on Amazon Web Services (AWS). If you don't have an existing environment, follow the instructions here to set one up - CDP/AWS Quick Start Guide.
An app in Twitter's Developer Portal. This is needed to call the Twitter API v2. If this is your first time using the Twitter API v2, follow these instructions - Step-by-step guide to making your first request to the new Twitter API v2.
Download the GitHub code here, for easy reference.

Step #1 - Cloudera DataFlow (CDF)

Go to the CDF user interface and ensure the CDF service is enabled in your CDP environment. Import the following flow definition - nifi-twitter-flow.json. Select the imported flow, click Deploy, select the Target Environment, and begin the deployment process. During the deployment, you will be asked for the following parameters that this NiFi flow requires to function (a Python sketch of the underlying Twitter API call appears at the end of this article):

AWS - Access Key ID - visit Understanding and getting your AWS credentials if you're not clear on how to get it. Ensure that the AWS IAM user you're using has "AmazonS3FullAccess" permissions.
AWS - Secret Access Key - same instructions as AWS - Access Key ID.
AWS S3 Bucket - provide the AWS S3 bucket name. Ensure that the IAM user has access to this S3 bucket.
AWS S3 Bucket Subdirectory - provide the subdirectory in the AWS S3 bucket where you want to stage your tweets. It's usually best to delete any historical data from this subdirectory, so you are only staging the latest tweets.
Twitter API v2 Bearer Token - provide your app's bearer token from Twitter's Developer Portal.
Twitter Search Term - provide the search term for which you want to do the analysis, for example COVID19, DellTechWorld, IntelON, etc. Only one search term is allowed at the moment.

The Extra Small NiFi node size is enough for this data ingestion. After the deployment is done, you will be able to see the flow in the Dashboard. Open the NiFi flow to understand how it works; notes are available in the flow to explain the use of each processor. All NiFi flow parameters can be updated while the flow is running, from the Deployment Manager. As soon as you Apply Changes, running processors that are affected by the parameter changes will automatically be restarted.

Step #2 - Cloudera Data Warehouse (CDW)

Go to the CDW user interface. Ensure the CDW service is activated in your CDP environment, and that a Database Catalog and a Virtual Warehouse compute cluster are available for use. In the Hue editor, manually load the ISO Language Codes into a table; the default settings in the importer wizard will work fine. If you're not sure how to upload data in Hue, visit Hue Importer -- Select a file, choose a dialect, create a table. In the Hue editor, execute twitter-queries.sql. This will create the necessary tables and views required to support the visuals in the Twitter Dashboard. Please change the AWS S3 location to where you've staged the tweets data. After the query execution is successful, you will be able to validate the tables using the queries below.

SELECT * FROM twtr.iso_language_codes a;
SELECT * FROM twtr.tweets b;
SELECT * FROM twtr.twtr_view c;
SELECT * FROM twtr.tweets_by_minute d;

Step #3 - Data Visualization

Go to the CDW user interface, select Data Visualization, and add a new Data VIZ. In the Data Visualization user interface, create a new connection (you must be logged in as admin to create a new connection). Now that you have a connection to the Hive virtual warehouse, let's create the two datasets required to support the visuals.

Create the first dataset:
Dataset Title - Twitter View
Dataset Source - From Table
Select Database - twtr
Select Table - twtr_view

Create the second dataset:
Dataset Title - Tweets By Minute
Dataset Source - From Table
Select Database - twtr
Select Table - tweets_by_minute

It's now time to Import Visual Artifacts. Take a quick look at Importing a dashboard if you're doing it for the first time. Choose dataviz-twitter-dashboard.json in the import dialog. Once you get the following screen, click ACCEPT AND IMPORT. The Twitter Dashboard should be successfully imported at this point. To see it, go to VISUALS from the top menu and select Twitter Dashboard.

Congratulations on creating your real-time Twitter Dashboard using Cloudera Data Platform! To learn more about its implementation, please register here to watch the recording.
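As referenced in Step #1, the InvokeHTTP processor in the flow essentially makes a Twitter API v2 recent search request. Here is a minimal Python sketch of that call using the requests library, with the bearer token and search term as placeholders; the actual flow builds the request with NiFi processors rather than Python.

import requests

# Placeholders -- use your app's bearer token and the search term configured in the flow.
BEARER_TOKEN = "<your_twitter_api_v2_bearer_token>"
SEARCH_TERM = "COVID19"

# Twitter API v2 recent search endpoint, roughly what the flow calls every 15 seconds.
resp = requests.get(
    "https://api.twitter.com/2/tweets/search/recent",
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    params={
        "query": SEARCH_TERM,
        "max_results": 100,  # maximum allowed per request for recent search
        "tweet.fields": "created_at,lang,public_metrics",
    },
)
resp.raise_for_status()

# Print a short summary of each tweet that was returned.
for tweet in resp.json().get("data", []):
    print(tweet["created_at"], tweet["lang"], tweet["text"][:80])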
08-18-2022
08:06 PM
1 Kudo
Overview

Over the last few years, Cloudera has evolved to become the hybrid data company. Inarguably, the future of the data ecosystem is a hybrid one that enables common security and governance across on-premise and multi-cloud environments. In this article, you will learn how Cloudera enables key hybrid capabilities like application portability and data replication in order to quickly move workloads and data to the cloud. Let's explore how Cloudera Data Platform (CDP) excels at the following hybrid use cases, through data pipeline replication and data pipeline migration exercises to the public cloud:

Develop Once and Run Anywhere
De-risk Cloud Migration

Design

The following diagram shows a data pipeline in private & public form factors.

Prerequisites

A data pipeline in a private cloud environment. Please follow the instructions provided in cdp-pvc-data-pipeline to set one up.
Cloudera Data Platform (CDP) on Amazon Web Services (AWS). To get started, visit the AWS quick start guide.

Implementation

Below are the steps to replicate a data pipeline from a private cloud base (PVC) cluster to a public cloud (PC) environment.

Replication Manager

Prior to creating replication policies, register the private cloud cluster in the public cloud environment. Go to Classic Clusters in the CDP PC Management Console and add the PVC cluster. In the Cluster Details dialog, select the CDP Private Cloud Base option and fill out all the details. If you have Knox enabled in your PVC cluster, check the "Register KNOX endpoint (Optional)" box and add the KNOX IP address & port. To ensure the PVC cluster works with Replication Manager in the PC environment, please follow the instructions given in Adding CDP Private Cloud Base cluster for use in Replication Manager and Data Catalog. For more details on adding & managing classic clusters, please visit Managing classic clusters. Once the PVC cluster is added, proceed to create policies in Replication Manager.

Create HDFS Policy

Add all the code to the HDFS directory of your choice; this includes the NiFi flow definitions and the PySpark program in this exercise. Now, go to Replication Manager and create an HDFS policy. While creating the policy, you will be asked to provide the source cluster, source path, destination cluster, destination path, schedule, and a few other additional settings. If you need help during any step in the process, please visit Using HDFS replication policies and Creating HDFS replication policy. Sample replication policy for reference -

Create Hive Policy

While creating the policy, you will be asked to provide the source cluster, destination cluster, database & tables to replicate, schedule, and a few other additional settings. In the additional settings, ensure "Metadata and Data" is checked under the Replication Option. If you need help during any step in the process, please visit Using Hive replication policies and Creating Hive replication policy. Sample replication policy for reference -

Create HBase Policy

While creating the policy, you will be asked to provide the source cluster, source tables, destination cluster, and a few other initial snapshot settings. If you need help during any step in the process, please visit Using HBase replication policies and Creating HBase replication policy. Sample replication policy for reference -

NiFi

Download the NiFi flow definition from the PVC cluster to your local machine. Go to Cloudera DataFlow (CDF) in the PC environment and import the flow definition.
Update the NiFi flow to use the PutS3Object processor instead of the PutHDFS processor, and update the properties in this processor to use the desired S3 bucket. With this change, you will start storing incoming files in the AWS S3 bucket instead of the PVC HDFS directory. For any help with the PutS3Object processor, please see the cdp-data-pipeline NiFi flow. Once the updates are done, re-import the NiFi flow as a newer version and deploy it.

Spark

Update the PySpark program to use the AWS S3 bucket instead of the PVC HDFS directory (a small sketch of this path change appears at the end of this article). Go to Cloudera Data Engineering (CDE), and create a Spark job using the PySpark program.

Hive

Go to Cloudera Data Warehouse (CDW) in the PC environment, and choose the Virtual Warehouse of the data lake selected during Hive replication. Now, open the Hue editor to access the replicated database(s) and table(s). Interested in gathering insights from this data? Check out the cdp-data-pipeline data visualization.

HBase

Go to Cloudera Operational Database (COD) in the PC environment, and choose the database selected during HBase replication. Now, open the Hue editor to access the replicated table(s).

Data Catalog

To ensure the PVC cluster shows up as a data lake in the PC environment's Data Catalog, please follow the instructions given in Adding CDP Private Cloud Base cluster for use in Replication Manager and Data Catalog. Once the configuration is done, Data Catalog in the PC environment will let you see the data objects available in both the PVC cluster and the PC environment.

Conclusion

Great job if you have made it this far!

Develop Once and Run Anywhere - So what did we learn? We learned that with CDP's hybrid capabilities, we are able to easily replicate and migrate data pipelines from one environment to another, whether on-premise or in the public cloud, with minimal effort. In this exercise, all the code that ran on the private cloud was successfully replicated over to the AWS environment with minimal configuration and code changes (due to the change in filesystem - HDFS > S3). A similar set of steps can be followed to migrate data pipelines to Microsoft Azure and Google Cloud Platform (GCP) environments.

De-risk Cloud Migration - When it comes to migrating to the cloud, it's usually extremely risky to migrate all workloads at the same time. As part of your migration strategy, you need to be able to divide the workloads in a logical manner and migrate them accordingly, and CDP gives you great flexibility to ensure the migration is done according to your strategy.

Please don't hesitate to add a comment if you have any questions.
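As referenced in the Spark section above, here is a small illustrative sketch of the kind of code change needed when the pipeline moves from the private cloud (HDFS) to AWS (S3). The variable name and paths are placeholders, not the actual contents of the PySpark program.

# Illustrative only -- shows the sort of one-line change needed in the PySpark program.

# Private cloud: the filesystem variable points at an HDFS directory.
# fs = "hdfs://<namenode_host>:8020/user/<your_user>/cdph"

# Public cloud: the same variable points at an S3 bucket (s3a is the Hadoop S3 connector scheme).
fs = "s3a://<your_bucket>/<subdirectory>"

# Everything downstream keeps reading from the same variable, e.g.:
# spark.read.option("header", "true").csv(f"{fs}/covid19case_rate_by_social_det.csv")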