Member since: 07-10-2018
Posts: 63
Kudos Received: 71
Solutions: 0
01-22-2022
12:59 PM
I got an "unknown CA" SSL error for line 3. How did you resolve it?
08-19-2021
06:24 AM
Hello, I am running this from the company network and I believe we have some sort of certificate for using cloudera-impala. When I copy the URL from impala_prod, it also includes a uid (which is my ID) and a password, which is a standard password (not set by me at any point). When I run the script, this is the error I receive: java.sql.SQLException: java.sql.SQLException: [Cloudera][ImpalaJDBCDriver](500170) Error occurred while setting up ALTUS Dynamic Discovery: Unable to load credentials from provider files. Do you have any ideas how I can fix this?
02-09-2021
04:57 AM
Hi @joyabrata, I think you are looking at the Data Lake tab, which is a different one. Go to the Summary tab instead, scroll down to the FreeIPA section, click Actions, and select Get FreeIPA Certificate from the drop-down menu. Hope this helps.
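If you prefer the command line, I believe the CDP CLI can also return that certificate with something along these lines (I'm going from memory, so treat the exact subcommand and flag as an assumption and double-check the current CDP CLI reference):
$ cdp environments get-root-certificate --environment-name [your-environment-name]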
01-13-2021
07:09 AM
@joyabrata, you have a few options:
- You can study the documented expectations for outbound details here: https://docs.cloudera.com/management-console/cloud/proxy/topics/mc-whitelist-urls-environment.html
- You can use automation scripts like https://github.com/paulvid/cdp-one-click/ to set everything up end to end.
12-23-2020
07:04 AM
3 Kudos
In a previous article, I saved my flow as a flow file instead of a template, which may make it hard for some to import. So in this article, I will explain how to import a flow into the NiFi Registry in a Datahub (keeping in mind that the NiFi Registry is secured by default in a CDP Datahub).
Step 1: Download and configure stores
Connect to one of the NiFi machines with the cloudbreak user and the key you used at deployment:
$ ssh -i [path_to_private_key] cloudbreak@[your_nifi_host]
Next, copy the key and trust stores and make them readable:
$ sudo su
$ cp /var/lib/cloudera-scm-agent/agent-cert/cm-auto-host_keystore.jks /tmp
$ cp /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks /tmp
$ chmod a+rw /tmp/cm-auto-host_keystore.jks
$ chmod a+rw /tmp/cm-auto-global_truststore.jks
Step 2: Create a registry.properties file
Go to /tmp (for instance) and create the following registry.properties file:
baseUrl=https://fod-nifi-cluster-gateway0.fod-cdp.a465-9q4k.cloudera.site:18433
keystore=/tmp/cm-auto-host_keystore.jks
keystoreType=JKS
keystorePasswd=[YOUR_KEYSTORE_PWD]
keyPasswd=[YOUR_KEYSTORE_PWD]
truststore=/tmp/cm-auto-global_truststore.jks
truststoreType=JKS
truststorePasswd=[YOUR_TRUSTSTORE_PWD]
proxiedEntity=[YOUR_USER_AUTHORIZED_IN_RANGER]
Notes:
- I'm not going to cover how to retrieve your keystore and truststore passwords here.
- It is important that the proxied entity is your workload user, authorized in Ranger to use the registry (you can find your user in the CDP management console).
Next, download the flow you want to import, for instance:
$ wget https://raw.githubusercontent.com/paulvid/datasets/master/hybrid-demo/nifi-flow/NiFi_Flow.json
Step 3: Run the import
Run these command lines:
$ /opt/cloudera/parcels/CFM-2.0.6.0/TOOLKIT/bin/cli.sh
(the toolkit prints the Apache NiFi ASCII banner here)
CLI v1.11.4.2.0.6.0-27
Type 'help' to see a list of available commands, use tab to auto-complete.
Session loaded from /home/pvidal/.nifi-cli.config
#> registry create-bucket -p registry.properties --bucketName hybrid-cloud
dfc33699-0317-4893-82c4-8a12ad6ed822
#> registry create-flow -p registry.properties -b dfc33699-0317-4893-82c4-8a12ad6ed822 -fn hybridflow
8d2e7f87-f176-4f34-9788-72be034e4a3f
#> registry import-flow-version -p registry.properties -f 8d2e7f87-f176-4f34-9788-72be034e4a3f -i NiFi_Flow.json
1
#> exit
Step 4: Import the flow in the UI
Navigate to the NiFi UI and add a process group. Then, click Import:
Select the bucket and flow we just imported using the CLI:
After clicking Import, your flow is available!
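As an optional sanity check, you can also list the versions stored for the flow from the toolkit CLI (a sketch reusing the same registry.properties and flow ID as above; the list-flow-versions subcommand comes from the NiFi Toolkit CLI, so verify it against your CFM version):
$ /opt/cloudera/parcels/CFM-2.0.6.0/TOOLKIT/bin/cli.sh registry list-flow-versions -p registry.properties -f 8d2e7f87-f176-4f34-9788-72be034e4a3f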
12-03-2020
06:06 AM
2 Kudos
Final article of the hybrid cloud series (see the parent article here), and it's the funnest one! In this tutorial, we will learn to use Cloudera Viz to create visual apps. We will not go in depth into how to use Viz; rather, we will import an already existing app, executing this part of the tutorial flow:
Prerequisites
Complete Part 3 of the tutorial series.
Step 1: Import Viz App
Navigate to your Management console > Data Warehouse > Open Cloudera Viz:
You will be redirected to the default examples. Navigate to the DATA tab:
Then, click Default Hive VW (this is your CDW VW):
Then, click Import Visual Artifacts:
Finally, import the viz app (link here) as follows:
After clicking Import, you will see a new app:
Step 2: Update each map with a Mapbox token
In order for the maps to display, you will have to add a Mapbox token. I will only detail the edit for one of the three dashboards here, but you will need to do it for each dashboard.
First, create an account at https://account.mapbox.com/auth/signup/ and copy your access token:
Then, go to the visual app and click the Edit button:
From there, click the dashboard to edit:
Once opened, click Edit:
From there, click the map's edit gear > Settings > add your Mapbox token > Save:
Step 3: Use App
Once all the dashboards have been updated with the appropriate token, launch the app:
You will then have a daily-updated visual of the COVID cases and each branch's reopening risk based on the spread of the virus:
12-02-2020
04:23 PM
2 Kudos
Welcome to Part 3 of my article series on how to harness the hybrid cloud (see the parent article here). In this tutorial, you will learn to use NiFi to:
- Pull information from public APIs
- Push this raw data to a secure S3 bucket using SDX integration
- Create Hive tables on top of this data by connecting to CDW
This corresponds to step 3 of the series, as explained below:
Note: The anonymized NiFi flow can be found here.
Prerequisites
- Complete Part 2 of this article series
- A NiFi Datahub in the environment you created for Part 1 and Part 2
Step 1: Prepare your NiFi Datahub for a JDBC connection to Hive in CDW
Download the driver jar
Navigate to your Management console > Data Warehouse > find your virtual warehouse and download the driver jar:
Upload the jar to the NiFi nodes
Navigate to the Management console > your NiFi datahub > Hardware and note the public IPs of the NiFi nodes:
Using these public IPs (hidden here), upload the downloaded jar to all three nodes (see the example below, using your workload user and password):
$ scp hive-jdbc-3.1.0-SNAPSHOT-standalone.jar [YOUR_USER]@[NODE_1]:/tmp/
$ scp hive-jdbc-3.1.0-SNAPSHOT-standalone.jar [YOUR_USER]@[NODE_2]:/tmp/
$ scp hive-jdbc-3.1.0-SNAPSHOT-standalone.jar [YOUR_USER]@[NODE_3]:/tmp/
$ ssh [YOUR_USER]@[NODE_1] chmod a+rw /tmp/hive-jdbc-3.1.0-SNAPSHOT-standalone.jar
$ ssh [YOUR_USER]@[NODE_2] chmod a+rw /tmp/hive-jdbc-3.1.0-SNAPSHOT-standalone.jar
$ ssh [YOUR_USER]@[NODE_3] chmod a+rw /tmp/hive-jdbc-3.1.0-SNAPSHOT-standalone.jar
Copy the JDBC URL
Navigate to your Management console > Data Warehouse > find your virtual warehouse and copy the JDBC URL. You should get something along these lines:
jdbc:hive2://[YOUR_CDW_HOST]/default;transportMode=http;httpPath=cliservice;ssl=true;retries=3
Step 2: Configure the NiFi flow
This tutorial requires the creation of two NiFi flows (one to map zip codes to latitude and longitude, and one to get the latest COVID case numbers). I will only detail the configuration of one of the two flows, as they are almost identical except for file/table/field names. The full example code is in this gist. Here is the overview of the US Geocode flow:
Get data from the API to secure S3 using SDX
For this part, we first use a simple configuration of an InvokeHTTP processor. The remote URL called is:
https://data.healthcare.gov/api/views/52wv-g36k/rows.csv?accessType=DOWNLOAD&api_foundry=true
Then, we replace the filename attribute to make sure we overwrite the data:
Finally, we use a PutHDFS with the following parameters:
Hadoop Configuration Resources: /etc/hadoop/conf.cloudera.core_settings/core-site.xml
Kerberos Principal: [your workload user]
Kerberos Password: [your workload password]
Directory: s3a://[your env bucket]/raw/geocodes_by_county/
Conflict Resolution Strategy: replace
Drop and create tables
For both dropping and creating tables, we first use a ReplaceText to build the query, for example with a replacement value of:
drop TABLE if exists worldwidebank.us_geocodes_by_county;
drop TABLE if exists worldwidebank.us_geocodes_by_county_ext;
Then we use a PutHive3QL with default parameters. The only thing needed to make this work is the Hive3ConnectionPool, configured as follows:
Database Connection URL: [your JDBC URL]
Database User: [your workload user]
Password: [your workload password]
Step 3: Verify Data Creation
After executing both flows, navigate to Hue from CDW and look at the data, as such:
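For instance, a quick sanity check in Hue (a sketch assuming the table names used in the drop statements above) could be:
select count(*) from worldwidebank.us_geocodes_by_county;
select * from worldwidebank.us_geocodes_by_county limit 10;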
11-13-2020
09:41 AM
2 Kudos
Welcome to Part 2 of our harness the hybrid cloud series. In this tutorial, we will learn how to use Data Catalog, Atlas, and Ranger to profile and protect sensitive data in CDP Public Cloud, as depicted below:
Prerequisites
Complete Part 1 of the series
Step 1: Launch Data Profiling
CDP Data Catalog comes with data profilers out of the box. You can of course customize them, but for our datasets, we will use the standard data profilers.
Launch Profiler Cluster
Navigate to your CDP Management Console > Data Catalog > Select your environment > Launch Profilers:
This will launch a datahub cluster to run the data profiling spark jobs. Wait for the cluster to be built, like in the following screenshot:
Verify Profiler execution
Navigate back to your Data Catalog > Profilers > Select your env > Cluster Sensitivity Profiler, and verify that profilers have run successfully:
Check profiled data
Go to Search and find the employees Hive table:
In the employees table, go to Schema and check the automated tags created:
Step 2: Create Tag Based Policy
Navigate to Ranger
In Data Catalog, go to the Policy tab and navigate to a policy to open Ranger:
In Ranger, go to Tag based Policies:
Open the cm_tags service:
Navigate to Masking and add a new policy:
Create Masking Rule
Configure the masking rule as depicted in the following screenshot:
Give it a name (for example, mask_creditcard)
Select the dp_credicard tag (dp prefix standing for data profiler)
Select the Group or user for which this policy should apply (here pvidal)
Select Access Type: Hive, Select
Select Masking Option: Redact
Step 3: Verify Security Rule
Go back to your management console Data Warehouse and open Hue for your virtual warehouse:
Run the following query and observe the masked results:
select ccnumber from worldwidebank.employees
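With the Redact masking option, Ranger typically replaces digits with 'n' and letters with 'x', so a 16-digit card number should come back looking roughly like this (illustrative only; the exact rendering depends on your Ranger version):
nnnnnnnnnnnnnnnn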
As you observed, CDP makes it very easy to secure your data in the cloud. Next step, enrich this data with NiFi!
11-04-2020
08:24 AM
1 Kudo
As explained in the parent article, here is the first step in our hybrid cloud implementation: replicating bank branch and employee data from an on-prem cluster to CDP Public Cloud.
Prerequisites
- A CDP Base cluster (with admin rights and a user in the HDFS supergroup)
- A CDP Public Cloud environment (with admin rights)
- A CDW virtual warehouse attached to the Public Cloud environment
Note: you can find the datasets used for this exercise here.
Step 1: Register CDP Base as a classic cluster
Start registration
In your CDP Public Cloud management console, go to Classic Clusters > Add Cluster > CDH and enter your CDP Base cluster information:
You will then see your cluster registration in progress:
Install a secure communication tunnel
Click the Files button under Install files and follow the instructions. The following are example instructions for doing this on your CM node.
Download the ssh_tunnel_setup_files.zip archive from your management console:
Copy it to your CM node:
$ scp -i [your_key_location] ssh_tunnel_setup_files.zip [your_user_with_sudo_privileges]@[your_host]:/home/[your_user_with_sudo_privileges]
SSH to the CM node and install the ccm autossh client:
$ ssh -i [your_key_location] [your_user_with_sudo_privileges]@[your_host]
$ sudo su
$ wget https://archive.cloudera.com/ccm/0.1/ccm-autossh-client-0.1-20191023211905gitd03880c.x86_64.rpm
$ yum -y --nogpgcheck localinstall ccm-autossh-client-0.1-20191023211905gitd03880c.x86_64.rpm
Install Tunnel
$ unzip ssh_tunnel_setup_files.zip
$ ./install.sh
After the installation, you should see a message like this:
==========================================================================================
SSH tunnel for CM established successfully.
Run 'ccm-tunnel status' for status
Run 'journalctl -f -u ccm-tunnel@CM.service' or 'journalctl -xe' for logs.
==========================================================================================
Finish Registration
In your Management console, click Test Connection:
Once the connection is successful, click Register, add your CM user/password, and connect:
Finally, enter the location of your base cluster (to display on the dashboard map):
You have now successfully established a secure tunnel between CDP Base and CDP Public Cloud:
Step 2: Create a Replication Policy
Navigate to Replication Manager > Classic Clusters > the 3 dots on your cluster > Add Policy:
In our case, we are going to replicate 2 datasets from HDFS:
- Employee data
- Bank location data
In Step 1, give the policy a name and select HDFS:
In Step 2, add the location of your dataset and the name of your superuser:
In Step 3, select S3 and add your AWS credentials:
After validation, enter the target bucket (your environment cloud storage) and validate:
For the next two steps, use the default settings:
After you click Create, you will see the replication policy progressing. Wait for it to complete successfully, then move on to the next step:
Step 3: Create external and managed tables in CDW
Navigate to CDW > the 3 dots of your virtual warehouse > Open Hue:
In your query editor, run the following queries (adapting them to your S3 path, of course):
create database if not exists worldwidebank;
use worldwidebank;
CREATE EXTERNAL TABLE if not exists worldwidebank.employees_ext(
number int,
location int,
gender string,
title string,
givenname string,
middleinitial string,
surname string,
streetaddress string,
city string,
state string,
statefull string,
zipcode string,
country string,
countryfull string,
emailaddress string,
username string,
password string,
telephonenumber string,
telephonecountrycode string,
mothersmaiden string,
birthday string,
age int,
tropicalzodiac string,
cctype string,
ccnumber string,
cvv2 string,
ccexpires string,
ssn string,
insuranceid string,
salary string,
bloodtype string,
weight double,
height int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://pvi-e2e-cdp-bucket/vizbank/raw/employees/'
tblproperties("skip.header.line.count"="1");
CREATE EXTERNAL TABLE if not exists worldwidebank.locations_ext(
LOCATION_ID int,
ADDRESS string,
BKCLASS string,
CBSA string,
CBSA_DIV string,
CBSA_DIV_FLG string,
CBSA_DIV_NO string,
CBSA_METRO string,
CBSA_METRO_FLG string,
CBSA_METRO_NAME string,
CBSA_MICRO_FLG string,
CBSA_NO string,
CERT string,
CITY string,
COUNTY string,
CSA string,
CSA_FLG string,
CSA_NO string,
ESTYMD string,
FI_UNINUM string,
MAINOFF string,
NAME string,
OFFNAME string,
OFFNUM string,
RUNDATE string,
SERVTYPE string,
STALP string,
STCNTY string,
STNAME string,
UNINUM string,
ZIP int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://pvi-e2e-cdp-bucket/vizbank/raw/locations/'
tblproperties("skip.header.line.count"="1");
create table worldwidebank.employees as select * from worldwidebank.employees_ext;
create table worldwidebank.locations as select * from worldwidebank.locations_ext;
CREATE MATERIALIZED VIEW worldwidebank.employees_per_state as select locations.stname, count(*) as num_employees from employees, locations where employees.location=locations.location_id GROUP BY locations.stname;
And that's it, you now have replicated data from your base cluster to CDP:
The next step will be to profile sensitive data to protect our employees' data.
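If you want a quick sanity check in Hue before moving on (a sketch using the objects created above), you can query the materialized view:
select stname, num_employees from worldwidebank.employees_per_state order by num_employees desc limit 10;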
10-31-2020
01:35 PM
1 Kudo
I'm back with a new article series, like the ones I previously did on news author personality recognition, beast mode quotient, and AI to edge (though that last one was recently replaced with identifying Magic: The Gathering cards). In this series, I will showcase how to harness the true power of Cloudera Data Platform (CDP) hybrid cloud capabilities. Throughout the series, you will learn how to use CDP Private Cloud Base, Replication Manager, CDP Public Cloud, NiFi, Kafka on Data Hub, Cloudera Data Warehouse, and Cloudera Viz.
Reminder: CDP Vision
CDP is designed to seamlessly let you deploy any data workload (data collection, streaming, enrichment, engineering, serving, and AI/ML), on any infrastructure, with the latest engines, while maintaining a coherent layer of security and governance (SDX).
Case Study: Worldwide Bank
For the purposes of this article, I will use a fictional bank (Worldwide Bank). Worldwide Bank is a large international bank that leverages a traditional on-premises big data architecture (CDP PvC Base) for data engineering and data warehousing over petabytes of data. With COVID-19 taking the world through unprecedented times and competition at its highest, the bank is accelerating its data organization through the adoption of the latest technologies and architectures, especially cloud infrastructure. Their first use case on this new platform is to create a visual report assessing the risk for every one of its branches as the virus spreads. The implementation of this first use case has the following critical considerations:
- Speed of implementation/cloud adoption
- Maintenance of data privacy/security standards
- Re-use of the current team's skillset (i.e., portability)
Implementation Architecture
After carefully considering their options, the bank selected CDP as their hybrid architecture, as it satisfies all of these needs. Specifically, here is their implementation design:
This article series will guide you through these four steps:
1. Replicate bank branch and employee data (Replication Manager, Cloudera Manager, S3, HDFS).
2. Profile sensitive data and apply data protection (Data Catalog profilers, Atlas, Ranger).
3. Enrich data by streaming COVID statistics (NiFi).
4. Create interactive visual reports (Cloudera Data Warehouse, Hive LLAP, Viz).
Note: all assets for this series can be found here.