Member since
07-10-2018
63
Posts
71
Kudos Received
0
Solutions
01-13-2021
07:09 AM
@joyabrata you have a few options: - you can review the documentation on outbound network access details here: https://docs.cloudera.com/management-console/cloud/proxy/topics/mc-whitelist-urls-environment.html - you can use automation scripts like https://github.com/paulvid/cdp-one-click/ to set up everything end to end.
... View more
12-23-2020
07:04 AM
3 Kudos
In a previous article, I realized that I had saved my flow as a flow file instead of a template, which may make it hard for some to import. So in this article, I will explain how to import a flow into NiFi Registry in a Datahub (keeping in mind that NiFi Registry is secured by default in CDP Datahub).
Step 1: Download and configure stores
Connect to one of the NiFi machines with the Cloudbreak user and the key you used at deployment:
$ ssh -i [path_to_private_key] cloudbreak@[your_nifi_host]
Next, copy and authorize the key and trust stores:
$ sudo su
$ cp /var/lib/cloudera-scm-agent/agent-cert/cm-auto-host_keystore.jks /tmp
$ cp /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks /tmp
$ chmod a+rw /tmp/cm-auto-host_keystore.jks
$ chmod a+rw /tmp/cm-auto-global_truststore.jks
Step 2: Create a registry.properties file
Go to /tmp (for instance) and create a registry.properties file with the following content:
baseUrl=https://fod-nifi-cluster-gateway0.fod-cdp.a465-9q4k.cloudera.site:18433
keystore=/tmp/cm-auto-host_keystore.jks
keystoreType=JKS
keystorePasswd=[YOUR_KEYSTORE_PWD]
keyPasswd=[YOUR_KEYSTORE_PWD]
truststore=/tmp/cm-auto-global_truststore.jks
truststoreType=JKS
truststorePasswd=[YOUR_TRUSTSTORE_PWD]
proxiedEntity=[YOUR_USER_AUTHORIZED_IN_RANGER]
Notes:
I'm not going to cover how to retrieve your keystore and truststore passwords here
It is important that you add a proxied entity with your workload user that is authorized in Ranger to use the registry (you can find your user in the CDP management console)
Next, download the flow you want to import, for instance:
wget https://raw.githubusercontent.com/paulvid/datasets/master/hybrid-demo/nifi-flow/NiFi_Flow.json
Step 3: Run the import
Run these command lines:
$ /opt/cloudera/parcels/CFM-2.0.6.0/TOOLKIT/bin/cli.sh
[Apache NiFi ASCII art banner]
CLI v1.11.4.2.0.6.0-27
Type 'help' to see a list of available commands, use tab to auto-complete.
Session loaded from /home/pvidal/.nifi-cli.config
#> registry create-bucket -p registry.properties --bucketName hybrid-cloud
dfc33699-0317-4893-82c4-8a12ad6ed822
#> registry create-flow -p registry.properties -b dfc33699-0317-4893-82c4-8a12ad6ed822 -fn hybridflow
8d2e7f87-f176-4f34-9788-72be034e4a3f
#> registry import-flow-version -p registry.properties -f 8d2e7f87-f176-4f34-9788-72be034e4a3f -i NiFi_Flow.json
1
#> exit
Step 4: Import the flow in the UI
Navigate to the NiFi UI and add a process group. Then, click Import:
Select the bucket and flow we just imported using the CLI:
After clicking Import, your flow is successfully available!
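As a sanity check, before exiting the CLI session you can also list the versions of the flow to confirm the import registered; a minimal sketch using the bucket and flow IDs created above (the flags mirror those of the import-flow-version command):
#> registry list-flow-versions -p registry.properties -f 8d2e7f87-f176-4f34-9788-72be034e4a3f
This should show version 1 pointing to the imported NiFi_Flow.json.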
... View more
12-03-2020
06:06 AM
2 Kudos
Final article of the hybrid cloud series (see the parent article here), and it's the most fun one! In this tutorial, we will learn to use Cloudera Viz to create visual apps. We will not go in depth into how to use Viz; rather, we will import an already existing app, executing this part of the tutorial flow:
Prerequisites
Complete Part 3 of the tutorial series.
Step 1: Import Viz App
Navigate to your Management console > Data Warehouse > Open Cloudera Viz:
You will be redirected to the default examples; navigate to the DATA tab:
Then, click on Default Hive VW (this is your CDW VW):
Then, click on Import Visual Artifacts:
Finally, import the viz app (link here) as follows:
After clicking on Import, you will see a new app:
Step 2: Update each map with a Mapbox token
In order for the maps to display, you will have to add a Mapbox token. I will only detail one of the three dashboard edits here, but you will need to do it for each dashboard.
For this, first create an account at https://account.mapbox.com/auth/signup/ and copy your access token:
Then, go to the visual app and click the Edit button:
From there, click the dashboard to edit:
Once clicked, click Edit:
From there, click on the map edit gear > Settings > add your Mapbox token > Save:
Step 3: Use App
Once all the dashboards have been updated with the appropriate token, launch the app:
You will then be able to have a daily updated visual of the COVID cases and the re-opening risk for all branches based on the spread of the virus:
... View more
Labels:
12-02-2020
04:23 PM
2 Kudos
Welcome to Part 3 of my article series on how to harness the hybrid cloud (see the parent article here). In this tutorial you will learn to use NiFi to:
Pull information from public APIs
Push this raw data to a secure S3 bucket using SDX integration
Create Hive tables on top of this data by connecting to CDW
This corresponds to step 3 of the series, as explained below:
Note: The anonymized NiFi flow can be found here.
Prerequisites
Complete Part 2 of this article series
A NiFi Datahub in the environment you created for Part 1 and Part 2
Step 1: Prepare your NiFi Datahub for JDBC connection to Hive in CDW
Download driver jar
Navigate to your Management console > Data Warehouse > find your virtual warehouse and download the driver jar:
Upload jar to NiFi nodes
Navigate to the Management console > your NiFi datahub > Hardware and note the public IPs of the NiFi nodes:
Using these public IPs (hidden here), upload the downloaded jar to all three nodes (see the example code below, using your workload user and password):
$ scp hive-jdbc-3.1.0-SNAPSHOT-standalone.jar [YOUR_USER]@[NODE_1]:/tmp/
$ scp hive-jdbc-3.1.0-SNAPSHOT-standalone.jar [YOUR_USER]@[NODE_2]:/tmp/
$ scp hive-jdbc-3.1.0-SNAPSHOT-standalone.jar [YOUR_USER]@[NODE_3]:/tmp/
$ ssh [YOUR_USER]@[NODE_1] chmod a+rw /tmp/hive-jdbc-3.1.0-SNAPSHOT-standalone.jar
$ ssh [YOUR_USER]@[NODE_2] chmod a+rw /tmp/hive-jdbc-3.1.0-SNAPSHOT-standalone.jar
$ ssh [YOUR_USER]@[NODE_3] chmod a+rw /tmp/hive-jdbc-3.1.0-SNAPSHOT-standalone.jar
Copy the JDBC URL
Navigate to your Management console > Data Warehouse > find your virtual warehouse and copy the JDBC URL:
You should get something along these lines: jdbc:hive2://[YOUR_CDW_HOST]/default;transportMode=http;httpPath=cliservice;ssl=true;retries=3
Step 2: Configure the NiFi flow
This tutorial requires the creation of two NiFi flows (one to map zip codes to latitude and longitude, and one to get the latest COVID case numbers):
In this tutorial, I will only detail the configuration of one of the two flows, as they are almost identical except for file/table/field names. The full example code is in this gist. Here is the overview of the US Geocode flow:
Get data from API to secure S3 using SDX
For this part, we first use a simple configuration of an InvokeHTTP processor. The remote URL called is: https://data.healthcare.gov/api/views/52wv-g36k/rows.csv?accessType=DOWNLOAD&api_foundry=true
Then, we replace the filename attribute to make sure we overwrite data:
Finally, we use a PutHDFS processor with the following parameters:
Hadoop Configuration Resources: /etc/hadoop/conf.cloudera.core_settings/core-site.xml
Kerberos Principal: [your workload user]
Kerberos Password: [your workload password]
Directory: s3a://[your env bucket]/raw/geocodes_by_county/
Conflict Resolution Strategy: replace
Drop and create tables
For both dropping and creating tables, we first use a ReplaceText processor to send the query, for example with a replacement value of:
drop TABLE if exists worldwidebank.us_geocodes_by_county;
drop TABLE if exists worldwidebank.us_geocodes_by_county_ext;
Then we use a PutHive3QL processor with default parameters. The only thing needed to make this work is the Hive3ConnectionPool, configured as follows:
Database Connection URL: [your JDBC URL]
Database User: [your workload user]
Password: [your workload password]
Step 3: Verify Data Creation
After executing both flows, navigate to Hue from CDW and look at the data, as such:
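As an aside, you can quickly confirm that the public API endpoint responds before building the flow; a simple check with curl (the URL is the same one configured in InvokeHTTP above, and only response headers are fetched):
$ curl -sI "https://data.healthcare.gov/api/views/52wv-g36k/rows.csv?accessType=DOWNLOAD&api_foundry=true"
A 200 OK response indicates the flow should be able to pull the file.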
... View more
Labels:
11-13-2020
09:41 AM
2 Kudos
Welcome to Part 2 of our harness the hybrid cloud series. In this tutorial, we will learn how to use Data Catalog, Atlas, and Ranger to profile and protect sensitive data in CDP Public Cloud, as depicted below:
Prerequisites
Complete Part 1 of the series
Step 1: Launch Data Profiling
CDP Data Catalog comes with data profilers out of the box. You can of course customize them, but in our datasets, we will use the standard data profilers.
Launch Profiler Cluster
Navigate to your CDP Management Console > Data Catalog > Select your environment > Launch Profilers:
This will launch a datahub cluster to run the data profiling spark jobs. Wait for the cluster to be built, like in the following screenshot:
Verify Profiler execution
Navigate back to your Data Catalog > Profilers > Select your env > Cluster Sensitivity Profiler, and verify that profilers have run successfully:
Check profiled data
Go to Search and find the employees Hive table:
In the employees table, go to Schema and check the automated tags created:
Step 2: Create Tag Based Policy
Navigate to Ranger
In Data Catalog, go to the Policy tab and navigate to a policy to open Ranger:
In Ranger, go to Tag based Policies:
Open the cm_tags service:
Navigate to Masking to Add a new policy:
Create Masking Rule
Configure the masking rule as depicted in the following screenshot:
Give it a name (for example, mask_creditcard)
Select the dp_credicard tag (dp prefix standing for data profiler)
Select the Group or user for which this policy should apply (here pvidal)
Select Access Type: Hive, Select
Select Masking Option: Redact
Step 3: Verify Security Rule
Go back to your management console Data Warehouse and open Hue for your virtual warehouse:
Run the following query and observe masked results: select ccnumber from worldwidebank.employees
As you observed, CDP makes it very easy to secure your data in the cloud. Next step, enrich this data with NiFi!
... View more
Labels:
11-04-2020
08:24 AM
1 Kudo
As explained in the parent article, here is the first step in our hybrid cloud implementation: replicating bank branch and employee data from an on-prem cluster to CDP Public Cloud.
Prerequisites
A CDP Base cluster (with admin rights and a user that is part of the HDFS supergroup)
A CDP Public Cloud environment (with admin rights)
A CDW virtual warehouse attached to the Public Cloud environment
Note: you can find the datasets used for this exercise here
Step 1: Register CDP Base as a classic cluster
Start registration
In your CDP Public Cloud management console, go to Classic Clusters > Add Cluster > CDH and enter your CDP Base cluster information:
You will then see your cluster registration in progress:
Install a secure communication tunnel
Click on the Files button in Install files and follow the instructions. The following are example instructions for doing this on your CM node.
Download the ssh_tunnel_setup_files.zip archive from your management console:
Copy it to your CM node:
$ scp -i [your_key_location] ssh_tunnel_setup_files.zip [your_user_with_sudo_privileges]@[your_host]:/home/[your_user_with_sudo_privileges]
SSH to the CM node and install the CCM autossh client:
$ ssh -i [your_key_location] [your_user_with_sudo_privileges]@[your_host]
$ sudo su
$ wget https://archive.cloudera.com/ccm/0.1/ccm-autossh-client-0.1-20191023211905gitd03880c.x86_64.rpm
$ yum -y --nogpgcheck localinstall ccm-autossh-client-0.1-20191023211905gitd03880c.x86_64.rpm
Install Tunnel
$ unzip ssh_tunnel_setup_files.zip
$ ./install.sh
After the installation, you should see a message like this:
==========================================================================================
SSH tunnel for CM established successfully.
Run 'ccm-tunnel status' for status
Run 'journalctl -f -u ccm-tunnel@CM.service' or 'journalctl -xe' for logs.
==========================================================================================
Finish Registration
In your Management console, click on Test Connection:
Once the connection is successful, you can click on Register, add your CM user/password, and connect:
Finally, enter the location of your base cluster (to display on the dashboard map):
You have now successfully established a secure tunnel between CDP Base and CDP Public Cloud:
Step 2: Create a Replication Policy
Navigate to Replication Manager > Classic Clusters > 3 dots on your cluster > Add Policy:
In our case, we are going to replicate 2 datasets from HDFS:
Employee data
Bank location data
In Step 1, give the policy a name and select HDFS:
In Step 2, add the location of your dataset and the name of your superuser:
In Step 3, select S3 and add your AWS credentials:
After validation, enter the target bucket (your environment cloud storage) and validate:
For the next two steps, use default settings:
After you click Create, you will see the replication policy progressing. Wait for it to complete successfully, then move on to the next step:
Step 3: Create external and managed tables in CDW
Navigate to CDW > 3 dots of your virtual warehouse > Open Hue:
In your query editor, run the following queries (adapting to your s3 path, of course):
create database if not exists worldwidebank;
use worldwidebank;
CREATE EXTERNAL TABLE if not exists worldwidebank.employees_ext(
number int,
location int,
gender string,
title string,
givenname string,
middleinitial string,
surname string,
streetaddress string,
city string,
state string,
statefull string,
zipcode string,
country string,
countryfull string,
emailaddress string,
username string,
password string,
telephonenumber string,
telephonecountrycode string,
mothersmaiden string,
birthday string,
age int,
tropicalzodiac string,
cctype string,
ccnumber string,
cvv2 string,
ccexpires string,
ssn string,
insuranceid string,
salary string,
bloodtype string,
weight double,
height int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://pvi-e2e-cdp-bucket/vizbank/raw/employees/'
tblproperties("skip.header.line.count"="1");
CREATE EXTERNAL TABLE if not exists worldwidebank.locations_ext(
LOCATION_ID int,
ADDRESS string,
BKCLASS string,
CBSA string,
CBSA_DIV string,
CBSA_DIV_FLG string,
CBSA_DIV_NO string,
CBSA_METRO string,
CBSA_METRO_FLG string,
CBSA_METRO_NAME string,
CBSA_MICRO_FLG string,
CBSA_NO string,
CERT string,
CITY string,
COUNTY string,
CSA string,
CSA_FLG string,
CSA_NO string,
ESTYMD string,
FI_UNINUM string,
MAINOFF string,
NAME string,
OFFNAME string,
OFFNUM string,
RUNDATE string,
SERVTYPE string,
STALP string,
STCNTY string,
STNAME string,
UNINUM string,
ZIP int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://pvi-e2e-cdp-bucket/vizbank/raw/locations/'
tblproperties("skip.header.line.count"="1");
create table worldwidebank.employees as select * from worldwidebank.employees_ext;
create table worldwidebank.locations as select * from worldwidebank.locations_ext;
CREATE MATERIALIZED VIEW worldwidebank.employees_per_state as select locations.stname, count(*) as num_employees from employees, locations where employees.location=locations.location_id GROUP BY locations.stname;
And that's it, you now have replicated data from your base cluster to CDP:
The next step will be to profile sensitive data to protect our employees' data.
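If you want to double-check that the replicated files actually landed in cloud storage before (or after) building the tables, you can list the bucket paths used in the LOCATION clauses above; a quick sketch, assuming the AWS CLI is configured with credentials that can read the environment bucket:
$ aws s3 ls s3://pvi-e2e-cdp-bucket/vizbank/raw/employees/
$ aws s3 ls s3://pvi-e2e-cdp-bucket/vizbank/raw/locations/
Each command should list the files copied by the replication policy.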
... View more
10-31-2020
01:35 PM
1 Kudo
I'm back with a new article series, like I did previously with news author personality recognition, beast mode quotient, and AI to edge (though this one recently got replaced with identifying Magic: The Gathering cards). In this series, I will showcase how to harness the true power of Cloudera Data Platform (CDP) hybrid cloud capabilities. Throughout the series you will learn how to use CDP Private Cloud Base, Replication Manager, CDP Public Cloud, NiFi, Kafka on Data Hub, Cloudera Data Warehouse, and Cloudera Viz.
Reminder: CDP Vision
CDP is designed to seamlessly enable you to deploy any data workloads (data collection, streaming, enrichment, engineering, serving, and AI/ML), on any infrastructure, with the latest engines, while maintaining a coherent layer of security and governance (SDX).
Case Study: Worldwide Bank
For the purpose of this article, I will use the example of a fictional bank (Worldwide Bank). Worldwide Bank is a large international bank that leverages a traditional big data architecture on-premises (CDP PvC Base) for data engineering and data warehousing over petabytes of data. With COVID-19 taking the world through unprecedented times, competition is at its highest, and the bank is accelerating its data organization through the adoption of the latest technologies and architectures, especially cloud infrastructures. Their first use case on this new technology platform is to create a visual report assessing the risk of every one of its branches as the virus spreads. The implementation of this first use case has the following critical considerations:
Speed of implementation/cloud adoption
Maintenance of data privacy/security standards
Re-use of current team skillset (i.e. portability)
Implementation Architecture
After carefully considering options, the bank selected CDP as their hybrid architecture, as it satisfies all their needs. Specifically, here is their implementation design:
This article series will guide you through these four steps:
Replicate bank branches and employee data (Replication Manager, Cloudera Manager, S3, HDFS).
Profile sensitive data and apply data protection (Data Catalog profilers, Atlas, Ranger).
Enrich data by streaming COVID statistics (NiFi).
Create interactive visual reports (Cloudera Data Warehouse, Hive LLAP, Viz).
Note: all assets for this series can be found here.
... View more
10-27-2020
02:57 PM
1 Kudo
Note: This article is all thanks to Sumit Prakash for diving into the Ranger source code and explaining to me how to do this!
When developing new plugins for Ranger, these plugins need to be able to download the corresponding Ranger policies. In a secure Ranger setup (which is the preferred way), you download them via the secure download API endpoint: /plugins/secure/policies/download/{serviceName}
This API requires authentication. In this article, I will highlight how to set up a read-only user to download a policy.
Step 1: Create a read-only user
As a Ranger administrator, go to Settings > Users/Groups/Roles:
Then, click on Add New User and create a user with the auditor role, as follows:
Step 2: Give download access to this user
In the Ranger home screen, edit the service whose policies you want to download (here Hadoop SQL, or its technical name, cm_hive):
Add the recently created user to the policy.download.auth.users configuration:
Step 3: Test the download API
For this, you can run the following curl command:
curl -Lku auditor:[password] -H "Accept: application/json" -H "Content-type:application/json" "https://[ranger_url]:6182/service/plugins/secure/policies/download/cm_hive"
You should get a response that looks like the following:
{"serviceName":"cm_hive","serviceId":5,"policyVersion":10,"policyUpdateTime":1603825207732,"policies":[{"id":7,"guid":"78892229-bea4-421f-85fd-8214e88e3c21","isEnabled":true,"version":1,"service":"cm_hive","name":"all - global","policyType":0,"policyPriority":0,"description":"Policy for all - global","isAuditEnabled":true,"resources":{"global":{"values":["*"],"isExcludes":false,"isRecursive":false}},"policyItems":[{"accesses":[{"type":"select","isAllowed":true},{"type":"update","isAllowed":true},{"type":"create","isAllowed":true},{"type":"drop","isAllowed":true},{"type":"alter","isAllowed":true},{"type":"index","isAllowed":true},{"type":"lock","isAllowed":true},{"type":"all","isAllowed":true},{"type":"read","isAllowed":true},{"type":"write","isAllowed":true},{"type":"repladmin","isAllowed":true},{"type":"serviceadmin","isAllowed":true},{"type":"tempudfadmin","isAllowed":true},{"type":"refresh","isAllowed":true}],"users":["hive","beacon","dpprofiler","hue","admin","impala"],"groups":[],"roles":[],"conditions":[],"delegateAdmin":true},{"accesses":[{"type":"read","isAllowed":true}],"users":["rangerlookup"],"groups":[],"roles":[],"conditions":[],"delegateAdmin":false}],"denyPolicyItems":[],"allowExceptions":[],"denyExceptions":[],"dataMaskPolicyItems":[],"rowFilterPolicyItems":[],"serviceType":"hive","options":{},"validitySchedules":[],"policyLabels":[],"zoneName":"","isDenyAllElse":false},{"id":8,"guid":"211690a6-6fb9-41e1-99ba-bb00a46adedb","isEnabled":true,"version":1,"service":"cm_hive","name":"all - database, table, column","policyType":0,"policyPriority":0,"description":"Policy for all - database, table, column","isAuditEnabled":true,"resources":{"database":{"values":["*"],"isExcludes":false,"isRecursive":false},"column":{"values":["*"],"isExcludes":false,"isRecursive":false},"table":{"values":["*"],"isExcludes":false,"isRecursive":false}},"policyItems":[{"accesses":[{"type":"select","isAllowed":true},{"type":"update","isAllowed":true},{"type":"create","isAllowed":true},{"type":"drop","isAllowed":true},{"type":"alter","isAllowed":true},{"type":"index","isAllowed":true},{"type":"lock","isAllowed":true},{"type":"all","isAllowed":true},{"type":"read","isAllowed":true},{"type":"write","isAllowed":true},{"type":"repladmin","isAllowed":true},{"type":"serviceadmin","isAllowed":true},{"type":"tempudfadmin","isAllowed":true},{"type":"refresh","isAllowed":true}],"users":["hive","beacon","dpprofiler","hue","admin","impala"],"groups":[],"roles":[],"conditions":[],"delegateAdmin":true},{"accesses":[{"type":"read","isAllowed":true}],"users":["rangerlookup"],"groups":[],"roles":[],"conditions":[],"delegateAdmin":false},{"accesses":[{"type":"all","isAllowed":true},{"type":"drop","isAllowed":true},{"type":"serviceadmin","isAllowed":true},{"type":"select","isAllowed":true},{"type":"read","isAllowed":true},{"type":"update","isAllowed":true},{"type":"create","isAllowed":true},{"type":"index","isAllowed":true},{"type":"lock","isAllowed":true},{"type":"refresh","isAllowed":true},{"type":"repladmin","isAllowed":true},{"type":"write","isAllowed":true},{"type":"alter","isAllowed":true}],"users":["{OWNER}"],"groups":[],"roles":[],"conditions":[],"delegateAdmin":true}],"denyPolicyItems":[],"allowExceptions":[],"denyExceptions":[],"dataMaskPolicyItems":[],"rowFilterPolicyItems":[],"serviceType":"hive","options":{},"validitySchedules":[],"policyLabels":[],"zoneName":"","isDenyAllElse":false},{"id":9,"guid":"3b6489dd-e76d-408f-bac0-c5cba4bdb2ac","isEnabled":true,"version":1,"service":"cm_hive","name":"all - database, 
table","policyType":0,"policyPriority":0,"description":"Policy for all - database, table","isAuditEnabled":true,"resources":{"database":{"values":["*"],"isExcludes":false,"isRecursive":false},"table":{"values":["*"],"isExcludes":false,"isRecursive":false}},"policyItems":[{"accesses":[{"type":"select","isAllowed":true},{"type":"update","isAllowed":true},{"type":"create","isAllowed":true},{"type":"drop","isAllowed":true},{"type":"alter","isAllowed":true},{"type":"index","isAllowed":true},{"type":"lock","isAllowed":true},{"type":"all","isAllowed":true},{"type":"read","isAllowed":true},{"type":"write","isAllowed":true},{"type":"repladmin","isAllowed":true},{"type":"serviceadmin","isAllowed":true},{"type":"tempudfadmin","isAllowed":true},{"type":"refresh","isAllowed":true}],"users":["hive","beacon","dpprofiler","hue","admin","impala"],"groups":[],"roles":[],"conditions":[],"delegateAdmin":true},{"accesses":[{"type":"read","isAllowed":true}],"users":["rang [...]
... View more
Labels:
10-20-2020
02:20 PM
1 Kudo
Cloudera Data Platform Public Cloud recently introduced the ability to back up and restore a datalake from a saved location. Specifically, the backup operation saves a full snapshot of data from all SDX services:
Atlas:
Audit events, saved in HBase tables
Lineage data, saved as Janus graph data in HBase tables
Edge, vertex, and full text indexes, saved in Solr collections
Ranger:
Audit logs, saved as a Solr collection
Permissions and tags, saved in RDBMS tables
HMS Metadata, saved in RDBMS tables
In this article, I will detail how to run backup and restore in CDP Public Cloud in AWS, via the CDP CLI.
Pre-Requisites
Stop operations that could affect backup
Make sure that no HMS-affecting operations are running (e.g., creating a table from CDW or a datahub)
Go to your Datalake Cloudera Manager, and shut down:
Atlas
Ranger
HMS
Make sure you have the proper IAM permissions
Datalake backup uses both the Ranger Audit role and the Datalake Admin role to write the backups (more details on these roles here)
Therefore, the policies attached to both IAM roles must give write permissions to the location of your backup.
Here is an example of a policy attached to the Ranger Audit Role:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "FullObjectAccessUnderAuditDir",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": "arn:aws:s3:::bckp-cdp-bucket/ranger/audit/*"
},
{
"Sid": "FullObjectAccessUnderBackupDir",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": "arn:aws:s3:::bckp-cdp-bucket/backups/*"
},
{
"Sid": "LimitedAccessToDataLakeBucket",
"Effect": "Allow",
"Action": [
"s3:AbortMultipartUpload",
"s3:ListBucket",
"s3:ListBucketMultipartUploads"
],
"Resource": "arn:aws:s3:::bckp-cdp-bucket"
}
]
}
Install and configure CDP CLI
This is fairly straightforward, and documented in your management console, under Help > Download CLI:
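If you have not used the CDP CLI before, a minimal install-and-verify sketch looks like the following, assuming installation via pip (the cdpcli package); the authoritative steps are the ones documented under Help > Download CLI:
$ pip install cdpcli
$ cdp configure
$ cdp iam get-user
cdp configure prompts for the CDP access key ID and private key generated from your user profile, and cdp iam get-user is a quick way to confirm the credentials work before running the backup commands.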
Step 1: Running back-up
Initiate backup
$ cdp datalake backup-datalake --datalake-name bckp-cdp-dl --backup-location s3a://bckp-cdp-bucket/backups/
{
"accountId": "558bc1d2-8867-4357-8524-311d51259233",
"backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
"internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, EDGE_INDEX_COLLECTION=IN_PROGRESS, DATABASE=IN_PROGRESS, FULLTEXT_INDEX_COLLECTION=IN_PROGRESS, ATLAS_JANUS_TABLE=IN_PROGRESS, RANGER_AUDITS_COLLECTION=IN_PROGRESS, VERTEX_INDEX_COLLECITON=IN_PROGRESS}",
"status": "IN_PROGRESS",
"startTime": "2020-10-20 21:11:27.821",
"endTime": "",
"backupLocation": "s3a://bckp-cdp-bucket/backups/",
"failureReason": "null"
}
Monitor backup
$ cdp datalake backup-datalake-status --datalake-name bckp-cdp-dl
{
"accountId": "558bc1d2-8867-4357-8524-311d51259233",
"backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
"userCrn": "crn:altus:iam:us-west-1:558bc1d2-8867-4357-8524-311d51259233:user:86c4e7d9-1560-4afa-ac14-794bdeec0896",
"internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, EDGE_INDEX_COLLECTION=IN_PROGRESS, DATABASE=IN_PROGRESS, FULLTEXT_INDEX_COLLECTION=IN_PROGRESS, ATLAS_JANUS_TABLE=IN_PROGRESS, RANGER_AUDITS_COLLECTION=IN_PROGRESS, VERTEX_INDEX_COLLECITON=IN_PROGRESS}",
"status": "IN_PROGRESS",
"startTime": "2020-10-20 21:11:27.821",
"endTime": "",
"backupLocation": "s3a://bckp-cdp-bucket/backups/",
"backupName": "",
"failureReason": "null"
}
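Since the backup can take a while, it is convenient to poll the status until it leaves IN_PROGRESS; a small sketch using the same status command plus jq (jq assumed to be installed):
$ while [ "$(cdp datalake backup-datalake-status --datalake-name bckp-cdp-dl | jq -r .status)" = "IN_PROGRESS" ]; do echo "backup still in progress..."; sleep 60; done
The same pattern works for restore-datalake-status in Step 2.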
Step 2: Restoring backup
Initiate restore
$ cdp datalake restore-datalake --datalake-name bckp-cdp-dl --backup-id 6c59a259-51ac-4db4-80d6-22f71f84cc4e
{
"accountId": "558bc1d2-8867-4357-8524-311d51259233",
"restoreId": "06c0bde4-cfc7-4b9e-a8e0-d9f2ddfcb5c5",
"backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
"internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, DATABASE=IN_PROGRESS, EDGE_INDEX_COLLECTION_DELETE=IN_PROGRESS, RANGER_AUDITS_COLLECTION_DELETE=IN_PROGRESS, VERTEX_INDEX_COLLECITON_DELETE=IN_PROGRESS, ATLAS_JANUS_TABLE=IN_PROGRESS, FULLTEXT_INDEX_COLLECTION_DELETE=IN_PROGRESS}",
"status": "IN_PROGRESS",
"startTime": "2020-10-20 21:15:11.757",
"endTime": "",
"backupLocation": "s3a://bckp-cdp-bucket/backups/",
"failureReason": "null"
}
Monitor restore
$ cdp datalake restore-datalake-status --datalake-name bckp-cdp-dl
{
"accountId": "558bc1d2-8867-4357-8524-311d51259233",
"restoreId": "06c0bde4-cfc7-4b9e-a8e0-d9f2ddfcb5c5",
"backupId": "6c59a259-51ac-4db4-80d6-22f71f84cc4e",
"userCrn": "crn:altus:iam:us-west-1:558bc1d2-8867-4357-8524-311d51259233:user:86c4e7d9-1560-4afa-ac14-794bdeec0896",
"internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, EDGE_INDEX_COLLECTION=SUCCESSFUL, DATABASE=SUCCESSFUL, FULLTEXT_INDEX_COLLECTION=SUCCESSFUL, EDGE_INDEX_COLLECTION_DELETE=SUCCESSFUL, VERTEX_INDEX_COLLECITON_DELETE=SUCCESSFUL, RANGER_AUDITS_COLLECTION_DELETE=SUCCESSFUL, ATLAS_JANUS_TABLE=IN_PROGRESS, RANGER_AUDITS_COLLECTION=IN_PROGRESS, VERTEX_INDEX_COLLECITON=IN_PROGRESS, FULLTEXT_INDEX_COLLECTION_DELETE=SUCCESSFUL}",
"status": "IN_PROGRESS",
"startTime": "2020-10-20 21:15:11.757",
"endTime": "",
"backupLocation": "s3a://bckp-cdp-bucket/backups/",
"failureReason": "null"
}
Note: you can also monitor these events in the CDP Control Plane:
... View more
Labels:
10-14-2020
10:21 AM
This looks like a permission issue on your Azure subscription at the time of creation. Did you manage to get involved with our partner program? We would like to help you out by sharing screen rather than finding the needle in the haystack here 🙂
... View more
10-14-2020
05:36 AM
Again, you have two issues here: 1. Making sure that your app has contributor role 2. Making sure that the identities you created with the quick start template have the right permissions If you follow the instructions I gave you (create the proper app + run the script), it should work.
... View more
10-13-2020
02:25 PM
Hi Valerio,
First, regarding the app role, I think the quick start doc page is out of date (I reported this to our doc team). You do not need to create a custom role, as long as you create your credential app like this (replace subscriptionId with your ID):
az ad sp create-for-rbac \
--name http://your-cloudbreak-app \
--role Contributor \
--scopes /subscriptions/{subscriptionId}
Secondly, did you run step 3 completely? Specifically, make sure to run this in an Azure bash shell after the quickstart deployment (replace YOUR_SUBSCRIPTION_ID and YOUR_RG with the values used in the quickstart):
export SUBSCRIPTIONID="YOUR_SUBSCRIPTION_ID"
export RESOURCEGROUPNAME="YOUR_RG"
export STORAGEACCOUNTNAME=$(az storage account list -g $RESOURCEGROUPNAME|jq '.[]|.name'| tr -d '"')
export ASSUMER_OBJECTID=$(az identity list -g $RESOURCEGROUPNAME|jq '.[]|{"name":.name,"principalId":.principalId}|select(.name | test("AssumerIdentity"))|.principalId'| tr -d '"')
export DATAACCESS_OBJECTID=$(az identity list -g $RESOURCEGROUPNAME|jq '.[]|{"name":.name,"principalId":.principalId}|select(.name | test("DataAccessIdentity"))|.principalId'| tr -d '"')
export LOGGER_OBJECTID=$(az identity list -g $RESOURCEGROUPNAME|jq '.[]|{"name":.name,"principalId":.principalId}|select(.name | test("LoggerIdentity"))|.principalId'| tr -d '"')
export RANGER_OBJECTID=$(az identity list -g $RESOURCEGROUPNAME|jq '.[]|{"name":.name,"principalId":.principalId}|select(.name | test("RangerIdentity"))|.principalId'| tr -d '"')
# Assign Managed Identity Operator role to the assumerIdentity principal at subscription scope
az role assignment create --assignee $ASSUMER_OBJECTID --role 'f1a07417-d97a-45cb-824c-7a7467783830' --scope "/subscriptions/$SUBSCRIPTIONID"
# Assign Virtual Machine Contributor role to the assumerIdentity principal at subscription scope
az role assignment create --assignee $ASSUMER_OBJECTID --role '9980e02c-c2be-4d73-94e8-173b1dc7cf3c' --scope "/subscriptions/$SUBSCRIPTIONID"
# Assign Storage Blob Data Contributor role to the loggerIdentity principal at logs filesystem scope
az role assignment create --assignee $LOGGER_OBJECTID --role 'ba92f5b4-2d11-453d-a403-e96b0029c9fe' --scope "/subscriptions/$SUBSCRIPTIONID/resourceGroups/$RESOURCEGROUPNAME/providers/Microsoft.Storage/storageAccounts/$STORAGEACCOUNTNAME/blobServices/default/containers/logs"
# Assign Storage Blob Data Owner role to the dataAccessIdentity principal at logs/data filesystem scope
az role assignment create --assignee $DATAACCESS_OBJECTID --role 'b7e6dc6d-f1e8-4753-8033-0f276bb0955b' --scope "/subscriptions/$SUBSCRIPTIONID/resourceGroups/$RESOURCEGROUPNAME/providers/Microsoft.Storage/storageAccounts/$STORAGEACCOUNTNAME/blobServices/default/containers/data"
az role assignment create --assignee $DATAACCESS_OBJECTID --role 'b7e6dc6d-f1e8-4753-8033-0f276bb0955b' --scope "/subscriptions/$SUBSCRIPTIONID/resourceGroups/$RESOURCEGROUPNAME/providers/Microsoft.Storage/storageAccounts/$STORAGEACCOUNTNAME/blobServices/default/containers/logs"
# Assign Storage Blob Data Contributor role to the rangerIdentity principal at data filesystem scope
az role assignment create --assignee $RANGER_OBJECTID --role 'ba92f5b4-2d11-453d-a403-e96b0029c9fe' --scope "/subscriptions/$SUBSCRIPTIONID/resourceGroups/$RESOURCEGROUPNAME/providers/Microsoft.Storage/storageAccounts/$STORAGEACCOUNTNAME/blobServices/default/containers/data"
Let me know if that works out for you.
... View more
10-06-2020
06:44 AM
Absolutely, we have a partner team that can work with you. More info here: https://www.cloudera.com/partners/cloudera-connect-partner-program.html
... View more
10-06-2020
05:30 AM
Hi Valerio,
There is some mapping to be done to enable your permissions. I think the best way for you to move forward is to use the resources available to you:
1. Free training, e.g. https://www.cloudera.com/about/training/courses/cloudera-essentials-for-cdp.html
2. Tutorials, e.g. https://www.cloudera.com/tutorials/cdp-how-to-create-a-data-hub.html
3. If you are a Cloudera customer, I recommend reaching out to your account team. We have CDP experts that can help you quickly rather than asynchronously.
... View more
10-05-2020
05:54 AM
Hi there, Regarding your datahub failure, it may be due to the fact that your FQDN is too long. Could you try launching a cluster with a shorter name? As for the environment not being deleted: what error are you facing when trying to delete it?
... View more
09-30-2020
06:35 AM
1 Kudo
Hola Valerio, Looking at Screenshot (34), it looks like you used the AssumerIdentity everywhere. Instead, you should use a combination of Logger/Ranger/Assumer/DataAccess identities, as detailed here: https://docs.cloudera.com/management-console/cloud/azure-quickstart/topics/mc-azure-quickstart-environment.html Could you try with the proper identity combination and see if that helps?
... View more
09-29-2020
06:31 AM
Hi Valerio,
A few things to check:
1. If you go to the datalake tab in the UI, can you access the CM UI? The logs there should tell you more.
2. This is most likely a bad combination of managed identity/storage account setup. The best way to know what's wrong is to send us screenshots of your managed identity/storage account setup in the Azure portal, plus how you reference them in the environment creation wizard in CDP.
... View more
09-28-2020
05:14 AM
1 Kudo
Ha! Good catch!
... View more
09-25-2020
05:54 AM
Do me a favor and try this:
- open a terminal session (do not use !)
- run the following commands:
chmod a+r /home/cdsw/drivers/hive/HiveJDBC41.jar
CLASSPATH=.:/home/cdsw/drivers/hive/HiveJDBC41.jar
export CLASSPATH
- close the session and try to run your Python code
... View more
09-24-2020
09:42 AM
Did you actually run the export in a terminal session, as follows?
CLASSPATH=.:/home/cdsw/drivers/HiveJDBC41.jar
export CLASSPATH
... View more
09-24-2020
07:03 AM
Hey, Have you checked this article? https://community.cloudera.com/t5/Community-Articles/How-to-connect-to-CDP-Impala-from-python/ta-p/296405
... View more
08-31-2020
02:25 PM
2 Kudos
Cloudera Data Warehouse (CDW) offers highly efficient compute isolation and rapid scale-up and scale-down of data warehousing workloads, leveraging the latest container and caching technologies.
One of the great features of this architecture is the ability to only bring compute on-demand, as illustrated by the figure below:
This default setup is the most cost-effective, as only a few shared services nodes (small nodes running services like UIs, Viz, ZooKeeper, etc.) are long-lived. Each Virtual Warehouse has a set of nodes that only run when compute is needed (i.e. a new query on a non-cached dataset).
The caveat to this approach is that on a completely cold warehouse, the warm-up time from zero to compute is a minute or two.
An alternative to this default architecture is to leverage compute-reserved nodes that are shared between virtual warehouses, as depicted below:
With this architecture, a pool of reserved nodes can be used to enable the immediate availability of compute across nodes. In this article, I will showcase how to set up reserved instances in CDW.
Note: This article is a high-level tutorial. It is not my intent to detail the behavior of how reserved nodes are shared across warehouses, or recommend generic sizing. The number of instances and the VW behavior will depend on your implementation.
Step 1: Get your Kubeconfig
In CDW, go to your environment, click on the 3 dots on the environment box > Show Kubeconfig:
Grant your ARN access to the environment, and copy/download the kubeconfig (see this article for more details).
Step 2: Connect to your cluster
$ export KUBECONFIG=[path_to_your_kubeconfig]
$ kubectl get deployments -n cluster
NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
ardent-ferret-efs-provisioner       2/2     2            2           4h46m
compute-reserved-node               0/0     0            0           4h46m
crusty-abalone-cluster-autoscaler   1/1     1            1           4h46m
nginx-default-backend               1/1     1            1           4h46m
nginx-service                       3/3     3            3           4h46m
shared-services-reserved-node       0/0     0            0           4h46m
Step 3: Modify the replicas of compute reserved nodes
kubectl edit deployment compute-reserved-node -n cluster
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2020-08-31T16:28:52Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: trendy-mastiff
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: cluster-overprovisioner
    cluster-overprovisioner-name: compute-reserved-node
    helm.sh/chart: cluster-overprovisioner-0.2.5
  name: compute-reserved-node
  namespace: cluster
  resourceVersion: "3476"
  selfLink: /apis/extensions/v1beta1/namespaces/cluster/deployments/compute-reserved-node
  uid: a5cb9ea1-729a-4665-9734-94c2f669984f
spec:
  progressDeadlineSeconds: 600
  replicas: 3
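If you prefer not to edit the manifest interactively, the same change can be made non-interactively with kubectl scale; a sketch equivalent to setting replicas: 3 above:
$ kubectl scale deployment compute-reserved-node -n cluster --replicas=3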
Step 4: Verify your config
After a few minutes, you should see your configuration being applied:
$ kubectl get deployments -n cluster
NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
ardent-ferret-efs-provisioner       2/2     2            2           4h54m
compute-reserved-node               3/3     3            3           4h54m
crusty-abalone-cluster-autoscaler   1/1     1            1           4h54m
nginx-default-backend               1/1     1            1           4h54m
nginx-service                       3/3     3            3           4h54m
shared-services-reserved-node       0/0     0            0           4h54m
... View more
Labels:
08-19-2020
01:56 PM
1 Kudo
Inspired by @sunile_manjee's article How to use K9s to fetch metrics and logs for Cloudera Data Warehouse Experience, I decided to create the same tutorial for Cloudera Data Engineering.
The process is very similar, as you can see below.
Pre-Requisites
CDE environment setup
K9s installed on your machine (instructions here)
AWS user ARN
AWS configure (CLI) pointing to your AWS env (instructions here)
Step 1: Add your AWS user to the CDE environment
In your AWS account console, go to IAM > Users > Search for your user name > copy your ARN:
In the CDE main page, go to environment details:
Then, go to access and add your ARN:
Step 2: Download the kubeconfig
In the CDE main page, click on the 3 dots, then Download Kube Config:
Step 3: Run K9s
You can now run k9s:
k9s --kubeconfig ~/Downloads/kubeconfig
Note: the path to your downloaded kubeconfig may differ, of course.
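If k9s cannot connect, a quick way to confirm that the kubeconfig and the ARN mapping from Step 1 are working is to try plain kubectl with the same file (a simple sketch, assuming kubectl is installed locally):
kubectl --kubeconfig ~/Downloads/kubeconfig get namespaces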
... View more
08-10-2020
02:14 PM
2 Kudos
Moving data from your local machine to the cloud has never been easier using NiFi site to site protocol and CDP Datahub. In this article, I will focus on how to set up a site to site communication between your local machine and CDP Cloud, without using the default Knox CDP Proxy.
This configuration assumes that you already have a local instance of NiFi (or MiNiFi) and a CDP Datahub Cluster running NiFi. If you want to learn how to use CDP Public Cloud, please visit our overview page and documentation.
This setup will be executed in 4 steps:
Step 1: Open CDP to your local IP
Step 2: Download and configure stores on your local machine
Step 3: Configure a simple site-to-site flow
Step 4: Authorize this flow in Ranger
Step 1: Open CDP to your local IP
Go to your CDP Management Console, and find your datahub (here pvn-nifi).
At the bottom of the datahub page, click on Hardware and locate one of the instances running NiFi:
Click on the instance and you will be redirected to your cloud provider (here AWS):
At the bottom of the screen, click on the security group associated with your instance, and you will be redirected to that security group config page:
Click on Edit inbound rules and add a rule opening TCP port 8443 to your local IP:
Save these changes.
Step 2: Download and configure stores on your local machine
Connect to one of the NiFi machines with the Cloudbreak user and the key you used at deployment: $ ssh -i [path_to_private_key] cloudbreak@[your_nifi_host]
Copy and authorize the key and trust stores: $ sudo su
$ cp /var/lib/cloudera-scm-agent/agent-cert/cm-auto-host_keystore.jks /tmp
$ cp /var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks /tmp
$ chmod a+rw /tmp/cm-auto-host_keystore.jks
$ chmod a+rw /tmp/cm-auto-global_truststore.jks
Disconnect from the remote machine and copy these stores: $ cd ~/Desktop
$ scp -i [path_to_private_key] cloudbreak@[your_nifi_host]:/tmp/cm-auto-host_keystore.jks cm-auto-host_keystore.jks
$ scp -i [path_to_private_key] cloudbreak@[your_nifi_host]:/tmp/cm-auto-global_truststore.jks cm-auto-global_truststore.jks
Configure your local NiFi with these stores by modifying your nifi.properties:
nifi.security.keystore=/Users/pvidal/Desktop/cm-auto-host_keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=[keystore_pw]
nifi.security.keyPasswd=[keystore_pw]
nifi.security.truststore=/Users/pvidal/Desktop/cm-auto-global_truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=[truststore_pw]
Note: To obtain the passwords for these stores, please connect with your Cloudera team.
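Before restarting NiFi, you can verify that the stores copied over intact and that the passwords you were given match, using keytool (it prompts for the store password; the paths are the ones used in nifi.properties above):
$ keytool -list -keystore /Users/pvidal/Desktop/cm-auto-host_keystore.jks -storetype JKS
$ keytool -list -keystore /Users/pvidal/Desktop/cm-auto-global_truststore.jks -storetype JKS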
Restart your local NiFi instance: nifi restart
Step 3: Configure a simple site-to-site flow
Local instance
Create a process group to host your flow (here called S2S Cloud):
In this process group, create a remote process group and configure it with the address of one of your cloud NiFi instances and the HTTP protocol:
Create a simple GenerateFlowFile processor and connect it to the remote process group:
Note: Without configuring Ranger, you will get a Forbidden warning (see Step 4).
CDP Public Instance
Create a process group to host your flow (here called Receive from on prem):
In this process group, create an input port accepting remote connections:
Finally, create a flow that takes the data and logs it:
Start your flow.
Step 4: Authorize this flow in Ranger
From the Cloudera Management console, go to Ranger and your NiFi service:
From the list of policies, create a new policy (here called s2s) that allows access to your specific process group and the site-to-site protocol (Ranger does auto-completion):
Save this policy, and go back to your local machine; you can now enable the remote process group and start sending files!
Example of successful flows
Local Flow
CDP Public Flow
... View more
06-04-2020
11:38 AM
1 Kudo
Cloudera Machine Learning (and Cloudera Data Science Workbench) is built on a very robust and flexible framework to ease integration with third parties. In this article, I decided to explore the integration of Teradata with CML via ODBC.
A few notes before I dive in:
I could have easily used JDBC via JayDeBeApi (see my previous article), but where is the fun in that?
This article can be generalized to other ODBC connections, provided the proper parameters are set up
To get a fully rounded tutorial, I will go through these 3 steps:
Step 1 (optional if you already have a Teradata instance): Setting up Teradata Vantage in AWS
Step 2: Create a custom image with Teradata driver
Step 3: Configure and run python to Teradata ODBC
Step 1: Setting up Teradata Vantage in AWS
If you want to do development testing of Teradata, you can use Teradata Vantage Developer on the AWS Market Place. I am not going to dive too much into this as this configuration may change as Teradata evolves.
Pre-Requisites
The Teradata Vantage Developer is an AWS CloudFormation template, that requires the following assets to be pre-created:
An Internet Gateway
A VPC associated with this IGW (and DNS / DNS Hostnames support)
A subnet (with a route to the IGW and Public IP creation on launch)
A placement group
Setup
Once the assets are created, go to Teradata Vantage Developer and select your configuration (make sure you select the same region as the assets you created):
From this configuration, launch the CloudFormation template and fill the parameters with the AWS assets you created:
The template will create two nodes by default. We will connect to one of them to create a test user: $ ssh ec2-user@[your_public_hostname]
ec2-user@SMP001-01:~> bteq
Teradata BTEQ 16.20.00.01 for LINUX. PID: 17205
Copyright 1984-2017, Teradata Corporation. ALL RIGHTS RESERVED.
Enter your logon or BTEQ command:
.LOGON
UserId: dbc
Password: [password_set_in_cf_template]
CREATE user test AS password=test perm = 200000000, SPOOL = 100000000;
You can then connect with your favorite SQL editor to test the connection and run table creation: CREATE TABLE "test".TEST_TABLE (
COLUMN1 VARCHAR(100),
COLUMN2 VARCHAR(100)
) ;
Step 2: Create a custom image with Teradata driver
I created the following image by following the steps in the documentation on custom engines:
#Dockerfile
FROM docker.repository.cloudera.com/cdsw/engine:11-cml-2020.04-1
WORKDIR /tmp
#The RUN commands that install an editor
#For example: RUN apt-get install myeditor
RUN apt-get autoremove unixodbc -y
RUN apt-get update -y
RUN apt-get install lib32stdc++6 -y
RUN apt-get install wget -y
RUN wget [LOCATION_OF_DRIVERS]
RUN tar -xzvf tdodbc1620__ubuntu_indep.16.20.00.87-1.tar.gz
RUN dpkg -i tdodbc1620/tdodbc1620-16.20.00.87-1.noarch.deb
RUN apt-get install -y python-pyodbc
A few notes on this image:
I removed unixodbc because I read that it was causing issues with pyodbc, but that step may not be required
You can find a built image on my dockerhub here
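If you would rather build and publish the image yourself instead of pulling mine, a minimal sketch looks like this (the repository name and tag below are placeholders for your own registry):
$ docker build -t [your_registry]/cml-teradata-odbc:1.0 .
$ docker push [your_registry]/cml-teradata-odbc:1.0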
Finally, connect to CML and add this new engine:
Step 3: Configure and run python to Teradata ODBC
Go to your workbench in a new project, and create a session with the engine you created.
Then run the following: pip3 install pyodbc
After it is installed, go to your odbc.ini file: vi /home/cdsw/.odbc.ini
Configure your file as follows:
[ODBC Data Sources]
Teradata ODBC DSN = Teradata ODBC Driver 16.20
TEST = My DSN Description
[TEST]
Driver = /opt/teradata/client/16.20/odbc_64/lib/tdataodbc_sb64.so
Description = Teradata database
DBCName = [your_teradata_hostname]
UID = test
PWD = test
Finally, you can run the following code to test: import pyodbc
import pandas as pd
conn = pyodbc.connect('DSN=TEST')
# Define Cursor
cus=conn.cursor()
query = "select * from test.TEST_TABLE;"
# Execute SQL statement to get current datetime and store result in cursor
cus.execute(query)
# Display the content of cursor
row = cus.fetchone()
print(row)
# Use Pandas to execute and retrieve results
df = pd.read_sql(query, conn)
print(df)
The output in the workbench should look something like the following:
... View more
05-29-2020
08:05 AM
2 Kudos
Cloudera Data Platform recently introduced Cluster Connectivity Manager (CCM) in tech preview, a feature that lets CDP communicate with customer workload clusters without requiring any inbound network connections to those clusters.
In this article, I will highlight how to automate the setup of CCM for both AWS and Azure including:
Cloud Network Setup
CDP Environment setup
Before you continue reading:
You can find all automation on my github (here)
Please refer to the official documentation for more details (here)
This is just an example; different deployments require different sets of parameters!
Overview
As highlighted in the above image, CCM uses reverse tunneling to communicate back to your CDP dedicated control plane.
A few things to note:
Each tenant (i.e. customer) will have their own dedicated tunnel using a dedicated Apache Mina server
The diagram above describes a setup in AWS. The Azure setup is very similar but does not require public subnets or NAT gateways.
Network Setup
Note: for both setups, you will notice that port 22 is open inbound to a specific CIDR. This is to enable SSH to the different boxes for CDP, and it can be restricted to a CIDR within the VPC.
AWS
For AWS, we will need the following setup:
One VPC
3 public networks
1 internet gateway
1 public route from a public network to IGW
3 private networks
3 private network routes to NAT gateways
2 security groups
The following is a sample code:
#!/bin/bash
display_usage() {
echo "
Usage:
$(basename "$0") [--help or -h] <prefix> <region> <sg_cidr>
Description:
Creates network assets for CDP env demployment
Arguments:
prefix: prefix of your assets
region: AWS region
sg_cidr: CIDR to open in your security group
--help or -h: displays this help"
}
# check whether user had supplied -h or --help . If yes display usage
if [[ ( $1 == "--help") || $1 == "-h" ]]
then
display_usage
exit 0
fi
# Check the numbers of arguments
if [ $# -lt 3 ]
then
echo "Not enough arguments!" >&2
display_usage
exit 1
fi
if [ $# -gt 3 ]
then
echo "Too many arguments!" >&2
display_usage
exit 1
fi
prefix=$1
region=$2
sg_cidr=$3
# 1. Creating VPC
vpc_id=$(aws ec2 create-vpc --cidr 10.10.0.0/16 | jq -r .Vpc.VpcId)
aws ec2 create-tags --resources $vpc_id --tags Key=Name,Value="$prefix-cdp-vpc" > /dev/null 2>&1
# 2. Creating public subnets
# 2.1. Subnets
public_sub_1=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.0.0/24 --availability-zone "$region"a | jq -r .Subnet.SubnetId)
public_sub_2=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.1.0/24 --availability-zone "$region"b | jq -r .Subnet.SubnetId)
public_sub_3=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.2.0/24 --availability-zone "$region"c | jq -r .Subnet.SubnetId)
aws ec2 create-tags --resources $public_sub_1 --tags Key=Name,Value="$prefix-pub-subnet-1" > /dev/null 2>&1
aws ec2 create-tags --resources $public_sub_2 --tags Key=Name,Value="$prefix-pub-subnet-2" > /dev/null 2>&1
aws ec2 create-tags --resources $public_sub_3 --tags Key=Name,Value="$prefix-pub-subnet-3" > /dev/null 2>&1
# 2.2. Internet gateway
igw_id=$(aws ec2 create-internet-gateway | jq -r .InternetGateway.InternetGatewayId)
aws ec2 create-tags --resources $igw_id --tags Key=Name,Value="$prefix-igw"
aws ec2 attach-internet-gateway --internet-gateway-id $igw_id --vpc-id $vpc_id > /dev/null 2>&1
aws ec2 modify-vpc-attribute --enable-dns-support "{\"Value\":true}" --vpc-id $vpc_id > /dev/null 2>&1
aws ec2 modify-vpc-attribute --enable-dns-hostnames "{\"Value\":true}" --vpc-id $vpc_id > /dev/null 2>&1
# 2.3. Route
route_pub=$(aws ec2 create-route-table --vpc-id $vpc_id | jq -r .RouteTable.RouteTableId)
aws ec2 create-tags --resources $route_pub --tags Key=Name,Value="$prefix-pub-route" > /dev/null 2>&1
aws ec2 create-route --route-table-id $route_pub --destination-cidr-block 0.0.0.0/0 --gateway-id $igw_id > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $public_sub_1 --route-table-id $route_pub > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $public_sub_2 --route-table-id $route_pub > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $public_sub_3 --route-table-id $route_pub > /dev/null 2>&1
# 3. Creating private subnets
# 3.1. Subnets
private_sub_1=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.160.0/19 --availability-zone "$region"a | jq -r .Subnet.SubnetId)
private_sub_2=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.192.0/19 --availability-zone "$region"b | jq -r .Subnet.SubnetId)
private_sub_3=$(aws ec2 create-subnet --vpc-id $vpc_id --cidr-block 10.10.224.0/19 --availability-zone "$region"c | jq -r .Subnet.SubnetId)
aws ec2 create-tags --resources $private_sub_1 --tags Key=Name,Value="$prefix-priv-subnet-1" > /dev/null 2>&1
aws ec2 create-tags --resources $private_sub_2 --tags Key=Name,Value="$prefix-priv-subnet-2" > /dev/null 2>&1
aws ec2 create-tags --resources $private_sub_3 --tags Key=Name,Value="$prefix-priv-subnet-3" > /dev/null 2>&1
# 3.1. NAT gateways
alloc_id_1=$(aws ec2 allocate-address --domain vpc | jq -r .AllocationId)
alloc_id_2=$(aws ec2 allocate-address --domain vpc | jq -r .AllocationId)
alloc_id_3=$(aws ec2 allocate-address --domain vpc | jq -r .AllocationId)
nat_1=$(aws ec2 create-nat-gateway --subnet-id $public_sub_1 --allocation-id $alloc_id_1 | jq -r .NatGateway.NatGatewayId)
sleep 30
nat_2=$(aws ec2 create-nat-gateway --subnet-id $public_sub_2 --allocation-id $alloc_id_2 | jq -r .NatGateway.NatGatewayId)
sleep 30
nat_3=$(aws ec2 create-nat-gateway --subnet-id $public_sub_3 --allocation-id $alloc_id_3 | jq -r .NatGateway.NatGatewayId)
sleep 30
# 3.2. Routes
route_priv_1=$(aws ec2 create-route-table --vpc-id $vpc_id | jq -r .RouteTable.RouteTableId)
route_priv_2=$(aws ec2 create-route-table --vpc-id $vpc_id | jq -r .RouteTable.RouteTableId)
route_priv_3=$(aws ec2 create-route-table --vpc-id $vpc_id | jq -r .RouteTable.RouteTableId)
aws ec2 create-tags --resources $route_priv_1 --tags Key=Name,Value="$prefix-priv-route-1" > /dev/null 2>&1
aws ec2 create-tags --resources $route_priv_2 --tags Key=Name,Value="$prefix-priv-route-2" > /dev/null 2>&1
aws ec2 create-tags --resources $route_priv_3 --tags Key=Name,Value="$prefix-priv-route-3" > /dev/null 2>&1
aws ec2 create-route --route-table-id $route_priv_1 --destination-cidr-block 0.0.0.0/0 --nat-gateway-id $nat_1 > /dev/null 2>&1
aws ec2 create-route --route-table-id $route_priv_2 --destination-cidr-block 0.0.0.0/0 --nat-gateway-id $nat_2 > /dev/null 2>&1
aws ec2 create-route --route-table-id $route_priv_3 --destination-cidr-block 0.0.0.0/0 --nat-gateway-id $nat_3 > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $private_sub_1 --route-table-id $route_priv_1 > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $private_sub_2 --route-table-id $route_priv_2 > /dev/null 2>&1
aws ec2 associate-route-table --subnet-id $private_sub_3 --route-table-id $route_priv_3 > /dev/null 2>&1
# 4. VPC endpoints
s3_endpoint=$(aws ec2 create-vpc-endpoint --vpc-id $vpc_id --service-name com.amazonaws.${region}.s3 | jq -r .VpcEndpoint.VpcEndpointId)
dynamo_endpoint=$(aws ec2 create-vpc-endpoint --vpc-id $vpc_id --service-name com.amazonaws.${region}.dynamodb | jq -r .VpcEndpoint.VpcEndpointId)
aws ec2 modify-vpc-endpoint --vpc-endpoint-id $s3_endpoint --add-route-table-ids $route_pub $route_priv_1 $route_priv_2 $route_priv_3 > /dev/null 2>&1
aws ec2 modify-vpc-endpoint --vpc-endpoint-id $dynamo_endpoint --add-route-table-ids $route_pub $route_priv_1 $route_priv_2 $route_priv_3 > /dev/null 2>&1
# 5. Security groups
knox_sg_id=$(aws ec2 create-security-group --description "AWS CDP Knox security group" --group-name "$prefix-knox-sg" --vpc-id $vpc_id | jq -r .GroupId)
aws ec2 create-tags --resources $knox_sg_id --tags Key=Name,Value="$prefix-knox-sg" > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $knox_sg_id --protocol tcp --port 22 --cidr $sg_cidr
aws ec2 authorize-security-group-ingress --group-id $knox_sg_id --protocol tcp --port 0-65535 --cidr 10.10.0.0/16 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $knox_sg_id --protocol udp --port 0-65535 --cidr 10.10.0.0/16 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $knox_sg_id --protocol tcp --port 0-65535 --cidr 10.10.224.0/19 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $knox_sg_id --protocol udp --port 0-65535 --cidr 10.10.224.0/19 > /dev/null 2>&1
default_sg_id=$(aws ec2 create-security-group --description "AWS default security group" --group-name "$prefix-default-sg" --vpc-id $vpc_id | jq -r .GroupId)
aws ec2 create-tags --resources $default_sg_id --tags Key=Name,Value="$prefix-default-sg"
aws ec2 authorize-security-group-ingress --group-id $default_sg_id --protocol tcp --port 0-65535 --cidr 10.10.0.0/16 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $default_sg_id --protocol udp --port 0-65535 --cidr 10.10.0.0/16 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $default_sg_id --protocol tcp --port 0-65535 --cidr 10.10.224.0/19 > /dev/null 2>&1
aws ec2 authorize-security-group-ingress --group-id $default_sg_id --protocol udp --port 0-65535 --cidr 10.10.224.0/19 > /dev/null 2>&1
echo "{\"VpcId\": \"$vpc_id\",
\"InternetGatewayId\": \"$igw_id\",
\"PublicSubnets\": [\"$public_sub_1\", \"$public_sub_2\", \"$public_sub_3\"],
\"PublicRouteTableId\": \"$route_pub\",
\"PublicNatGatewayIds\": [\"$nat_1\", \"$nat_2\", \"$nat_3\"],
\"PrivateSubnets\": [\"$private_sub_1\", \"$private_sub_2\", \"$private_sub_3\"],
\"PrivateRouteTableIds\": [\"$route_priv_1\", \"$route_priv_2\", \"$route_priv_3\"],
\"VPCEndpoints\": [\"$s3_endpoint\", \"$dynamo_endpoint\"],
\"KnoxGroupId\": \"$knox_sg_id\" ,
\"DefaultGroupId\": \"$default_sg_id\"}"
Azure
For Azure, we will need the following setup:
One vnet
3 private subnets
2 security groups
The following is a sample script:
#!/bin/bash
display_usage() {
echo "
Usage:
$(basename "$0") [--help or -h] <prefix> <sg_cidr>
Description:
Creates network assets for CDP env deployment
Arguments:
prefix: prefix of your assets
sg_cidr: CIDR to open in your security group
--help or -h: displays this help"
}
# Check whether the user supplied -h or --help; if yes, display usage
if [[ ( $1 == "--help") || $1 == "-h" ]]
then
display_usage
exit 0
fi
# Check the number of arguments
if [ $# -lt 2 ]
then
echo "Not enough arguments!" >&2
display_usage
exit 1
fi
if [ $# -gt 2 ]
then
echo "Too many arguments!" >&2
display_usage
exit 1
fi
prefix=$1
sg_cidr=$2
# 1. Vnets and subnets
az network vnet create -g $prefix-cdp-rg --name $prefix-cdp-vnet --address-prefix 10.10.0.0/16
az network vnet subnet create -g $prefix-cdp-rg --vnet-name $prefix-cdp-vnet -n $prefix-priv-subnet-1 --address-prefixes 10.10.160.0/19
az network vnet subnet create -g $prefix-cdp-rg --vnet-name $prefix-cdp-vnet -n $prefix-priv-subnet-2 --address-prefixes 10.10.192.0/19
az network vnet subnet create -g $prefix-cdp-rg --vnet-name $prefix-cdp-vnet -n $prefix-priv-subnet-3 --address-prefixes 10.10.224.0/19
az network vnet subnet update -n $prefix-priv-subnet-1 --vnet-name $prefix-cdp-vnet -g $prefix-cdp-rg --service-endpoints "Microsoft.Sql" "Microsoft.Storage"
az network vnet subnet update -n $prefix-priv-subnet-2 --vnet-name $prefix-cdp-vnet -g $prefix-cdp-rg --service-endpoints "Microsoft.Sql" "Microsoft.Storage"
az network vnet subnet update -n $prefix-priv-subnet-3 --vnet-name $prefix-cdp-vnet -g $prefix-cdp-rg --service-endpoints "Microsoft.Sql" "Microsoft.Storage"
# 2. NSG
az network nsg create -g $prefix-cdp-rg -n $prefix-knox-nsg
az network nsg create -g $prefix-cdp-rg -n $prefix-default-nsg
az network nsg rule create -g $prefix-cdp-rg --nsg-name $prefix-knox-nsg -n ssh_cidr --priority 102 --source-address-prefixes "$sg_cidr" --destination-address-prefixes '*' --destination-port-ranges 22 --direction Inbound --access Allow --protocol Tcp --description "Allow SSH to boxes from CIDR."
az network nsg rule create -g $prefix-cdp-rg --nsg-name $prefix-knox-nsg -n outbound --priority 107 --source-address-prefixes '*' --destination-address-prefixes '*' --destination-port-ranges '*' --direction Outbound --access Allow --protocol '*' --description "Allow outbound access."
az network nsg rule create -g $prefix-cdp-rg --nsg-name $prefix-default-nsg -n outbound --priority 107 --source-address-prefixes '*' --destination-address-prefixes '*' --destination-port-ranges '*' --direction Outbound --access Allow --protocol '*' --description "Allow outbound access."
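As with the AWS script, here is a minimal usage sketch (the script name, resource group location, and CIDR are examples; the $prefix-cdp-rg resource group must exist before the script runs):
# Create the resource group if it does not exist yet (the location is just an example)
az group create --name fod-cdp-rg --location westus2
# Create the vnet, subnets, and NSGs with prefix "fod"
chmod +x azure-cdp-network.sh
./azure-cdp-network.sh fod [your_ip]/32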
CDP setup
Note: This setup uses the CDP CLI; you could also select these networks directly from the UI. It also assumes that a proper IAM setup and storage have already been created, which can also be automated with the scripts in my GitHub repository.
AWS
cdp environments create-aws-environment --environment-name ${prefix}-cdp-env \
--credential-name ${credential} \
--region ${region} \
--security-access securityGroupIdForKnox="${knox_sg_id}",defaultSecurityGroupId="${default_sg_id}" \
--authentication publicKeyId="${key}" \
--log-storage storageLocationBase="s3a://${prefix}-cdp-bucket",instanceProfile="arn:aws:iam::$AWS_ACCOUNT_ID:instance-profile/${prefix}-log-role" \
--subnet-ids "${pub_sub_1}" "${pub_sub_2}" "${pub_sub_3}" "${priv_sub_1}" "${priv_sub_2}" "${priv_sub_3}" \
--vpc-id "${vpc}" \
--s3-guard-table-name ${prefix}-cdp-table \
--enable-tunnel \
--tags key="enddate",value="${END_DATE}" key="project",value="${PROJECT}"
Azure
cdp environments create-azure-environment --environment-name ${prefix}-cdp-env \
--credential-name ${credential} \
--region "${region}" \
--public-key "${key}" \
--security-access securityGroupIdForKnox="$knox_nsg",defaultSecurityGroupId="$default_nsg" \
--log-storage storageLocationBase="abfs://logs@${prefix}cdpsa.dfs.core.windows.net",managedIdentity="/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${prefix}-cdp-rg/providers/Microsoft.ManagedIdentity/userAssignedIdentities/loggerIdentity" \
--existing-network-params networkId="$network_id",resourceGroupName="$prefix-cdp-rg",subnetIds="$subnet_1","$subnet_2","$subnet_3" \
--tags key="enddate",value="${END_DATE}" key="project",value="${PROJECT}" \
--no-use-public-ip \
--enable-tunnel
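Both commands return quickly while the environment is then created asynchronously. One way to keep an eye on progress from the CLI (a sketch assuming jq is available and the standard CDP CLI JSON output layout) is:
# Poll the environment status until it reaches AVAILABLE (or a failure state)
cdp environments describe-environment --environment-name ${prefix}-cdp-env \
  | jq -r .environment.status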
... View more
05-21-2020
06:24 AM
1 Kudo
Continuing my series of how-to articles for CDP, today we explore how to connect to Impala via JDBC from Python. In my example, I will use a Jupyter notebook running in CML, but this can be generalized.
This process is actually fairly easy, so let's dive in.
Step 1: Setup Impala JDBC drivers
First, download the latest Impala JDBC drivers from Cloudera JDBC Driver 2.6.17 for Impala.
Then, upload them to your machine. Here is an example of a CML Jupyter session with the jars uploaded:
Finally, make sure that you set up your CLASSPATH properly by opening a terminal session and typing the following: CLASSPATH=.:/home/cdsw/ImpalaJDBC4.jar:/home/cdsw/ImpalaJDBC41.jar:/home/cdsw/ImpalaJDBC42.jar
export CLASSPATH
Step 2: Install JayDeBeApi
To install JayDeBeApi, run the following: pip3 install JayDeBeApi
To avoid an error along the lines of "AttributeError: type object 'java.sql.Types' has no attribute '__javaclass__'", it is recommended to downgrade jpype by running the following: pip3 install --upgrade jpype1==0.6.3 --user
Restart your kernel after performing the downgrade.
Step 3: Connect to Impala
Finally, connect to your impala, using the following sample code: import jaydebeapi
conn = jaydebeapi.connect("com.cloudera.impala.jdbc.DataSource",
"jdbc:impala://[your_host]:443/;ssl=1;transportMode=http;httpPath=icml-data-mart/cdp-proxy-api/impala;AuthMech=3;",
{'UID': "[your_cdp_user]", 'PWD': "[your_workload_pwd]"},
'/home/cdsw/ImpalaJDBC41.jar')
curs = conn.cursor()
curs.execute("select * from default.locations")
curs.fetchall()
curs.close()
conn.close()
Note: You can get your Impala JDBC string either from the Datahub endpoint path or from the JDBC URL in CDW.
The following is a screenshot of my code in action:
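If you prefer working with pandas, here is a minimal sketch (not part of the original example; it assumes pandas is installed in the session and is run before conn.close()):
import pandas as pd

# Re-run the query and build a DataFrame, taking column names
# from the DB-API cursor description
curs = conn.cursor()
curs.execute("select * from default.locations")
columns = [desc[0] for desc in curs.description]
df = pd.DataFrame(curs.fetchall(), columns=columns)
curs.close()
print(df.head())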
... View more
05-11-2020
06:55 AM
Hi Abdul, that's a very good question. To answer it, you need to broaden the horizon a bit beyond just Navigator and Atlas, and consider Atlas, Ranger, Sentry, and Navigator together. In short, all the functionality you have in CDH (using Navigator and Sentry) is available to you in CDP (using Atlas and Ranger). Here is a high-level mapping of the different tools across the two platforms:
Data Lineage: Navigator (CDH) -> Atlas (CDP)
Audits: Navigator (CDH) -> Ranger (CDP)
RBAC: Sentry (CDH) -> Ranger (CDP)
ABAC: not supported in CDH -> Ranger (CDP)
Of course, Atlas also gives you extra functionality that was not available to you before, such as a more scalable back end, open APIs for integration, and broad Apache community support. I recommend exploring our product pages to get a better understanding: https://www.cloudera.com/products/open-source/apache-hadoop/apache-atlas.html
... View more
05-08-2020
10:36 AM
2 Kudos
Here is a fun one: how do you connect from Python in Cloudera Machine Learning to our Kafka Datahub cluster? The documentation is pretty thorough, but it does not have an example of a Python client. That's what I'm going to highlight in this article. The good news is that since CML and Datahub run in the same network, you don't need to worry about opening the broker ports; you just need to follow these steps:
Step 1: Get and upload your FreeIPA certificate
Step 2: Find your broker hostnames
Step 3: Set up your client
Step 1: Get and upload your FreeIPA certificate
Go to your management console > your environment > Actions > Get FreeIPA Certificate:
Once downloaded, go to your CML workspace and upload your file (e.g. /home/cdsw/ca.crt).
Step 2: Find your broker hostnames
For this, go to your Kafka Datahub Cluster > CM UI > Kafka > Instances; you can find the broker hosts here:
Step 3: Set up your client
Then, open a session in CML, and use the following parameters: from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers=['<YOUR_BROKER_URL>:9093','<YOUR_BROKER_URL>:9093','<YOUR_BROKER_URL>:9093'],
security_protocol="SASL_SSL",
sasl_mechanism="PLAIN",
ssl_check_hostname=True,
ssl_cafile='/home/cdsw/ca.crt',
sasl_plain_username="<YOUR_WORKLOAD_USER>",
sasl_plain_password="<YOUR_WORKLOAD_PASSWORD>",
api_version_auto_timeout_ms=30000)
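The article stops at building the producer; as a quick smoke test, here is a minimal sketch for publishing a message (it assumes a topic named test-topic already exists and that your workload user is authorized to produce to it in Ranger):
# Send one message and block until the broker acknowledges it
future = producer.send("test-topic", value=b"hello from CML")
metadata = future.get(timeout=30)
print(metadata.topic, metadata.partition, metadata.offset)
producer.flush()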
... View more
05-05-2020
06:24 PM
5 Kudos
Recently I came across an interesting problem: how to use boto to get data from a secure bucket in a Jupyter notebook in Cloudera Machine Learning.
The missing piece: I needed my code to use the AWS permissions granted to me by IDBroker.
Since CML had already authenticated me with Kerberos, all I needed was to retrieve the credentials from IDBroker.
In this article, I will show you pseudo-code on how to get these access keys, both in bash and Python.
Note: Special thanks to @Kevin Risden to whom I owe this article and many more things.
Find your IDBroker URL
Regardless of the method, you will need to get the URL for your IDBroker host. This is done simply in the management console of your datalake. The following is an example:
Getting Access Keys in bash
After you connect to one of your cluster's nodes and kinit, run the following:
IDBROKER_DT="$(curl -s --negotiate -u: "https://[IDBROKER_HOST]:8444/gateway/dt/knoxtoken/api/v1/token")"
IDBROKER_ACCESS_TOKEN="$(echo "$IDBROKER_DT" | python -c "import json,sys; print(json.load(sys.stdin)['access_token'])")"
IDBROKER_CREDENTIAL_OUTPUT="$(curl -s -H "Authorization: Bearer $IDBROKER_ACCESS_TOKEN" "https://[IDBROKER_HOST]:8444/gateway/aws-cab/cab/api/v1/credentials")"
The credentials can be found in the $IDBROKER_CREDENTIAL_OUTPUT variable.
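For example, here is a minimal sketch of how you could export them for the AWS CLI (assuming jq is installed on the node; the JSON layout is the same one the Python example below parses):
# Extract the temporary credentials from the IDBroker response and export them
export AWS_ACCESS_KEY_ID="$(echo "$IDBROKER_CREDENTIAL_OUTPUT" | jq -r .Credentials.AccessKeyId)"
export AWS_SECRET_ACCESS_KEY="$(echo "$IDBROKER_CREDENTIAL_OUTPUT" | jq -r .Credentials.SecretAccessKey)"
export AWS_SESSION_TOKEN="$(echo "$IDBROKER_CREDENTIAL_OUTPUT" | jq -r .Credentials.SessionToken)"
# Quick test against a bucket your IDBroker mappings allow ([YOUR_BUCKET] is a placeholder)
aws s3 ls s3://[YOUR_BUCKET]/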
Getting Access Keys in Python
Before getting started, install the following libraries:
pip3 install requests requests-kerberos boto3
Then, run the following code:
import requests
from requests_kerberos import HTTPKerberosAuth

r = requests.get("https://[IDBROKER_URL]:8444/gateway/dt/knoxtoken/api/v1/token", auth=HTTPKerberosAuth())

url = "https://[IDBROKER_URL]:8444/gateway/aws-cab/cab/api/v1/credentials"
headers = {
    'Authorization': "Bearer " + r.json()['access_token'],
    'cache-control': "no-cache"
}
response = requests.request("GET", url, headers=headers)

ACCESS_KEY = response.json()['Credentials']['AccessKeyId']
SECRET_KEY = response.json()['Credentials']['SecretAccessKey']
SESSION_TOKEN = response.json()['Credentials']['SessionToken']

import boto3
client = boto3.client(
    's3',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    aws_session_token=SESSION_TOKEN,
)
You can then access your buckets via the following:
data = client.get_object(Bucket='[YOUR_BUCKET]', Key='[FILE_PATH]')
contents = data['Body'].read()
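If the object happens to be a CSV, one possible continuation (assuming pandas is available in the session and the file is UTF-8 encoded):
import io
import pandas as pd

# Parse the downloaded bytes directly into a DataFrame
df = pd.read_csv(io.BytesIO(contents))
print(df.head())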
... View more